# Deep dive into container internals

In this chapter, we will explain some of the fundamental building blocks of containers.

This will give you a solid foundation so you can:

- understand "what's going on" in complex situations,

- anticipate the behavior of containers (performance, security...) in new scenarios,

- implement your own container engine.

The last item should be done for educational purposes only!

---

## There is no container code in the Linux kernel

- If we search "container" in the Linux kernel code, we find:

  - generic code to manipulate data structures (like linked lists, etc.),
  - unrelated concepts like "ACPI containers",
  - *nothing* relevant to "our" containers!

- Containers are composed using multiple independent features.

- On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."

- Security also requires features like capabilities, seccomp, LSMs...

---

# Control groups

- Control groups provide resource *metering* and *limiting*.

- This covers:

  - "classic" compute resources like memory, CPU, I/O
  - system resources like number of processes (PID)
  - "exotic" resources like GPU VRAM, huge pages, RDMA
  - other things like device node access (`/dev`) and perf events

---

## Crowd control

- Control groups also allow grouping processes for special operations:

  - freeze (conceptually similar to a "mass-SIGSTOP/SIGCONT")
  - kill (safe mass-SIGKILL)

---

## Generalities

- Cgroups form a hierarchy (a tree)

- We can create nodes in that hierarchy

- We can associate limits to a node

- We can move a process (or multiple processes) to a leaf

- The process (or processes) will then respect these limits

- We can check the current usage of each node

- In other words: limits are optional (if we only want accounting)

- When a process is created, it is placed in its parent's groups

- The main interface is a pseudo-filesystem (typically mounted on `/sys/fs/cgroup`)

---

## Example

.small[
```bash
$ tree /sys/fs/cgroup/ -d
/sys/fs/cgroup/
├── init.scope
├── machine.slice
├── system.slice
│   ├── avahi-daemon.service
│   ├── ...
│   ├── docker-de3ee38bc8d90b7da218523004cae504a2fa821224fd49f53521d862db583fef.scope
│   ├── docker-e9e55ba69f0a4639793464972a8645cdb23ae9f60567384479a175e3226776b4.scope
│   ├── docker.service
│   ├── docker.socket
│   ├── ...
│   └── wpa_supplicant.service
└── user.slice
    └── user-1000.slice
        ├── session-1.scope
        └── user@1000.service
            ├── app.slice
            │   └── ...
            ├── init.scope
            └── session.slice
                └── ...
```
]

---

class: extra-details, deep-dive

## Cgroups v1 vs v2

- Cgroups v1 were the original implementation

  (back when Docker was created)

- Cgroups v2 are a huge refactor

  (development started in Linux 3.10, released in 4.5.)

- Cgroups v2 have a number of differences:

  - single hierarchy (instead of one tree per controller)
  - processes can only be on leaf nodes (not inner nodes)
  - and of course many improvements / refactorings

- Cgroups v2 should be the default on all modern distros!

---

class: extra-details, deep-dive

## Example of cgroup v1 hierarchy

The numbers are PIDs.

The names are the names of our nodes (arbitrarily chosen).

.small[
```bash
cpu                      memory
├── batch                ├── stateless
│   ├── cryptoscam       │   ├── 25
│   │   └── 52           │   ├── 26
│   └── ffmpeg           │   ├── 27
│       ├── 109          │   ├── 52
│       └── 88           │   ├── 109
└── realtime             │   └── 88
    ├── nginx            └── databases
    │   ├── 25               ├── 1008
    │   ├── 26               └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
```
]

---

## CPU cgroup

- Keeps track of CPU time used by a group of processes

  (this is easier and more accurate than `getrusage` and `/proc`)

- Allows setting relative weights used by the scheduler

- Allows setting maximum time usage per time period

  (e.g.
  "50ms every 100ms", which would cap the group to 50% of one CPU core)

- Allows setting reservations and caps ("utilization clamping")

  (particularly relevant for realtime processes)

---

## Checking current CPU limits

- Getting the cgroup for the current user session:
  ```bash
  cat /proc/$$/cgroup
  ```
  (on cgroups v2, it should look like `0::/user.slice/...`)

- Checking the current CPU limit:
  ```bash
  cat /sys/fs/cgroup/user.slice/.../cpu.max
  ```
  (it should look like `max 100000`)

- `max` means unlimited; `100000` means "over a period of 100000 microseconds"

  (unless specified, all cgroup time durations are in microseconds)

---

## Setting a CPU limit

- Run `top` in a terminal to view CPU usage

- In a separate terminal, burn CPU cycles with e.g.:
  ```bash
  while : ; do : ; done
  ```

- Set a 50% CPU limit for that user or session:
  ```bash
  echo 50000 > /sys/fs/cgroup/user.slice/.../cpu.max
  ```

- Notice that CPU usage goes down

  (probably to *less* than 50% since this is a limit for the whole user/session!)

---

## Removing the CPU limit

- Remember to remove the limit when you're done:
  ```bash
  echo max > /sys/fs/cgroup/user.slice/.../cpu.max
  ```

---

## Cpuset cgroup

- Pin groups to specific CPU(s)

- Features:

  - limit apps to specific CPUs (`cpuset.cpus`)
  - reserve CPUs for exclusive use (`cpuset.cpus.exclusive`)
  - assign apps to specific NUMA memory nodes (`cpuset.mems`)

- Use-cases:

  - dedicate CPUs to avoid performance loss due to cache flushes
  - improve memory performance in NUMA systems

---

## Cpuset concepts

- `cpuset.cpus` / `cpuset.mems`

  *express what we allow the cgroup to use (can be empty to allow everything)*

- `cpuset.cpus.effective` / `cpuset.mems.effective`

  *express what the cgroup can actually use after accounting for other restrictions*

- `cpuset.cpus.exclusive` / `cpuset.cpus.partition`

  *used to create "partitions" = sets of CPU(s) exclusively reserved for a cgroup*

---

## Memory cgroup: accounting

- Keeps track of pages used by each group:

  - file (read/write/mmap from block
    devices)
  - anonymous (stack, heap, anonymous mmap)
  - active (recently accessed)
  - inactive (candidate for eviction)
  - ...many other categories!

- Each page is "charged" to a single group

  (this can result in non-deterministic "charges" for shared pages, e.g. mapped files)

- To view all the counters kept by this cgroup:

  ```bash
  $ cat /sys/fs/cgroup/memory.stat
  ```

---

## Memory cgroup: limits and reservations

- Cgroups v1 allowed setting soft and hard limits

  (soft limits influenced reclaim but weren't straightforward to use)

- Cgroups v2 are way more sophisticated:

  - hard limits (`.max`)
  - thresholds triggering more evictions (`.high`)
  - thresholds triggering fewer evictions (`.low`)
  - reservations (`.min`)

- Also limits for swap and zswap

---

## Hard limits

- A cgroup can *never* exceed its hard limits

- When a cgroup tries to use more than the hard limit:

  - the kernel tries to reclaim memory (buffers, mapped files...)
  - when there is nothing to reclaim, the OOM killer is invoked

- There is a `memory.oom.group` flag to alter OOM behavior:

  - `0` (default) = kill processes one by one
  - `1` = consider the cgroup as a unit; OOM will kill it entirely

---

## Also...

- A `.peak` value is also exposed for each tracked amount

  (memory, swap, zswap)

- Write an amount to `memory.reclaim` to trigger reclaim

  (=ask the kernel to recover memory from the cgroup)

- Check memory stats per NUMA node (`memory.numa_stat`)

- And more!

---

## Block I/O cgroup

- Keep track of I/Os for each group:

  - per block device
  - read, write, and discard
  - in bytes and in operations

- Set hard limits for each counter

- Set relative weights and latency targets

---

## `io.max`

- Enforce hard limits

  (set max number of operations, of bytes read/written...)
- Each limit is per-device

- Doesn't offer performance guarantees

  (once a device is saturated, performance will degrade for everyone)

---

## `io.cost.qos`

- Try to offer latency guarantees

- Define per-device thresholds to throttle operations

  "if the 95th percentile latency of read operations on this device is above 100ms...
  ...throttle operations on this device (queue them)"

- Can also define `io.weight` for relative priorities between cgroups

- Check [this document](https://facebookmicrosites.github.io/resctl-demo-website/docs/demo_docs/setting_benchmarks/iocost/) for some details and hints

---

## Network I/O

- Cgroups v1 had net_cls and net_prio controllers

- These have been deprecated in cgroups v2:

  *There is no direct equivalent of the net_cls and net_prio controllers from cgroups version 1. Instead, support has been added to iptables(8) to allow eBPF filters that hook on cgroup v2 pathnames to make decisions about network traffic on a per-cgroup basis.*

---

## Pid

- Limit (and count) the number of processes in a cgroup

- Protects against e.g. fork bombs

---

## Devices

- We need to limit access to device nodes

- Containers should not be able to open e.g. disks and partitions directly

  (/dev/sda\*, /dev/nvme\*...)

- However, some devices are expected to be available at all times:

  /dev/tty, /dev/zero, /dev/null, /dev/random...

---

## Cgroups v1

- There used to be a special "devices" control group

- It made it easy to grant read/write/mknod permissions

  (individually for each device and each container)

- Access could be granted/revoked/viewed through a pseudo-file:
  ```bash
  echo 'c 1:3 mr' > /sys/fs/cgroup/.../devices.allow
  ```

- This file doesn't exist anymore in cgroups v2!
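---

class: extra-details

## Building a `devices.allow` rule

The `c 1:3 mr` entry written to `devices.allow` has three fields: the device type (`c` for character, `b` for block), the `major:minor` numbers, and the access letters (`r`ead, `w`rite, `m`knod).

As a minimal sketch (assuming GNU `stat`; `device_rule` is a hypothetical helper name, not a standard tool), this is one way to derive such an entry from a device node:

```bash
#!/usr/bin/env bash
# Sketch: print a cgroups v1 "devices.allow"-style rule for a device node.
# device_rule is a hypothetical helper, not part of any standard tool.
device_rule() {
  local node="$1" access="$2" type major minor
  if [ -c "$node" ]; then
    type=c            # character device
  elif [ -b "$node" ]; then
    type=b            # block device
  else
    echo "not a device node: $node" >&2
    return 1
  fi
  # GNU stat prints the major (%t) and minor (%T) numbers in hexadecimal.
  major=$((16#$(stat -c '%t' "$node")))
  minor=$((16#$(stat -c '%T' "$node")))
  echo "$type $major:$minor $access"
}

device_rule /dev/null mr
```

On a typical Linux system, `/dev/null` is character device 1:3, so this prints `c 1:3 mr`.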
---

## Cgroups v2

- Device access is controlled with eBPF programs

  (there is a special program type, [`cgroup_device`][bpf-cgroup-device], for that purpose)

- This requires writing and compiling eBPF programs (😰)

- Viewing permissions requires disassembling eBPF programs (😱)

[bpf-cgroup-device]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_DEVICE/

---

## Viewing eBPF programs

- Install bpf tools (package name `bpftool` or `bpf`)

- View all eBPF programs attached to cgroups:
  ```bash
  sudo bpftool cgroup tree
  ```

- View eBPF programs attached to a Docker container:
  ```bash
  sudo bpftool cgroup list /sys/fs/cgroup/system.slice/docker-.scope
  ```

- Disassemble an eBPF program:
  ```bash
  sudo bpftool prog dump xlated id 
  ```

- *Bonne chance* 😬

---

## Some interesting nodes

- `/dev/net/tun` (network interface manipulation)

- `/dev/fuse` (filesystems in user space)

- `/dev/kvm` (run VMs in containers)

- `/dev/dri` (GPU)

- `/dev/ttyUSB*`, `/dev/ttyACM*` (serial devices)

- `/dev/snd/*` (sound cards)

---

## And the exotic ones...

- `rdma`: remote memory access, InfiniBand

- `dmem`: device memory (VRAM), relatively new

  (kernel 6.14, January 2025; only Intel and AMD GPUs for now)

- `hugetlb`: huge pages

- `perf_event`: [performance profiling](https://perfwiki.github.io/main/)

- `misc`: generic cgroup for other discrete resources

  (extension point to plug even more exotic resources)

---

# Namespaces

- Provide processes with their own view of the system

- Namespaces limit what you can see (and therefore, what you can use)

- These namespaces are available in modern kernels:

  - pid
  - net
  - mnt
  - uts
  - ipc
  - user
  - time
  - cgroup

  (we are going to detail them individually)

- Each process belongs to one namespace of each type

---

## Namespaces are always active

- Namespaces exist even when you don't use containers

- This is a bit similar to the UID field in UNIX processes:

  - all processes have the UID field, even if no user exists on the system

  - the field always has a value / the value is always defined
    (i.e. any process running on the system has some UID)

  - the value of the UID field is used when checking permissions
    (the UID field determines which resources the process can access)

- You can replace "UID field" with "namespace" above and it still works!

- In other words: even when you don't use containers,
  there is one namespace of each type, containing all the processes on the system

---

class: extra-details, deep-dive

## Manipulating namespaces

- Namespaces are created with two methods:

  - the `clone()` system call (used when creating new threads and processes)
  - the `unshare()` system call

- The Linux tool `unshare` allows doing that from a shell

- A new process can re-use none / all / some of the namespaces of its parent

- It is possible to "enter" a namespace with the `setns()` system call

- The Linux tool `nsenter` allows doing that from a shell

---

class: extra-details, deep-dive

## Namespaces lifecycle

- When the last process of a namespace exits, the namespace is destroyed

- All the associated resources are then removed

- Namespaces are materialized by pseudo-files in `/proc//ns`.

  ```bash
  ls -l /proc/self/ns
  ```

- It is possible to compare namespaces by checking these files

  (this helps to answer the question, "are these two processes in the same namespace?")

- It is possible to preserve a namespace by bind-mounting its pseudo-file

---

class: extra-details, deep-dive

## Namespaces can be used independently

- As mentioned in the previous slides:

  *a new process can re-use none / all / some of the namespaces of its parent*

- It's possible to create e.g.:

  - mount namespaces to have "private" `/tmp` for each user / app
  - network namespaces to isolate apps or give them a special network access

- It's possible to use namespaces without cgroups

  (and totally outside of container contexts)

---

## UTS namespace

- gethostname / sethostname

- Allows setting a custom hostname for a container

- That's (mostly) it!

- Also allows setting the NIS domain

  (if you don't know what a NIS domain is, you don't have to worry about it!)

- If you're wondering: UTS = UNIX time sharing

- This namespace was named like this because of the `struct utsname`,
  which is commonly used to obtain the machine's hostname, architecture, etc.

  (the more you know!)

---

class: extra-details, deep-dive

## Creating our first namespace

Let's use `unshare` to create a new process that will have its own UTS namespace:

```bash
$ sudo unshare --uts
```

- We have to use `sudo` for most `unshare` operations

- We indicate that we want a new uts namespace, and nothing else

- If we don't specify a program to run, a `$SHELL` is started

---

class: extra-details, deep-dive

## Demonstrating our uts namespace

In our new "container", check the hostname, change it, and check it:

```bash
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
```

In another shell, check that the machine's hostname hasn't changed:

```bash
$ hostname
nodeX
```

Exit the "container" with `exit` or `Ctrl-D`.

---

## Net namespace overview

- Each network namespace has its own private network stack

- The network stack includes:

  - network interfaces (including `lo`)
  - routing table**s** (as in `ip rule` etc.)
  - iptables chains and rules
  - sockets (as seen by `ss`, `netstat`)

- You can move a network interface from a network namespace to another:
  ```bash
  ip link set dev eth0 netns PID
  ```

---

## Net namespace typical use

- Each container is given its own network namespace

- For each network namespace (i.e. each container), a `veth` pair is created

  (two `veth` interfaces act as if they were connected with a cross-over cable)

- One `veth` is moved to the container network namespace (and renamed `eth0`)

- The other `veth` is moved to a bridge on the host (e.g.
  the `docker0` bridge)

---

class: extra-details

## Creating a network namespace

Start a new process with its own network namespace:

```bash
$ sudo unshare --net
```

See that this new network namespace is unconfigured:

```bash
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
```

---

class: extra-details

## Creating the `veth` interfaces

In another shell (on the host), create a `veth` pair:

```bash
$ sudo ip link add name in_host type veth peer name in_netns
```

Configure the host side (`in_host`):

```bash
$ sudo ip link set in_host up
$ sudo ip addr add 172.22.0.1/24 dev in_host
```

---

class: extra-details

## Moving the `veth` interface

*In the process created by `unshare`,* check the PID of our "network container":

```bash
# echo $$
533
```

*On the host*, move the other side (`in_netns`) to the network namespace:

```bash
$ sudo ip link set in_netns netns 533
```

(Make sure to update "533" with the actual PID obtained above!)

---

class: extra-details

## Basic network configuration

Let's set up `lo` (the loopback interface):

```bash
# ip link set lo up
```

Activate the `veth` interface and rename it to `eth0`:

```bash
# ip link set in_netns name eth0 up
```

---

class: extra-details

## Allocating IP address and default route

*In the process created by `unshare`,* configure the interface:

```bash
# ip addr add 172.22.0.2/24 dev eth0
# ip route add default via 172.22.0.1
```

(Make sure to update the IP addresses if necessary.)

Check that we can ping the host:

```bash
# ping 172.22.0.1
```

---

class: extra-details

## Reaching the outside world

This requires:

- enabling forwarding on the host

- adding a masquerading (SNAT) rule for traffic coming from the namespace

If Docker is running on the host, we can also add the `in_host` interface to the Docker bridge, and configure the `in_netns` interface with an IP address belonging to the subnet of the Docker bridge!

---

class: extra-details

## Cleaning up network namespaces

- Terminate the process created by `unshare` (with `exit` or `Ctrl-D`).

- Since this was the only process in the network namespace, it is destroyed.

- All the interfaces in the network namespace are destroyed.

- When a `veth` interface is destroyed, it also destroys the other half of the pair.

- So we don't have anything else to do to clean up!

---

## Docker options leveraging network namespaces

- `--net none` gives an empty network namespace to a container

  (effectively isolating it completely from the network)

- `--net host` means "do not containerize the network"

  (no network namespace is created; the container uses the host network stack)

- `--net container` means "reuse the network namespace of another container"

  (as a result, both containers share the same interfaces, routes, etc.)

---

## Mnt namespace

- Processes can have their own root fs (à la chroot)

- Processes can also have "private" mounts; this allows:

  - isolating `/tmp` (per user, per service...)

  - masking `/proc`, `/sys` (for processes that don't need them)

  - mounting remote filesystems or sensitive data,
    but making them visible only to allowed processes

- Mounts can be totally private, or shared

- For a long time, there was no easy way to "move" a mount to another namespace

- It's now possible; see [justincormack/addmount](https://github.com/justincormack/addmount) for a simple example

---

class: extra-details, deep-dive

## Setting up a private `/tmp`

Create a new mount namespace:

```bash
$ sudo unshare --mount
```

In that new namespace, mount a brand new `/tmp`:

```bash
# mount -t tmpfs none /tmp
```

Check the content of `/tmp` in the new namespace, and compare to the host.

The mount is automatically cleaned up when you exit the process.

---

## PID namespace

- Processes within a PID namespace only "see" processes in the same PID namespace

- Each PID namespace has its own numbering (starting at 1)

- When PID 1 goes away, the whole namespace is killed

  (when PID 1 goes away on a normal UNIX system, the kernel panics!)

- Those namespaces can be nested

- A process ends up having multiple PIDs (one per namespace in which it is nested)

---

class: extra-details, deep-dive

## PID namespace in action

Create a new PID namespace:

```bash
$ sudo unshare --pid --fork
```

(We need the `--fork` flag because the PID namespace is special.)

Check the process tree in the new namespace:

```bash
# ps faux
```

--

class: extra-details, deep-dive

🤔 Why do we see all the processes?!?

---

class: extra-details, deep-dive

## PID namespaces and `/proc`

- Tools like `ps` rely on the `/proc` pseudo-filesystem

- Our new namespace still has access to the original `/proc`

- Therefore, it still sees host processes

- But it cannot affect them

  (try to `kill` a process: you will get `No such process`)

---

class: extra-details, deep-dive

## PID namespaces, take 2

- This can be solved by mounting `/proc` in the namespace

- The `unshare` utility provides a convenience flag, `--mount-proc`

- This flag will mount `/proc` in the namespace

- It will also unshare the mount namespace, so that this mount is local

Try it:

```bash
$ sudo unshare --pid --fork --mount-proc
# ps faux
```

---

class: extra-details

## OK, really, why do we need `--fork`?

*It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!*

The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary.
A process calling `unshare` to create new namespaces is moved to the new namespaces...
... Except for the PID namespace.
(Because this would change the current PID of the process from X to 1.)

The processes created by the new binary are placed into the new PID namespace.
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in `ENOMEM`.)

Without the `--fork` flag, the first command that we execute will be PID 1 ...
... And once it exits, we cannot create more processes in the namespace!

Check `man 2 unshare` and `man pid_namespaces` if you want more details.

---

## IPC namespace

--

- Does anybody know about IPC?

--

- Does anybody *care* about IPC?

--

- Allows a process (or group of processes) to have its own:

  - IPC semaphores
  - IPC message queues
  - IPC shared memory

  ... without risk of conflict with other instances.

- Older versions of PostgreSQL cared about this.

*No demo for that one.*

---

## User namespace

- Allows mapping UID/GID; e.g.:

  - UID 0→1999 in container C1 is mapped to UID 10000→11999 on host
  - UID 0→1999 in container C2 is mapped to UID 12000→13999 on host
  - etc.

- UID 0 in the container can still perform privileged operations in the container

  (for instance: setting up network interfaces)

- But outside of the container, it is a non-privileged user

- It also means that the UID in containers becomes unimportant

  (just use UID 0 in the container, since it gets squashed to a non-privileged user outside)

- Ultimately enables better privilege separation in container engines

---

class: extra-details, deep-dive

## User namespace challenges

- UID needs to be mapped when passed between processes or kernel subsystems

- Filesystem permissions and file ownership are more complicated

  .small[(e.g. when the same root filesystem is shared by multiple containers running with different UIDs)]

- With the Docker Engine:

  - some feature combinations are not allowed
    (e.g. user namespace + host network namespace sharing)

  - user namespaces need to be enabled/disabled globally
    (when the daemon is started)

  - container images are stored separately
    (so the first time you toggle user namespaces, you need to re-pull images)

*No demo for that one.*

---

## Time namespace

- Virtualize time

- Expose a slower/faster clock to some processes

  (for e.g. simulation purposes)

- Expose a clock offset to some processes

  (simulation, suspend/restore...)

---

## Cgroup namespace

- Virtualize access to `/proc//cgroup`

- Lets containerized processes view their relative cgroup tree

---

# Security features

- Namespaces and cgroups are not enough to ensure strong security

- We need extra mechanisms: capabilities, seccomp, LSMs

- These mechanisms were already used before containers to harden security

- They can be used together with containers

- Good container engines will automatically leverage these features

  (so that you don't have to worry about it)

---

## Capabilities

- In traditional UNIX, many operations are possible if and only if UID=0 (root)

- Some of these operations are very powerful:

  - changing file ownership, accessing all files ...

- Some of these operations deal with system configuration, but can be abused:

  - setting up network interfaces, mounting filesystems ...

- Some of these operations are not very dangerous but are needed by servers:

  - binding to a port below 1024.

- Capabilities are per-process flags to allow these operations individually

---

## Some capabilities

- `CAP_CHOWN`: arbitrarily change file ownership (UIDs and GIDs)

- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions

- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.

- `CAP_NET_BIND_SERVICE`: bind a port below 1024

See `man capabilities` for the full list and details

---

## Using capabilities

- Container engines will typically drop all "dangerous" capabilities

- You can then re-enable capabilities on a per-container basis, as needed

- With the Docker engine: `docker run --cap-add ...`

- From the shell:

  `capsh --drop=cap_net_admin --`

  `capsh --drop=all --`

---

## File capabilities

- It is also possible to give capabilities to executable files

- This is comparable to the SUID bit, but with finer grain

  (e.g., `setcap cap_net_raw+ep /bin/ping`)

- There are differences between *permitted* and *inheritable* capabilities...

  🤔

---

class: extra-details

## Capability sets

- Permitted set (=what a process could use, provided the file has the cap)

- Effective set (=what a process can actually use)

- Inheritable set (=capabilities preserved across execve calls)

- Bounding set (=system-wide limit over what can be acquired through execve / capset)

- Ambient set (=capabilities retained across execve for non-privileged users)

- Files can have *permitted*, *effective*, *inheritable* capability sets

---

## More about capabilities

- Capabilities manpage:

  https://man7.org/linux/man-pages/man7/capabilities.7.html

- Subtleties about `capsh`:

  https://sites.google.com/site/fullycapable/why-didnt-that-work

---

## Seccomp

- Seccomp stands for "secure computing".

- Achieve a high level of security by drastically restricting the available syscalls.

- Original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()`.

- The seccomp-bpf extension allows specifying custom filters with BPF rules.

- This allows filtering by syscall, and by parameter.

- BPF code can perform arbitrarily complex checks, quickly, and safely.

- Container engines take care of this so you don't have to.

---

## Linux Security Modules

- The most popular ones are SELinux and AppArmor.

- Red Hat distros generally use SELinux.

- Debian-based distros (in particular, Ubuntu) generally use AppArmor.

- LSMs add a layer of access control to all process operations.

- Container engines take care of this so you don't have to.

???

:EN:Containers internals
:EN:- Control groups (cgroups)
:EN:- Linux kernel namespaces

:FR:Fonctionnement interne des conteneurs
:FR:- Les "control groups" (cgroups)
:FR:- Les namespaces du noyau Linux