Deep dive into container internals
In this chapter, we will explain some of the fundamental building blocks of containers.
This will give you a solid foundation so you can:
- understand "what's going on" in complex situations,
- anticipate the behavior of containers (performance, security...) in new scenarios,
- implement your own container engine.
The last item should be done for educational purposes only!
There is no container code in the Linux kernel
- If we search for "container" in the Linux kernel code, we find:
  - generic code to manipulate data structures (linked lists, etc.),
  - unrelated concepts like "ACPI containers",
  - nothing relevant to "our" containers!
- Containers are composed using multiple independent features.
- On Linux, containers rely on namespaces, cgroups, and some filesystem magic.
- Security also requires features like capabilities, seccomp, LSMs...
Control groups
- Control groups provide resource metering and limiting.
- This covers:
  - "classic" compute resources like memory, CPU, I/O
  - system resources like the number of processes (PIDs)
  - "exotic" resources like GPU VRAM, huge pages, RDMA
  - other things like device node access (`/dev`) and perf events
Crowd control
- Control groups also allow grouping processes for special operations:
  - freeze (conceptually similar to a "mass-SIGSTOP/SIGCONT")
  - kill (safe mass-SIGKILL)
Generalities
- Cgroups form a hierarchy (a tree)
- We can create nodes in that hierarchy
- We can associate limits with a node
- We can move a process (or multiple processes) to a leaf
- The process (or processes) will then respect these limits
- We can check the current usage of each node
- In other words: limits are optional (if we only want accounting)
- When a process is created, it is placed in its parent's groups
- The main interface is a pseudo-filesystem (typically mounted on `/sys/fs/cgroup`)
Example
.small[
$ tree /sys/fs/cgroup/ -d
/sys/fs/cgroup/
├── init.scope
├── machine.slice
├── system.slice
│ ├── avahi-daemon.service
│ ├── ...
│ ├── docker-de3ee38bc8d90b7da218523004cae504a2fa821224fd49f53521d862db583fef.scope
│ ├── docker-e9e55ba69f0a4639793464972a8645cdb23ae9f60567384479a175e3226776b4.scope
│ ├── docker.service
│ ├── docker.socket
│ ├── ...
│ └── wpa_supplicant.service
└── user.slice
└── user-1000.slice
├── session-1.scope
└── user@1000.service
├── app.slice
│ └── ...
├── init.scope
└── session.slice
└── ...
]
class: extra-details, deep-dive
Cgroups v1 vs v2
- Cgroups v1 were the original implementation
  (back when Docker was created)
- Cgroups v2 are a huge refactor
  (development started in Linux 3.10; released in 4.5)
- Cgroups v2 have a number of differences:
  - single hierarchy (instead of one tree per controller)
  - processes can only be on leaf nodes (not inner nodes)
  - and of course many improvements / refactorings
- Cgroups v2 should be the default on all modern distros!
class: extra-details, deep-dive
Example of cgroup v1 hierarchy
The numbers are PIDs.
The names are the names of our nodes (arbitrarily chosen).
.small[
cpu memory
├── batch ├── stateless
│ ├── cryptoscam │ ├── 25
│ │ └── 52 │ ├── 26
│ └── ffmpeg │ ├── 27
│ ├── 109 │ ├── 52
│ └── 88 │ ├── 109
└── realtime │ └── 88
├── nginx └── databases
│ ├── 25 ├── 1008
│ ├── 26 └── 524
│ └── 27
├── postgres
│ └── 524
└── redis
└── 1008
]
CPU cgroup
- Keeps track of CPU time used by a group of processes
  (this is easier and more accurate than `getrusage` and `/proc`)
- Allows setting relative weights used by the scheduler
- Allows setting maximum time usage per time period
  (e.g. "50 ms every 100 ms", which would cap the group to 50% of one CPU core)
- Allows setting reservations and caps ("utilization clamping")
  (particularly relevant for realtime processes)
Checking current CPU limits
- Getting the cgroup for the current user session:
  `cat /proc/$$/cgroup`
  (it should start with `/user.slice/...`)
- Checking the current CPU limit:
  `cat /sys/fs/cgroup/user.slice/.../cpu.max`
  (it should look like `max 100000`)
- `max` means unlimited; `100000` means "over a period of 100000 microseconds"
  (unless specified otherwise, all cgroup time durations are in microseconds)
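The two checks above can be combined into a short script. This is a sketch assuming cgroups v2 mounted at `/sys/fs/cgroup`; depending on where the cpu controller is enabled, `cpu.max` may not exist at every level:

```shell
# Resolve this process's cgroup (the "0::" entry on cgroups v2)
# and print its CPU limit, if the cpu controller exposes one here.
cg=$(awk -F: '$1 == 0 {print $3}' /proc/self/cgroup)
echo "cgroup: $cg"
cat "/sys/fs/cgroup$cg/cpu.max" 2>/dev/null || echo "no cpu.max at this level"
```
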
Setting a CPU limit
- Run `top` in a terminal to view CPU usage
- In a separate terminal, burn CPU cycles with e.g.:
  `while : ; do : ; done`
- Set a 50% CPU limit for that user or session:
  `echo 50000 > /sys/fs/cgroup/user.slice/.../cpu.max`
- Notice that CPU usage goes down
  (probably to less than 50%, since this is a limit for the whole user/session!)
Removing the CPU limit
- Remember to remove the limit when you're done:
  `echo max > /sys/fs/cgroup/user.slice/.../cpu.max`
Cpuset cgroup
- Pin groups to specific CPU(s)
- Features:
  - limit apps to specific CPUs (`cpuset.cpus`)
  - reserve CPUs for exclusive use (`cpuset.cpus.exclusive`)
  - assign apps to specific NUMA memory nodes (`cpuset.mems`)
- Use-cases:
  - dedicate CPUs to avoid performance loss due to cache flushes
  - improve memory performance on NUMA systems
Cpuset concepts
- `cpuset.cpus` / `cpuset.mems` express what we allow the cgroup to use
  (can be empty to allow everything)
- `cpuset.cpus.effective` / `cpuset.mems.effective` express what the cgroup can actually use, after accounting for other restrictions
- `cpuset.cpus.exclusive` / `cpuset.cpus.partition` are used to create "partitions" = sets of CPU(s) exclusively reserved for a cgroup
Memory cgroup: accounting
- Keeps track of pages used by each group:
  - file (read/write/mmap from block devices)
  - anonymous (stack, heap, anonymous mmap)
  - active (recently accessed)
  - inactive (candidate for eviction)
  - ...many other categories!
- Each page is "charged" to a single group
  (this can result in non-deterministic "charges" for shared pages, e.g. mapped files)
- To view all the counters kept by this cgroup:
  `cat /sys/fs/cgroup/memory.stat`
Memory cgroup: limits and reservations
- Cgroups v1 allowed setting soft and hard limits
  (soft limits influenced reclaim, but they weren't straightforward to use)
- Cgroups v2 are way more sophisticated:
  - hard limits (`.max`)
  - thresholds triggering more evictions (`.high`)
  - thresholds triggering fewer evictions (`.low`)
  - reservations (`.min`)
- There are also limits for swap and zswap
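As a sketch, here is what setting these thresholds could look like. It requires root and a writable cgroups v2 hierarchy; the cgroup name `demo` is arbitrary, and the script skips itself gracefully when it can't run:

```shell
CG=/sys/fs/cgroup/demo
if [ "$(id -u)" = 0 ] && mkdir "$CG" 2>/dev/null; then
  echo $((100 * 1024 * 1024)) > "$CG/memory.max"   # hard limit: 100 MiB
  echo $(( 80 * 1024 * 1024)) > "$CG/memory.high"  # reclaim kicks in above 80 MiB
  cat "$CG/memory.max" "$CG/memory.high"
  rmdir "$CG"                                      # cleanup (cgroup must be empty)
else
  echo "skipping: need root and a writable cgroups v2 hierarchy"
fi
```
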
Hard limits
- A cgroup can never exceed its hard limits
- When a cgroup tries to use more than its hard limit:
  - the kernel tries to reclaim memory (buffers, mapped files...)
  - when there is nothing left to reclaim, the OOM killer is invoked
- There is a `memory.oom.group` flag to alter OOM behavior:
  - `0` (default) = kill processes one by one
  - `1` = consider the cgroup as a unit; OOM will kill it entirely
-
Also...
- A `.peak` value is also exposed for each tracked amount
  (memory, swap, zswap)
- Write an amount to `memory.reclaim` to trigger reclaim
  (i.e. ask the kernel to recover memory from the cgroup)
- Check memory stats per NUMA node (`memory.numa_stat`)
- And more!
Block I/O cgroup
- Keep track of I/Os for each group:
  - per block device
  - read, write, and discard
  - in bytes and in operations
- Set hard limits for each counter
- Set relative weights and latency targets
io.max
- Enforce hard limits
  (set a max number of operations, of bytes read/written...)
- Each limit is per-device
- Doesn't offer performance guarantees
  (once a device is saturated, performance will degrade for everyone)
io.cost.qos
- Try to offer latency guarantees
- Define per-device thresholds to throttle operations:
  "if the 95th percentile latency of read operations on this device is above 100 ms...
  ...throttle operations on this device (queue them)"
- Can also define `io.weight` for relative priorities between cgroups
- Check this document for some details and hints
Network I/O
- Cgroups v1 had the net_cls and net_prio controllers
- These have been deprecated in cgroups v2:
*There is no direct equivalent of the net_cls and net_prio controllers from cgroups version 1. Instead, support has been added to iptables(8) to allow eBPF filters that hook on cgroup v2 pathnames to make decisions about network traffic on a per-cgroup basis.*
Pid
- Limit (and count) the number of processes in a cgroup
- Protects against e.g. fork bombs
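A minimal sketch of the pids controller (requires root and a writable cgroups v2 hierarchy; the cgroup name `pids-demo` is arbitrary). Any process moved into this cgroup would be unable to spawn more than 10 tasks:

```shell
CG=/sys/fs/cgroup/pids-demo
if [ "$(id -u)" = 0 ] && mkdir "$CG" 2>/dev/null; then
  # Cap the cgroup at 10 processes/threads; forks beyond that fail with EAGAIN
  echo 10 > "$CG/pids.max" 2>/dev/null && cat "$CG/pids.max"
  rmdir "$CG"                                   # cleanup (cgroup must be empty)
else
  echo "skipping: need root and a writable cgroups v2 hierarchy"
fi
```
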
Devices
- We need to limit access to device nodes
- Containers should not be able to open e.g. disks and partitions directly
  (`/dev/sda*`, `/dev/nvme*`...)
- However, some devices are expected to be available at all times:
  `/dev/tty`, `/dev/zero`, `/dev/null`, `/dev/random`...
Cgroups v1
- There used to be a special "devices" control group
- It made it easy to grant read/write/mknod permissions
  (individually for each device and each container)
- Access could be granted/revoked/viewed through a pseudo-file:
  `echo 'c 1:3 mr' > /sys/fs/cgroup/.../devices.allow`
- This file doesn't exist anymore in cgroups v2!
Cgroups v2
- Device access is controlled with eBPF programs
  (there is a special program type, `cgroup_device`, for that purpose)
- This requires writing and compiling eBPF programs (😰)
- Viewing permissions requires disassembling eBPF programs (😱)
Viewing eBPF programs
- Install bpf tools (package name `bpftool` or `bpf`)
- View all eBPF programs attached to cgroups:
  `sudo bpftool cgroup tree`
- View eBPF programs attached to a Docker container:
  `sudo bpftool cgroup list /sys/fs/cgroup/system.slice/docker-<CONTAINER_ID>.scope`
- Disassemble an eBPF program:
  `sudo bpftool prog dump xlated id <ID>`
- Bonne chance 😬
Some interesting nodes
- `/dev/net/tun` (network interface manipulation)
- `/dev/fuse` (filesystems in user space)
- `/dev/kvm` (run VMs in containers)
- `/dev/dri` (GPU)
- `/dev/ttyUSB*`, `/dev/ttyACM*` (serial devices)
- `/dev/snd/*` (sound cards)
And the exotic ones...
- `rdma`: remote memory access, InfiniBand
- `dmem`: device memory (VRAM), relatively new
  (kernel 6.14, January 2025; only Intel and AMD GPUs for now)
- `hugetlb`: huge pages
- `perf_event`: performance profiling
- `misc`: generic cgroup for other discrete resources
  (extension point to plug in even more exotic resources)
Namespaces
- Provide processes with their own view of the system
- Namespaces limit what you can see (and therefore, what you can use)
- These namespaces are available in modern kernels:
  - pid
  - net
  - mnt
  - uts
  - ipc
  - user
  - time
  - cgroup
  (we are going to detail them individually)
- Each process belongs to one namespace of each type
Namespaces are always active
- Namespaces exist even when you don't use containers
- This is a bit similar to the UID field in UNIX processes:
  - all processes have the UID field, even if no user exists on the system
  - the field always has a value / the value is always defined
    (i.e. any process running on the system has some UID)
  - the value of the UID field is used when checking permissions
    (the UID field determines which resources the process can access)
- You can replace "UID field" with "namespace" above and it still works!
- In other words: even when you don't use containers,
  there is one namespace of each type, containing all the processes on the system
class: extra-details, deep-dive
Manipulating namespaces
- Namespaces are created with two methods:
  - the `clone()` system call (used when creating new threads and processes)
  - the `unshare()` system call
- The Linux tool `unshare` allows doing that from a shell
- A new process can re-use none / all / some of the namespaces of its parent
- It is possible to "enter" a namespace with the `setns()` system call
- The Linux tool `nsenter` allows doing that from a shell
class: extra-details, deep-dive
Namespaces lifecycle
- When the last process of a namespace exits, the namespace is destroyed
- All the associated resources are then removed
- Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`:
  `ls -l /proc/self/ns`
- It is possible to compare namespaces by checking these files
  (this helps answer the question, "are these two processes in the same namespace?")
- It is possible to preserve a namespace by bind-mounting its pseudo-file
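Such a comparison can be done from a shell, without privileges, by resolving the `ns` symlinks; two processes are in the same namespace when the links resolve to the same inode:

```shell
# Two ordinary processes (this shell and a child sh) share all of
# their parent's namespaces, so the pseudo-file links are identical.
a=$(readlink /proc/self/ns/uts)
b=$(sh -c 'readlink /proc/self/ns/uts')
echo "$a"                                 # something like uts:[4026531838]
[ "$a" = "$b" ] && echo "same uts namespace"
```
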
class: extra-details, deep-dive
Namespaces can be used independently
- As mentioned in the previous slides:
  a new process can re-use none / all / some of the namespaces of its parent
- It's possible to create e.g.:
  - mount namespaces to have a "private" `/tmp` for each user / app
  - network namespaces to isolate apps or give them special network access
- It's possible to use namespaces without cgroups
  (and entirely outside of container contexts)
UTS namespace
- gethostname / sethostname
- Allows setting a custom hostname for a container
- That's (mostly) it!
- Also allows setting the NIS domain
  (if you don't know what a NIS domain is, you don't have to worry about it!)
- If you're wondering: UTS = UNIX Time-Sharing
- This namespace was named after `struct utsname`,
  which is commonly used to obtain the machine's hostname, architecture, etc.
  (the more you know!)
class: extra-details, deep-dive
Creating our first namespace
Let's use unshare to create a new process that will have its own UTS namespace:
$ sudo unshare --uts
- We have to use `sudo` for most `unshare` operations
- We indicate that we want a new UTS namespace, and nothing else
- If we don't specify a program to run, a `$SHELL` is started
class: extra-details, deep-dive
Demonstrating our uts namespace
In our new "container", check the hostname, change it, and check it:
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
In another shell, check that the machine's hostname hasn't changed:
$ hostname
nodeX
Exit the "container" with exit or Ctrl-D.
Net namespace overview
- Each network namespace has its own private network stack
- The network stack includes:
  - network interfaces (including `lo`)
  - routing tables (as in `ip rule` etc.)
  - iptables chains and rules
  - sockets (as seen by `ss` or `netstat`)
- You can move a network interface from one network namespace to another:
  `ip link set dev eth0 netns PID`
Net namespace typical use
- Each container is given its own network namespace
- For each network namespace (i.e. each container), a `veth` pair is created
  (two `veth` interfaces act as if they were connected with a cross-over cable)
- One `veth` is moved to the container network namespace (and renamed `eth0`)
- The other `veth` is attached to a bridge on the host (e.g. the `docker0` bridge)
class: extra-details
Creating a network namespace
Start a new process with its own network namespace:
$ sudo unshare --net
See that this new network namespace is unconfigured:
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
class: extra-details
Creating the veth interfaces
In another shell (on the host), create a veth pair:
$ sudo ip link add name in_host type veth peer name in_netns
Configure the host side (in_host):
$ sudo ip link set in_host up
$ sudo ip addr add 172.22.0.1/24 dev in_host
class: extra-details
Moving the veth interface
In the process created by unshare, check the PID of our "network container":
# echo $$
533
On the host, move the other side (in_netns) to the network namespace:
$ sudo ip link set in_netns netns 533
(Make sure to update "533" with the actual PID obtained above!)
class: extra-details
Basic network configuration
Let's set up lo (the loopback interface):
# ip link set lo up
Activate the veth interface and rename it to eth0:
# ip link set in_netns name eth0 up
class: extra-details
Allocating IP address and default route
In the process created by unshare, configure the interface:
# ip addr add 172.22.0.2/24 dev eth0
# ip route add default via 172.22.0.1
(Make sure to update the IP addresses if necessary.)
Check that we can ping the host:
# ping 172.22.0.1
class: extra-details
Reaching the outside world
This requires two things:
- enabling forwarding on the host
- adding a masquerading (SNAT) rule for traffic coming from the namespace
If Docker is running on the host, we can also add the in_host interface
to the Docker bridge, and configure the in_netns interface with an
IP address belonging to the subnet of the Docker bridge!
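A hedged sketch of those two steps, assuming the `in_host` interface and the 172.22.0.0/24 subnet from the previous pages (requires root and iptables; the script skips itself otherwise):

```shell
if [ "$(id -u)" = 0 ] && command -v iptables >/dev/null \
   && [ -w /proc/sys/net/ipv4/ip_forward ]; then
  # 1. Enable IP forwarding on the host
  echo 1 > /proc/sys/net/ipv4/ip_forward
  # 2. Masquerade traffic from the namespace's subnet going out via the host
  iptables -t nat -A POSTROUTING -s 172.22.0.0/24 ! -o in_host -j MASQUERADE \
    || echo "iptables rule not applied"
else
  echo "skipping: need root, iptables, and a writable /proc/sys"
fi
```
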
class: extra-details
Cleaning up network namespaces
- Terminate the process created by `unshare` (with `exit` or `Ctrl-D`)
- Since this was the only process in the network namespace, the namespace is destroyed
- All the interfaces in the network namespace are destroyed
- When a `veth` interface is destroyed, it also destroys the other half of the pair
- So we don't have anything else to do to clean up!
Docker options leveraging network namespaces
- `--net none` gives an empty network namespace to a container
  (effectively isolating it completely from the network)
- `--net host` means "do not containerize the network"
  (no network namespace is created; the container uses the host network stack)
- `--net container` means "reuse the network namespace of another container"
  (as a result, both containers share the same interfaces, routes, etc.)
Mnt namespace
- Processes can have their own root fs (à la chroot)
- Processes can also have "private" mounts; this allows:
  - isolating `/tmp` (per user, per service...)
  - masking `/proc`, `/sys` (for processes that don't need them)
  - mounting remote filesystems or sensitive data,
    but making them visible only to allowed processes
- Mounts can be totally private, or shared
- For a long time, there was no easy way to "move" a mount to another namespace
- It's now possible; see justincormack/addmount for a simple example
class: extra-details, deep-dive
Setting up a private /tmp
Create a new mount namespace:
$ sudo unshare --mount
In that new namespace, mount a brand new /tmp:
# mount -t tmpfs none /tmp
Check the content of /tmp in the new namespace, and compare to the host.
The mount is automatically cleaned up when you exit the process.
PID namespace
- Processes within a PID namespace only "see" processes in the same PID namespace
- Each PID namespace has its own numbering (starting at 1)
- When PID 1 goes away, the whole namespace is killed
  (when PID 1 goes away on a normal UNIX system, the kernel panics!)
- PID namespaces can be nested
- A process ends up having multiple PIDs (one per namespace in which it is nested)
class: extra-details, deep-dive
PID namespace in action
Create a new PID namespace:
$ sudo unshare --pid --fork
(We need the --fork flag because the PID namespace is special.)
Check the process tree in the new namespace:
# ps faux
--
class: extra-details, deep-dive
🤔 Why do we see all the processes?!?
class: extra-details, deep-dive
PID namespaces and /proc
- Tools like `ps` rely on the `/proc` pseudo-filesystem
- Our new namespace still has access to the original `/proc`
- Therefore, it still sees host processes
- But it cannot affect them
  (try to `kill` a process: you will get `No such process`)
class: extra-details, deep-dive
PID namespaces, take 2
- This can be solved by mounting `/proc` in the namespace
- The `unshare` utility provides a convenience flag, `--mount-proc`
- This flag will mount `/proc` in the namespace
- It will also unshare the mount namespace, so that this mount is local
Try it:
$ sudo unshare --pid --fork --mount-proc
# ps faux
class: extra-details
OK, really, why do we need --fork?
It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!
The unshare tool calls the unshare syscall, then execs the new binary.
A process calling unshare to create new namespaces is moved to the new namespaces...
... Except for the PID namespace.
(Because this would change the current PID of the process from X to 1.)
The processes created by the new binary are placed into the new PID namespace.
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in ENOMEM.)
Without the --fork flag, the first command that we execute will be PID 1 ...
... And once it exits, we cannot create more processes in the namespace!
Check man 2 unshare and man pid_namespaces if you want more details.
IPC namespace
--
- Does anybody know about IPC?
--
- Does anybody care about IPC?
--
- Allows a process (or group of processes) to have its own:
- IPC semaphores
- IPC message queues
- IPC shared memory
... without risk of conflict with other instances.
-
Older versions of PostgreSQL cared about this.
No demo for that one.
User namespace
- Allows mapping UID/GID; e.g.:
  - UID 0→1999 in container C1 is mapped to UID 10000→11999 on the host
  - UID 0→1999 in container C2 is mapped to UID 12000→13999 on the host
  - etc.
- UID 0 in the container can still perform privileged operations in the container
  (for instance: setting up network interfaces)
- But outside of the container, it is a non-privileged user
- It also means that the UID in containers becomes unimportant
  (just use UID 0 in the container, since it gets squashed to a non-privileged user outside)
- Ultimately enables better privilege separation in container engines
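You can try this without any container engine: `unshare --map-root-user` maps your current UID to 0 inside a new user namespace. (Unprivileged user namespaces may be disabled on some distros, hence the fallback.)

```shell
echo "outside: uid=$(id -u)"
# Inside the new user namespace, our UID is mapped to 0 ("root")
inside=$(unshare --user --map-root-user sh -c 'id -u' 2>/dev/null \
         || echo "userns unavailable")
echo "inside:  $inside"          # "0" when the mapping works
```
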
class: extra-details, deep-dive
User namespace challenges
- UIDs need to be mapped when passed between processes or kernel subsystems
- Filesystem permissions and file ownership are more complicated
  .small[(e.g. when the same root filesystem is shared by multiple containers running with different UIDs)]
- With the Docker Engine:
  - some feature combinations are not allowed
    (e.g. user namespaces + host network namespace sharing)
  - user namespaces need to be enabled/disabled globally
    (when the daemon is started)
  - container images are stored separately
    (so the first time you toggle user namespaces, you need to re-pull images)
No demo for that one.
Time namespace
- Virtualize time
- Expose a slower/faster clock to some processes
  (e.g. for simulation purposes)
- Expose a clock offset to some processes
  (simulation, suspend/restore...)
Cgroup namespace
- Virtualizes access to `/proc/<PID>/cgroup`
- Lets containerized processes view their relative cgroup tree
Security features
- Namespaces and cgroups are not enough to ensure strong security
- We need extra mechanisms: capabilities, seccomp, LSMs
- These mechanisms were already used to harden security before containers existed
- They can be used together with containers
- Good container engines will automatically leverage these features
  (so that you don't have to worry about it)
Capabilities
-
In traditional UNIX, many operations are possible if and only if UID=0 (root)
-
Some of these operations are very powerful:
- changing file ownership, accessing all files ...
-
Some of these operations deal with system configuration, but can be abused:
- setting up network interfaces, mounting filesystems ...
-
Some of these operations are not very dangerous but are needed by servers:
- binding to a port below 1024.
-
Capabilities are per-process flags to allow these operations individually
Some capabilities
- `CAP_CHOWN`: arbitrarily change file ownership and permissions
- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions
- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.
- `CAP_NET_BIND_SERVICE`: bind to a port below 1024
See `man capabilities` for the full list and details.
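You can inspect a process's capability sets directly in `/proc`; they are exposed as hexadecimal bit masks, which `capsh --decode` can turn into names when it is installed:

```shell
# CapInh/CapPrm/CapEff/CapBnd/CapAmb are hex bit masks, one bit per capability
grep ^Cap /proc/self/status
# Decode the effective set into capability names, if capsh is available
if command -v capsh >/dev/null; then
  capsh --decode="$(awk '/^CapEff/ {print $2}' /proc/self/status)"
fi
```
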
Using capabilities
- Container engines will typically drop all "dangerous" capabilities
- You can then re-enable capabilities on a per-container basis, as needed
- With the Docker engine: `docker run --cap-add ...`
- From the shell: `capsh --drop=cap_net_admin --` or `capsh --drop=all --`
File capabilities
- It is also possible to give capabilities to executable files
- This is comparable to the SUID bit, but with finer grain
  (e.g., `setcap cap_net_raw+ep /bin/ping`)
- There are differences between permitted and inheritable capabilities... 🤔
class: extra-details
Capability sets
- Permitted set (= what a process could use, provided the file has the cap)
- Effective set (= what a process can actually use)
- Inheritable set (= capabilities preserved across execve calls)
- Bounding set (= system-wide limit on what can be acquired through execve / capset)
- Ambient set (= capabilities retained across execve for non-privileged users)
- Files can have permitted, effective, and inheritable capability sets
More about capabilities
- Capabilities manpage:
- Subtleties about `capsh`:
  https://sites.google.com/site/fullycapable/why-didnt-that-work
Seccomp
-
Seccomp is secure computing.
-
Achieve high level of security by restricting drastically available syscalls.
-
Original seccomp only allows
read(),write(),exit(),sigreturn(). -
The seccomp-bpf extension allows specifying custom filters with BPF rules.
-
This allows filtering by syscall, and by parameter.
-
BPF code can perform arbitrarily complex checks, quickly, and safely.
-
Container engines take care of this so you don't have to.
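You can check whether a process is running under a seccomp filter via `/proc`; for example, a container started with Docker's default profile typically shows mode 2:

```shell
# Seccomp: 0 = disabled, 1 = strict mode, 2 = filter mode (seccomp-bpf)
grep ^Seccomp: /proc/self/status
```
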
Linux Security Modules
-
The most popular ones are SELinux and AppArmor.
-
Red Hat distros generally use SELinux.
-
Debian distros (in particular, Ubuntu) generally use AppArmor.
-
LSMs add a layer of access control to all process operations.
-
Container engines take care of this so you don't have to.
???
:EN:Containers internals :EN:- Control groups (cgroups) :EN:- Linux kernel namespaces :FR:Fonctionnement interne des conteneurs :FR:- Les "control groups" (cgroups) :FR:- Les namespaces du noyau Linux