Deep dive into container internals
In this chapter, we will explain some of the fundamental building blocks of containers.
This will give you a solid foundation so you can:
- understand "what's going on" in complex situations,
- anticipate the behavior of containers (performance, security...) in new scenarios,
- implement your own container engine.
The last item should be done for educational purposes only!
There is no container code in the Linux kernel
- If we search for "container" in the Linux kernel code, we find:
  - generic code to manipulate data structures (linked lists, etc.),
  - unrelated concepts like "ACPI containers",
  - nothing relevant to "our" containers!
- Containers are composed using multiple independent features.
- On Linux, containers rely on namespaces, cgroups, and some filesystem magic.
- Security also requires features like capabilities, seccomp, LSMs...
Control groups
- Control groups provide resource metering and limiting.
- This covers:
  - "classic" compute resources like memory, CPU, I/O
  - system resources like the number of processes (PIDs)
  - "exotic" resources like GPU VRAM, huge pages, RDMA
  - other things like device node access (`/dev`) and perf events
Crowd control
- Control groups also allow grouping processes for special operations:
  - freeze (conceptually similar to a "mass-SIGSTOP/SIGCONT")
  - kill (safe mass-SIGKILL)
Generalities
- Cgroups form a hierarchy (a tree)
- We can create nodes in that hierarchy
- We can associate limits with a node
- We can move a process (or multiple processes) to a leaf
- The process (or processes) will then respect these limits
- We can check the current usage of each node
- In other words: limits are optional (if we only want accounting)
- When a process is created, it is placed in its parent's groups
- The main interface is a pseudo-filesystem (typically mounted on `/sys/fs/cgroup`)
Example
.small[
$ tree /sys/fs/cgroup/ -d
/sys/fs/cgroup/
├── init.scope
├── machine.slice
├── system.slice
│ ├── avahi-daemon.service
│ ├── ...
│ ├── docker-de3ee38bc8d90b7da218523004cae504a2fa821224fd49f53521d862db583fef.scope
│ ├── docker-e9e55ba69f0a4639793464972a8645cdb23ae9f60567384479a175e3226776b4.scope
│ ├── docker.service
│ ├── docker.socket
│ ├── ...
│ └── wpa_supplicant.service
└── user.slice
└── user-1000.slice
├── session-1.scope
└── user@1000.service
├── app.slice
│ └── ...
├── init.scope
└── session.slice
└── ...
]
class: extra-details, deep-dive
Cgroups v1 vs v2
- Cgroups v1 were the original implementation
  (back when Docker was created)
- Cgroups v2 are a huge refactor
  (development started in Linux 3.10; released in 4.5)
- Cgroups v2 have a number of differences:
  - single hierarchy (instead of one tree per controller)
  - processes can only be on leaf nodes (not inner nodes)
  - and of course many improvements / refactorings
- Cgroups v2 should be the default on all modern distros!
class: extra-details, deep-dive
Example of cgroup v1 hierarchy
The numbers are PIDs.
The names are the names of our nodes (arbitrarily chosen).
.small[
cpu memory
├── batch ├── stateless
│ ├── cryptoscam │ ├── 25
│ │ └── 52 │ ├── 26
│ └── ffmpeg │ ├── 27
│ ├── 109 │ ├── 52
│ └── 88 │ ├── 109
└── realtime │ └── 88
├── nginx └── databases
│ ├── 25 ├── 1008
│ ├── 26 └── 524
│ └── 27
├── postgres
│ └── 524
└── redis
└── 1008
]
CPU cgroup
- Keeps track of CPU time used by a group of processes
  (this is easier and more accurate than `getrusage` and `/proc`)
- Allows setting relative weights used by the scheduler
- Allows setting maximum time usage per time period
  (e.g. "50 ms every 100 ms", which would cap the group to 50% of one CPU core)
- Allows setting reservations and caps ("utilization clamping")
  (particularly relevant for realtime processes)
Checking current CPU limits
- Getting the cgroup for the current user session:
  `cat /proc/$$/cgroup`
  (it should start with `/user.slice/...`)
- Checking the current CPU limit:
  `cat /sys/fs/cgroup/user.slice/.../cpu.max`
  (it should look like `max 100000`)
- `max` means unlimited; `100000` means "over a period of 100000 microseconds"
  (unless specified otherwise, all cgroup time durations are in microseconds)
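The two checks above can be combined into a short script. This is a sketch assuming cgroups v2 mounted at `/sys/fs/cgroup`; depending on where the cpu controller is enabled, `cpu.max` may not exist at every level:

```shell
# Resolve this process's cgroup (the "0::" entry on cgroups v2)
# and print its CPU limit, if the cpu controller exposes one here.
cg=$(awk -F: '$1 == 0 {print $3}' /proc/self/cgroup)
echo "cgroup: $cg"
cat "/sys/fs/cgroup$cg/cpu.max" 2>/dev/null || echo "no cpu.max at this level"
```
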
Setting a CPU limit
- Run `top` in a terminal to view CPU usage
- In a separate terminal, burn CPU cycles with e.g.:
  `while : ; do : ; done`
- Set a 50% CPU limit for that user or session:
  `echo 50000 > /sys/fs/cgroup/user.slice/.../cpu.max`
- Notice that CPU usage goes down
  (probably to less than 50%, since this is a limit for the whole user/session!)
Removing the CPU limit
- Remember to remove the limit when you're done:
  `echo max > /sys/fs/cgroup/user.slice/.../cpu.max`
Cpuset cgroup
- Pin groups to specific CPU(s)
- Features:
  - limit apps to specific CPUs (`cpuset.cpus`)
  - reserve CPUs for exclusive use (`cpuset.cpus.exclusive`)
  - assign apps to specific NUMA memory nodes (`cpuset.mems`)
- Use-cases:
  - dedicate CPUs to avoid performance loss due to cache flushes
  - improve memory performance on NUMA systems
Cpuset concepts
- `cpuset.cpus` / `cpuset.mems` express what we allow the cgroup to use
  (can be empty to allow everything)
- `cpuset.cpus.effective` / `cpuset.mems.effective` express what the cgroup can actually use, after accounting for other restrictions
- `cpuset.cpus.exclusive` / `cpuset.cpus.partition` are used to create "partitions" = sets of CPU(s) exclusively reserved for a cgroup
Memory cgroup: accounting
- Keeps track of pages used by each group:
  - file (read/write/mmap from block devices)
  - anonymous (stack, heap, anonymous mmap)
  - active (recently accessed)
  - inactive (candidate for eviction)
  - ...many other categories!
- Each page is "charged" to a single group
  (this can result in non-deterministic "charges" for shared pages, e.g. mapped files)
- To view all the counters kept by this cgroup:
  `cat /sys/fs/cgroup/memory.stat`
Memory cgroup: limits and reservations
- Cgroups v1 allowed setting soft and hard limits
  (soft limits influenced reclaim, but they weren't straightforward to use)
- Cgroups v2 are way more sophisticated:
  - hard limits (`.max`)
  - thresholds triggering more evictions (`.high`)
  - thresholds triggering fewer evictions (`.low`)
  - reservations (`.min`)
- There are also limits for swap and zswap
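As a sketch, here is what setting these thresholds could look like. It requires root and a writable cgroups v2 hierarchy; the cgroup name `demo` is arbitrary, and the script skips itself gracefully when it can't run:

```shell
CG=/sys/fs/cgroup/demo
if [ "$(id -u)" = 0 ] && mkdir "$CG" 2>/dev/null; then
  echo $((100 * 1024 * 1024)) > "$CG/memory.max"   # hard limit: 100 MiB
  echo $(( 80 * 1024 * 1024)) > "$CG/memory.high"  # reclaim kicks in above 80 MiB
  cat "$CG/memory.max" "$CG/memory.high"
  rmdir "$CG"                                      # cleanup (cgroup must be empty)
else
  echo "skipping: need root and a writable cgroups v2 hierarchy"
fi
```
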
Hard limits
- A cgroup can never exceed its hard limits
- When a cgroup tries to use more than its hard limit:
  - the kernel tries to reclaim memory (buffers, mapped files...)
  - when there is nothing left to reclaim, the OOM killer is invoked
- There is a `memory.oom.group` flag to alter OOM behavior:
  - `0` (default) = kill processes one by one
  - `1` = consider the cgroup as a unit; OOM will kill it entirely
-
Also...
- A `.peak` value is also exposed for each tracked amount
  (memory, swap, zswap)
- Write an amount to `memory.reclaim` to trigger reclaim
  (i.e. ask the kernel to recover memory from the cgroup)
- Check memory stats per NUMA node (`memory.numa_stat`)
- And more!
Block I/O cgroup
- Keep track of I/Os for each group:
  - per block device
  - read, write, and discard
  - in bytes and in operations
- Set hard limits for each counter
- Set relative weights and latency targets
io.max
- Enforce hard limits
  (set a max number of operations, of bytes read/written...)
- Each limit is per-device
- Doesn't offer performance guarantees
  (once a device is saturated, performance will degrade for everyone)
io.cost.qos
- Try to offer latency guarantees
- Define per-device thresholds to throttle operations:
  "if the 95th percentile latency of read operations on this device is above 100 ms...
  ...throttle operations on this device (queue them)"
- Can also define `io.weight` for relative priorities between cgroups
- Check this document for some details and hints
Network I/O
- Cgroups v1 had the net_cls and net_prio controllers
- These have been deprecated in cgroups v2:
*There is no direct equivalent of the net_cls and net_prio controllers from cgroups version 1. Instead, support has been added to iptables(8) to allow eBPF filters that hook on cgroup v2 pathnames to make decisions about network traffic on a per-cgroup basis.*
Pid
- Limit (and count) the number of processes in a cgroup
- Protects against e.g. fork bombs
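A minimal sketch of the pids controller (requires root and a writable cgroups v2 hierarchy; the cgroup name `pids-demo` is arbitrary). Any process moved into this cgroup would be unable to spawn more than 10 tasks:

```shell
CG=/sys/fs/cgroup/pids-demo
if [ "$(id -u)" = 0 ] && mkdir "$CG" 2>/dev/null; then
  # Cap the cgroup at 10 processes/threads; forks beyond that fail with EAGAIN
  echo 10 > "$CG/pids.max" 2>/dev/null && cat "$CG/pids.max"
  rmdir "$CG"                                   # cleanup (cgroup must be empty)
else
  echo "skipping: need root and a writable cgroups v2 hierarchy"
fi
```
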
Devices
- We need to limit access to device nodes
- Containers should not be able to open e.g. disks and partitions directly
  (`/dev/sda*`, `/dev/nvme*`...)
- However, some devices are expected to be available at all times:
  `/dev/tty`, `/dev/zero`, `/dev/null`, `/dev/random`...
Cgroups v1
- There used to be a special "devices" control group
- It made it easy to grant read/write/mknod permissions
  (individually for each device and each container)
- Access could be granted/revoked/viewed through a pseudo-file:
  `echo 'c 1:3 mr' > /sys/fs/cgroup/.../devices.allow`
- This file doesn't exist anymore in cgroups v2!
Cgroups v2
- Device access is controlled with eBPF programs
  (there is a special program type, `cgroup_device`, for that purpose)
- This requires writing and compiling eBPF programs (😰)
- Viewing permissions requires disassembling eBPF programs (😱)
Viewing eBPF programs
- Install bpf tools (package name `bpftool` or `bpf`)
- View all eBPF programs attached to cgroups:
  `sudo bpftool cgroup tree`
- View eBPF programs attached to a Docker container:
  `sudo bpftool cgroup list /sys/fs/cgroup/system.slice/docker-<CONTAINER_ID>.scope`
- Disassemble an eBPF program:
  `sudo bpftool prog dump xlated id <ID>`
- Bonne chance 😬
Some interesting nodes
- `/dev/net/tun` (network interface manipulation)
- `/dev/fuse` (filesystems in user space)
- `/dev/kvm` (run VMs in containers)
- `/dev/dri` (GPU)
- `/dev/ttyUSB*`, `/dev/ttyACM*` (serial devices)
- `/dev/snd/*` (sound cards)
And the exotic ones...
- `rdma`: remote memory access, InfiniBand
- `dmem`: device memory (VRAM), relatively new
  (kernel 6.14, January 2025; only Intel and AMD GPUs for now)
- `hugetlb`: huge pages
- `perf_event`: performance profiling
- `misc`: generic cgroup for other discrete resources
  (extension point to plug in even more exotic resources)
Namespaces
- Provide processes with their own view of the system
- Namespaces limit what you can see (and therefore, what you can use)
- These namespaces are available in modern kernels:
  - pid
  - net
  - mnt
  - uts
  - ipc
  - user
  - time
  - cgroup
  (we are going to detail them individually)
- Each process belongs to one namespace of each type
Namespaces are always active
- Namespaces exist even when you don't use containers
- This is a bit similar to the UID field in UNIX processes:
  - all processes have the UID field, even if no user exists on the system
  - the field always has a value / the value is always defined
    (i.e. any process running on the system has some UID)
  - the value of the UID field is used when checking permissions
    (the UID field determines which resources the process can access)
- You can replace "UID field" with "namespace" above and it still works!
- In other words: even when you don't use containers,
  there is one namespace of each type, containing all the processes on the system
class: extra-details, deep-dive
Manipulating namespaces
- Namespaces are created with two methods:
  - the `clone()` system call (used when creating new threads and processes)
  - the `unshare()` system call
- The Linux tool `unshare` allows doing that from a shell
- A new process can re-use none / all / some of the namespaces of its parent
- It is possible to "enter" a namespace with the `setns()` system call
- The Linux tool `nsenter` allows doing that from a shell
class: extra-details, deep-dive
Namespaces lifecycle
- When the last process of a namespace exits, the namespace is destroyed
- All the associated resources are then removed
- Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`:
  `ls -l /proc/self/ns`
- It is possible to compare namespaces by checking these files
  (this helps answer the question, "are these two processes in the same namespace?")
- It is possible to preserve a namespace by bind-mounting its pseudo-file
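Such a comparison can be done from a shell, without privileges, by resolving the `ns` symlinks; two processes are in the same namespace when the links resolve to the same inode:

```shell
# Two ordinary processes (this shell and a child sh) share all of
# their parent's namespaces, so the pseudo-file links are identical.
a=$(readlink /proc/self/ns/uts)
b=$(sh -c 'readlink /proc/self/ns/uts')
echo "$a"                                 # something like uts:[4026531838]
[ "$a" = "$b" ] && echo "same uts namespace"
```
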
class: extra-details, deep-dive
Namespaces can be used independently
- As mentioned in the previous slides:
  a new process can re-use none / all / some of the namespaces of its parent
- It's possible to create e.g.:
  - mount namespaces to have a "private" `/tmp` for each user / app
  - network namespaces to isolate apps or give them special network access
- It's possible to use namespaces without cgroups
  (and entirely outside of container contexts)
UTS namespace
- gethostname / sethostname
- Allows setting a custom hostname for a container
- That's (mostly) it!
- Also allows setting the NIS domain
  (if you don't know what a NIS domain is, you don't have to worry about it!)
- If you're wondering: UTS = UNIX Time-Sharing
- This namespace was named after `struct utsname`,
  which is commonly used to obtain the machine's hostname, architecture, etc.
  (the more you know!)
class: extra-details, deep-dive
Creating our first namespace
Let's use unshare to create a new process that will have its own UTS namespace:
$ sudo unshare --uts
- We have to use `sudo` for most `unshare` operations
- We indicate that we want a new UTS namespace, and nothing else
- If we don't specify a program to run, a `$SHELL` is started
class: extra-details, deep-dive
Demonstrating our uts namespace
In our new "container", check the hostname, change it, and check it:
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
In another shell, check that the machine's hostname hasn't changed:
$ hostname
nodeX
Exit the "container" with exit or Ctrl-D.
Net namespace overview
- Each network namespace has its own private network stack
- The network stack includes:
  - network interfaces (including `lo`)
  - routing tables (as in `ip rule` etc.)
  - iptables chains and rules
  - sockets (as seen by `ss` or `netstat`)
- You can move a network interface from one network namespace to another:
  `ip link set dev eth0 netns PID`
Net namespace typical use
- Each container is given its own network namespace
- For each network namespace (i.e. each container), a `veth` pair is created
  (two `veth` interfaces act as if they were connected with a cross-over cable)
- One `veth` is moved to the container network namespace (and renamed `eth0`)
- The other `veth` is attached to a bridge on the host (e.g. the `docker0` bridge)
class: extra-details
Creating a network namespace
Start a new process with its own network namespace:
$ sudo unshare --net
See that this new network namespace is unconfigured:
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
class: extra-details
Creating the veth interfaces
In another shell (on the host), create a veth pair:
$ sudo ip link add name in_host type veth peer name in_netns
Configure the host side (in_host):
$ sudo ip link set in_host up
$ sudo ip addr add 172.22.0.1/24 dev in_host
class: extra-details
Moving the veth interface
In the process created by unshare, check the PID of our "network container":
# echo $$
533
On the host, move the other side (in_netns) to the network namespace:
$ sudo ip link set in_netns netns 533
(Make sure to update "533" with the actual PID obtained above!)
class: extra-details
Basic network configuration
Let's set up lo (the loopback interface):
# ip link set lo up
Activate the veth interface and rename it to eth0:
# ip link set in_netns name eth0 up
class: extra-details
Allocating IP address and default route
In the process created by unshare, configure the interface:
# ip addr add 172.22.0.2/24 dev eth0
# ip route add default via 172.22.0.1
(Make sure to update the IP addresses if necessary.)
Check that we can ping the host:
# ping 172.22.0.1
class: extra-details
Reaching the outside world
This requires two things:
- enabling forwarding on the host
- adding a masquerading (SNAT) rule for traffic coming from the namespace
If Docker is running on the host, we can also add the in_host interface
to the Docker bridge, and configure the in_netns interface with an
IP address belonging to the subnet of the Docker bridge!
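A hedged sketch of those two steps, assuming the `in_host` interface and the 172.22.0.0/24 subnet from the previous pages (requires root and iptables; the script skips itself otherwise):

```shell
if [ "$(id -u)" = 0 ] && command -v iptables >/dev/null \
   && [ -w /proc/sys/net/ipv4/ip_forward ]; then
  # 1. Enable IP forwarding on the host
  echo 1 > /proc/sys/net/ipv4/ip_forward
  # 2. Masquerade traffic from the namespace's subnet going out via the host
  iptables -t nat -A POSTROUTING -s 172.22.0.0/24 ! -o in_host -j MASQUERADE \
    || echo "iptables rule not applied"
else
  echo "skipping: need root, iptables, and a writable /proc/sys"
fi
```
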
class: extra-details
Cleaning up network namespaces
- Terminate the process created by `unshare` (with `exit` or `Ctrl-D`)
- Since this was the only process in the network namespace, the namespace is destroyed
- All the interfaces in the network namespace are destroyed
- When a `veth` interface is destroyed, it also destroys the other half of the pair
- So we don't have anything else to do to clean up!
Docker options leveraging network namespaces
- `--net none` gives an empty network namespace to a container
  (effectively isolating it completely from the network)
- `--net host` means "do not containerize the network"
  (no network namespace is created; the container uses the host network stack)
- `--net container` means "reuse the network namespace of another container"
  (as a result, both containers share the same interfaces, routes, etc.)
Mnt namespace
- Processes can have their own root fs (à la chroot)
- Processes can also have "private" mounts; this allows:
  - isolating `/tmp` (per user, per service...)
  - masking `/proc`, `/sys` (for processes that don't need them)
  - mounting remote filesystems or sensitive data,
    but making them visible only to allowed processes
- Mounts can be totally private, or shared
- For a long time, there was no easy way to "move" a mount to another namespace
- It's now possible; see justincormack/addmount for a simple example
class: extra-details, deep-dive
Setting up a private /tmp
Create a new mount namespace:
$ sudo unshare --mount
In that new namespace, mount a brand new /tmp:
# mount -t tmpfs none /tmp
Check the content of /tmp in the new namespace, and compare to the host.
The mount is automatically cleaned up when you exit the process.
PID namespace
- Processes within a PID namespace only "see" processes in the same PID namespace
- Each PID namespace has its own numbering (starting at 1)
- When PID 1 goes away, the whole namespace is killed
  (when PID 1 goes away on a normal UNIX system, the kernel panics!)
- PID namespaces can be nested
- A process ends up having multiple PIDs (one per namespace in which it is nested)
class: extra-details, deep-dive
PID namespace in action
Create a new PID namespace:
$ sudo unshare --pid --fork
(We need the --fork flag because the PID namespace is special.)
Check the process tree in the new namespace:
# ps faux
--
class: extra-details, deep-dive
🤔 Why do we see all the processes?!?
class: extra-details, deep-dive
PID namespaces and /proc
- Tools like `ps` rely on the `/proc` pseudo-filesystem
- Our new namespace still has access to the original `/proc`
- Therefore, it still sees host processes
- But it cannot affect them
  (try to `kill` a process: you will get `No such process`)
class: extra-details, deep-dive
PID namespaces, take 2
- This can be solved by mounting `/proc` in the namespace
- The `unshare` utility provides a convenience flag, `--mount-proc`
- This flag will mount `/proc` in the namespace
- It will also unshare the mount namespace, so that this mount is local
Try it:
$ sudo unshare --pid --fork --mount-proc
# ps faux
class: extra-details
OK, really, why do we need --fork?
It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!
The unshare tool calls the unshare syscall, then execs the new binary.
A process calling unshare to create new namespaces is moved to the new namespaces...
... Except for the PID namespace.
(Because this would change the current PID of the process from X to 1.)
The processes created by the new binary are placed into the new PID namespace.
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in ENOMEM.)
Without the --fork flag, the first command that we execute will be PID 1 ...
... And once it exits, we cannot create more processes in the namespace!
Check man 2 unshare and man pid_namespaces if you want more details.
IPC namespace
--
- Does anybody know about IPC?
--
- Does anybody care about IPC?
--
- Allows a process (or group of processes) to have its own:
- IPC semaphores
- IPC message queues
- IPC shared memory
... without risk of conflict with other instances.
-
Older versions of PostgreSQL cared about this.
No demo for that one.
User namespace
- Allows mapping UID/GID; e.g.:
  - UID 0→1999 in container C1 is mapped to UID 10000→11999 on the host
  - UID 0→1999 in container C2 is mapped to UID 12000→13999 on the host
  - etc.
- UID 0 in the container can still perform privileged operations in the container
  (for instance: setting up network interfaces)
- But outside of the container, it is a non-privileged user
- It also means that the UID in containers becomes unimportant
  (just use UID 0 in the container, since it gets squashed to a non-privileged user outside)
- Ultimately enables better privilege separation in container engines
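You can try this without any container engine: `unshare --map-root-user` maps your current UID to 0 inside a new user namespace. (Unprivileged user namespaces may be disabled on some distros, hence the fallback.)

```shell
echo "outside: uid=$(id -u)"
# Inside the new user namespace, our UID is mapped to 0 ("root")
inside=$(unshare --user --map-root-user sh -c 'id -u' 2>/dev/null \
         || echo "userns unavailable")
echo "inside:  $inside"          # "0" when the mapping works
```
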
class: extra-details, deep-dive
User namespace challenges
- UIDs need to be mapped when passed between processes or kernel subsystems
- Filesystem permissions and file ownership are more complicated
  .small[(e.g. when the same root filesystem is shared by multiple containers running with different UIDs)]
- With the Docker Engine:
  - some feature combinations are not allowed
    (e.g. user namespaces + host network namespace sharing)
  - user namespaces need to be enabled/disabled globally
    (when the daemon is started)
  - container images are stored separately
    (so the first time you toggle user namespaces, you need to re-pull images)
No demo for that one.
Time namespace
- Virtualize time
- Expose a slower/faster clock to some processes
  (e.g. for simulation purposes)
- Expose a clock offset to some processes
  (simulation, suspend/restore...)
Cgroup namespace
- Virtualizes access to `/proc/<PID>/cgroup`
- Lets containerized processes view their relative cgroup tree
Security features
- Namespaces and cgroups are not enough to ensure strong security
- We need extra mechanisms: capabilities, seccomp, LSMs
- These mechanisms were already used to harden security before containers existed
- They can be used together with containers
- Good container engines will automatically leverage these features
  (so that you don't have to worry about it)
Capabilities
-
In traditional UNIX, many operations are possible if and only if UID=0 (root)
-
Some of these operations are very powerful:
- changing file ownership, accessing all files ...
-
Some of these operations deal with system configuration, but can be abused:
- setting up network interfaces, mounting filesystems ...
-
Some of these operations are not very dangerous but are needed by servers:
- binding to a port below 1024.
-
Capabilities are per-process flags to allow these operations individually
Some capabilities
- `CAP_CHOWN`: arbitrarily change file ownership and permissions
- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions
- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.
- `CAP_NET_BIND_SERVICE`: bind to a port below 1024
See `man capabilities` for the full list and details.
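You can inspect a process's capability sets directly in `/proc`; they are exposed as hexadecimal bit masks, which `capsh --decode` can turn into names when it is installed:

```shell
# CapInh/CapPrm/CapEff/CapBnd/CapAmb are hex bit masks, one bit per capability
grep ^Cap /proc/self/status
# Decode the effective set into capability names, if capsh is available
if command -v capsh >/dev/null; then
  capsh --decode="$(awk '/^CapEff/ {print $2}' /proc/self/status)"
fi
```
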
Using capabilities
- Container engines will typically drop all "dangerous" capabilities
- You can then re-enable capabilities on a per-container basis, as needed
- With the Docker engine: `docker run --cap-add ...`
- From the shell: `capsh --drop=cap_net_admin --` or `capsh --drop=all --`
File capabilities
- It is also possible to give capabilities to executable files
- This is comparable to the SUID bit, but with finer grain
  (e.g., `setcap cap_net_raw+ep /bin/ping`)
- There are differences between permitted and inheritable capabilities... 🤔
class: extra-details
Capability sets
- Permitted set (= what a process could use, provided the file has the cap)
- Effective set (= what a process can actually use)
- Inheritable set (= capabilities preserved across execve calls)
- Bounding set (= system-wide limit on what can be acquired through execve / capset)
- Ambient set (= capabilities retained across execve for non-privileged users)
- Files can have permitted, effective, and inheritable capability sets
More about capabilities
- Capabilities manpage:
- Subtleties about `capsh`:
  https://sites.google.com/site/fullycapable/why-didnt-that-work
Seccomp
-
Seccomp is secure computing.
-
Achieve high level of security by restricting drastically available syscalls.
-
Original seccomp only allows
read(),write(),exit(),sigreturn(). -
The seccomp-bpf extension allows specifying custom filters with BPF rules.
-
This allows filtering by syscall, and by parameter.
-
BPF code can perform arbitrarily complex checks, quickly, and safely.
-
Container engines take care of this so you don't have to.
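You can check whether a process is running under a seccomp filter via `/proc`; for example, a container started with Docker's default profile typically shows mode 2:

```shell
# Seccomp: 0 = disabled, 1 = strict mode, 2 = filter mode (seccomp-bpf)
grep ^Seccomp: /proc/self/status
```
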
Linux Security Modules
-
The most popular ones are SELinux and AppArmor.
-
Red Hat distros generally use SELinux.
-
Debian distros (in particular, Ubuntu) generally use AppArmor.
-
LSMs add a layer of access control to all process operations.
-
Container engines take care of this so you don't have to.
???
:EN:Containers internals :EN:- Control groups (cgroups) :EN:- Linux kernel namespaces :FR:Fonctionnement interne des conteneurs :FR:- Les "control groups" (cgroups) :FR:- Les namespaces du noyau Linux