# Deep dive into container internals
In this chapter, we will explain some of the fundamental building blocks of containers.
This will give you a solid foundation so you can:
- understand "what's going on" in complex situations,
- anticipate the behavior of containers (performance, security...) in new scenarios,
- implement your own container engine.
The last item should be done for educational purposes only!
---
## There is no container code in the Linux kernel
- If we search "container" in the Linux kernel code, we find:
- generic code to manipulate data structures (like linked lists, etc.),
- unrelated concepts like "ACPI containers",
- *nothing* relevant to "our" containers!
- Containers are composed using multiple independent features.
- On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."
- Security also requires features like capabilities, seccomp, LSMs...
---
# Control groups
- Control groups provide resource *metering* and *limiting*.
- This covers:
- "classic" compute resources like memory, CPU, I/O
- system resources like number of processes (PID)
- "exotic" resources like GPU VRAM, huge pages, RDMA
- other things like device node access (`/dev`) and perf events
---
## Crowd control
- Control groups also allow grouping processes for special operations:
- freeze (conceptually similar to a "mass-SIGSTOP/SIGCONT")
- kill (safe mass-SIGKILL)
---
## Generalities
- Cgroups form a hierarchy (a tree)
- We can create nodes in that hierarchy
- We can associate limits with a node
- We can move a process (or multiple processes) to a leaf
- The process (or processes) will then respect these limits
- We can check the current usage of each node
- In other words: limits are optional (if we only want accounting)
- When a process is created, it is placed in the same cgroups as its parent
- The main interface is a pseudo-filesystem (typically mounted on `/sys/fs/cgroup`); see the sketch below
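For instance, here is a minimal sketch of that lifecycle (assuming cgroups v2, root privileges, and a cgroup that we arbitrarily name `demo`):
```bash
# Create a new cgroup (just a directory in the pseudo-filesystem)
mkdir /sys/fs/cgroup/demo
# Move the current shell into it
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
# Check the current memory usage of the cgroup (in bytes)
cat /sys/fs/cgroup/demo/memory.current
# Clean up: move the shell back, then remove the (now empty) cgroup
echo $$ > /sys/fs/cgroup/cgroup.procs
rmdir /sys/fs/cgroup/demo
```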
---
## Example
.small[
```bash
$ tree /sys/fs/cgroup/ -d
/sys/fs/cgroup/
├── init.scope
├── machine.slice
├── system.slice
│   ├── avahi-daemon.service
│   ├── ...
│   ├── docker-de3ee38bc8d90b7da218523004cae504a2fa821224fd49f53521d862db583fef.scope
│   ├── docker-e9e55ba69f0a4639793464972a8645cdb23ae9f60567384479a175e3226776b4.scope
│   ├── docker.service
│   ├── docker.socket
│   ├── ...
│   └── wpa_supplicant.service
└── user.slice
    └── user-1000.slice
        ├── session-1.scope
        └── user@1000.service
            ├── app.slice
            │   └── ...
            ├── init.scope
            └── session.slice
                └── ...
```
]
---
class: extra-details, deep-dive
## Cgroups v1 vs v2
- Cgroups v1 were the original implementation
(back when Docker was created)
- Cgroups v2 are a huge refactor
(development started in Linux 3.10, released in 4.5.)
- Cgroups v2 have a number of differences:
- single hierarchy (instead of one tree per controller)
- processes can only be on leaf nodes (not inner nodes)
- and of course many improvements / refactorings
- Cgroups v2 should be the default on all modern distros!
---
class: extra-details, deep-dive
## Example of cgroup v1 hierarchy
The numbers are PIDs.
The names are the names of our nodes (arbitrarily chosen).
.small[
```bash
cpu                     memory
├── batch               ├── stateless
│   ├── cryptoscam      │   ├── 25
│   │   └── 52          │   ├── 26
│   └── ffmpeg          │   ├── 27
│       ├── 109         │   ├── 52
│       └── 88          │   ├── 109
└── realtime            │   └── 88
    ├── nginx           └── databases
    │   ├── 25              ├── 1008
    │   ├── 26              └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
]
---
## CPU cgroup
- Keeps track of CPU time used by a group of processes
(this is easier and more accurate than `getrusage` and `/proc`)
- Allows setting relative weights used by the scheduler
- Allows setting maximum time usage per time period
(e.g. "50ms every 100ms", which would cap the group to 50% of one CPU core)
- Allows setting reservations and caps ("utilization clamping")
(particularly relevant for realtime processes)
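A minimal sketch, reusing our hypothetical `demo` cgroup (`cpu.weight` defaults to 100):
```bash
# Give this cgroup twice the default share of CPU when there is contention
echo 200 > /sys/fs/cgroup/demo/cpu.weight
# Hard cap: at most 50ms of CPU time every 100ms (50% of one core)
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max
```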
---
## Checking current CPU limits
- Getting the cgroup for the current user session:
```bash
cat /proc/$$/cgroup
```
(the cgroup path should start with `/user.slice/...`)
- Checking the current CPU limit:
```bash
cat /sys/fs/cgroup/user.slice/.../cpu.max
```
(it should look like `max 100000`)
- `max` means unlimited; `100000` means "over a period of 100000 microseconds"
(unless specified, all cgroup time durations are in microseconds)
---
## Setting a CPU limit
- Run `top` in a terminal to view CPU usage
- In a separate terminal, burn CPU cycles with e.g.:
```bash
while : ; do : ; done
```
- Set a 50% CPU limit for that user or session:
```bash
echo 50000 > /sys/fs/cgroup/user.slice/.../cpu.max
```
- Notice that CPU usage goes down
(probably to *less* than 50% since this is a limit for the whole user/session!)
---
## Removing the CPU limit
- Remember to remove the limit when you're done:
```bash
echo max > /sys/fs/cgroup/user.slice/.../cpu.max
```
---
## Cpuset cgroup
- Pin groups to specific CPU(s)
- Features:
- limit apps to specific CPUs (`cpuset.cpus`)
- reserve CPUs for exclusive use (`cpuset.cpus.exclusive`)
- assign apps to specific NUMA memory nodes (`cpuset.mems`)
- Use-cases:
- dedicate CPUs to avoid performance loss due to cache flushes
- improve memory performance in NUMA systems
---
## Cpuset concepts
- `cpuset.cpus` / `cpuset.mems`
*express what we allow the cgroup to use (can be empty to allow everything)*
- `cpuset.cpus.effective` / `cpuset.mems.effective`
*express what the cgroup can actually use after accounting for other restrictions*
- `cpuset.cpus.exclusive` / `cpuset.cpus.partition`
*used to create "partitions" = sets of CPU(s) exclusively reserved for a cgroup*
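A minimal sketch with our hypothetical `demo` cgroup (on many systems, the cpuset controller must be enabled for the subtree first):
```bash
# Enable the cpuset controller for children of the root cgroup (if needed)
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
# Restrict the cgroup to CPUs 0 and 1, and to NUMA node 0
echo 0-1 > /sys/fs/cgroup/demo/cpuset.cpus
echo 0 > /sys/fs/cgroup/demo/cpuset.mems
# See what the cgroup can actually use, after other restrictions
cat /sys/fs/cgroup/demo/cpuset.cpus.effective
```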
---
## Memory cgroup: accounting
- Keeps track of pages used by each group:
- file (read/write/mmap from block devices)
- anonymous (stack, heap, anonymous mmap)
- active (recently accessed)
- inactive (candidate for eviction)
- ...many other categories!
- Each page is "charged" to a single group
(this can result in non-deterministic "charges" for shared pages, e.g. mapped files)
- To view all the counters kept by this cgroup:
```bash
$ cat /sys/fs/cgroup/memory.stat
```
---
## Memory cgroup: limits and reservations
- Cgroups v1 allowed setting soft and hard limits
(soft limits influenced reclaim but it wasn't straightforward to use)
- Cgroups v2 are way more sophisticated:
- hard limits (`.max`)
- thresholds triggering more evictions (`.high`)
- thresholds triggering fewer evictions (`.low`)
- reservations (`.min`)
- Also limits for swap and zswap
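For instance, with our hypothetical `demo` cgroup (values accept `K`/`M`/`G` suffixes):
```bash
echo 500M > /sys/fs/cgroup/demo/memory.max   # hard limit (reclaim, then OOM)
echo 400M > /sys/fs/cgroup/demo/memory.high  # throttle + reclaim above this
echo 100M > /sys/fs/cgroup/demo/memory.low   # best-effort protection
echo 50M  > /sys/fs/cgroup/demo/memory.min   # hard reservation
```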
---
## Hard limits
- A cgroup can *never* exceed its hard limits
- When a cgroup tries to use more than the hard limit:
- the kernel tries to reclaim memory (buffers, mapped files...)
- when there is nothing to reclaim, the OOM killer is invoked
- There is a `memory.oom.group` flag to alter OOM behavior:
- `0` (default) = kill processes one by one
- `1` = consider the cgroup as a unit; OOM will kill it entirely
---
## Also...
- A `.peak` value is also exposed for each tracked amount
(memory, swap, zswap)
- Write an amount to `memory.reclaim` to trigger reclaim
(=ask the kernel to recover memory from the cgroup)
- Check memory stats per NUMA node (`memory.numa_stat`)
- And more!
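For example (both files are fairly recent additions, appearing around kernel 5.19; `demo` is still our hypothetical cgroup):
```bash
# Ask the kernel to try to reclaim 100M from this cgroup
echo 100M > /sys/fs/cgroup/demo/memory.reclaim
# Highest memory usage observed since the cgroup was created
cat /sys/fs/cgroup/demo/memory.peak
```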
---
## Block I/O cgroup
- Keep track of I/Os for each group:
- per block device
- read, write, and discard
- in bytes and in operations
- Set hard limits for each counter
- Set relative weights and latency targets
---
## `io.max`
- Enforce hard limits
(set max number of operations, of bytes read/written...)
- Each limit is per-device
- Doesn't offer performance guarantees
(once a device is saturated, performance will degrade for everyone)
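A sketch of the syntax (limits are keyed by major:minor device numbers, which `lsblk` can show; `8:16` is typically `sdb`, and the io controller must be enabled for our hypothetical `demo` cgroup):
```bash
# Cap reads at 2MB/s and writes at 120 IOPS on device 8:16
echo "8:16 rbps=2097152 wiops=120" > /sys/fs/cgroup/demo/io.max
```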
---
## `io.cost.qos`
- Try to offer latency guarantees
- Define per-device thresholds to throttle operations
"if the 95% percentile latency of read operations on this device
is above 100ms...
...throttle operations on this device (queue them)"
- Can also define `io.weight` for relative priorities between cgroups
- Check [this document](https://facebookmicrosites.github.io/resctl-demo-website/docs/demo_docs/setting_benchmarks/iocost/) for some details and hints
---
## Network I/O
- Cgroups v1 had net_cls and net_prio controllers
- These have been deprecated in cgroups v2:
*There is no direct equivalent of the net_cls and net_prio
controllers from cgroups version 1. Instead, support has been
added to iptables(8) to allow eBPF filters that hook on cgroup v2
pathnames to make decisions about network traffic on a per-cgroup
basis.*
---
## Pid
- Limit (and count) number of processes in a cgroup
- Protects against e.g. fork bombs
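For example, with our hypothetical `demo` cgroup:
```bash
# Allow at most 100 processes/threads in this cgroup
echo 100 > /sys/fs/cgroup/demo/pids.max
# See how many are currently in it
cat /sys/fs/cgroup/demo/pids.current
```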
---
## Devices
- We need to limit access to device nodes
- Containers should not be able to open e.g. disks and partitions directly
(/dev/sda\*, /dev/nvme\*...)
- However, some devices are expected to be available at all times:
/dev/tty, /dev/zero, /dev/null, /dev/random...
---
## Cgroups v1
- There used to be a special "devices" control group
- It made it easy to grant read/write/mknod permissions
(individually for each device and each container)
- Access could be granted/revoked/viewed through a pseudo-file:
```bash
echo 'c 1:3 mr' > /sys/fs/cgroup/.../devices.allow
```
- This file doesn't exist anymore in cgroups v2!
---
## Cgroups v2
- Device access is controlled with eBPF programs
(there is a special program type, [`cgroup_device`][bpf-cgroup-device], for that purpose)
- This requires writing and compiling eBPF programs (😰)
- Viewing permissions requires disassembling eBPF programs (😱)
[bpf-cgroup-device]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_DEVICE/
---
## Viewing eBPF programs
- Install bpf tools (package name `bpftool` or `bpf`)
- View all eBPF programs attached to cgroups:
```bash
sudo bpftool cgroup tree
```
- View eBPF programs attached to a Docker container:
```bash
sudo bpftool cgroup list /sys/fs/cgroup/system.slice/docker-<CONTAINER_ID>.scope
```
- Disassemble an eBPF program:
```bash
sudo bpftool prog dump xlated id <ID>
```
- *Good luck* 😬
---
## Some interesting nodes
- `/dev/net/tun` (network interface manipulation)
- `/dev/fuse` (filesystems in user space)
- `/dev/kvm` (run VMs in containers)
- `/dev/dri` (GPU)
- `/dev/ttyUSB*`, `/dev/ttyACM*` (serial devices)
- `/dev/snd/*` (sound cards)
---
## And the exotic ones...
- `rdma`: remote memory access (RDMA), InfiniBand
- `dmem`: device memory (VRAM), relatively new
(kernel 6.14, January 2025; only Intel and AMD GPUs for now)
- `hugetlb`: huge pages
- `perf_event`: [performance profiling](https://perfwiki.github.io/main/)
- `misc`: generic cgroup for other discrete resources
(extension point to plug even more exotic resources)
---
# Namespaces
- Provide processes with their own view of the system
- Namespaces limit what you can see (and therefore, what you can use)
- These namespaces are available in modern kernels:
- pid
- net
- mnt
- uts
- ipc
- user
- time
- cgroup
(we are going to detail them individually)
- Each process belongs to one namespace of each type
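To peek at them, we can use `lsns` (from util-linux):
```bash
# List all namespaces (as root, shows namespaces of all processes)
sudo lsns
# Show only the network namespaces
sudo lsns --type net
```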
---
## Namespaces are always active
- Namespaces exist even when you don't use containers
- This is a bit similar to the UID field in UNIX processes:
- all processes have the UID field, even if no user exists on the system
- the field always has a value / the value is always defined
<br/>
(i.e. any process running on the system has some UID)
- the value of the UID field is used when checking permissions
<br/>
(the UID field determines which resources the process can access)
- You can replace "UID field" with "namespace" above and it still works!
- In other words: even when you don't use containers,
<br/>there is one namespace of each type, containing all the processes on the system
---
class: extra-details, deep-dive
## Manipulating namespaces
- Namespaces are created with two methods:
- the `clone()` system call (used when creating new threads and processes)
- the `unshare()` system call
- The Linux tool `unshare` allows doing that from a shell
- A new process can re-use none / all / some of the namespaces of its parent
- It is possible to "enter" a namespace with the `setns()` system call
- The Linux tool `nsenter` allows doing that from a shell
---
class: extra-details, deep-dive
## Namespaces lifecycle
- When the last process of a namespace exits, the namespace is destroyed
- All the associated resources are then removed
- Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`.
```bash
ls -l /proc/self/ns
```
- It is possible to compare namespaces by checking these files
(this helps to answer the question, "are these two processes in the same namespace?")
- It is possible to preserve a namespace by bind-mounting its pseudo-file
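The bind-mount trick looks like this (a sketch; `$TARGET_PID` and the mount point are placeholders, and this is essentially what `ip netns add` does under the hood):
```bash
# Create an empty file to serve as a mount point
sudo touch /run/keptns
# The network namespace now survives even if all its processes exit
sudo mount --bind /proc/$TARGET_PID/ns/net /run/keptns
```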
---
class: extra-details, deep-dive
## Namespaces can be used independently
- As mentioned in the previous slides:
*a new process can re-use none / all / some of the namespaces of its parent*
- It's possible to create e.g.:
- mount namespaces to have "private" `/tmp` for each user / app
- network namespaces to isolate apps or give them a special network access
- It's possible to use namespaces without cgroups
(and totally outside of container contexts)
---
## UTS namespace
- gethostname / sethostname
- Allows setting a custom hostname for a container
- That's (mostly) it!
- Also allows setting the NIS domain
(if you don't know what a NIS domain is, you don't have to worry about it!)
- If you're wondering: UTS = UNIX time sharing
- This namespace was named like this because of the `struct utsname`,
<br/>
which is commonly used to obtain the machine's hostname, architecture, etc.
(the more you know!)
---
class: extra-details, deep-dive
## Creating our first namespace
Let's use `unshare` to create a new process that will have its own UTS namespace:
```bash
$ sudo unshare --uts
```
- We have to use `sudo` for most `unshare` operations
- We indicate that we want a new uts namespace, and nothing else
- If we don't specify a program to run, a `$SHELL` is started
---
class: extra-details, deep-dive
## Demonstrating our uts namespace
In our new "container", check the hostname, change it, and check it:
```bash
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
```
In another shell, check that the machine's hostname hasn't changed:
```bash
$ hostname
nodeX
```
Exit the "container" with `exit` or `Ctrl-D`.
---
## Net namespace overview
- Each network namespace has its own private network stack
- The network stack includes:
- network interfaces (including `lo`)
- routing table**s** (as in `ip rule` etc.)
- iptables chains and rules
- sockets (as seen by `ss`, `netstat`)
- You can move a network interface from a network namespace to another:
```bash
ip link set dev eth0 netns PID
```
---
## Net namespace typical use
- Each container is given its own network namespace
- For each network namespace (i.e. each container), a `veth` pair is created
(two `veth` interfaces act as if they were connected with a cross-over cable)
- One `veth` is moved to the container network namespace (and renamed `eth0`)
- The other `veth` stays on the host and is attached to a bridge (e.g. the `docker0` bridge)
---
class: extra-details
## Creating a network namespace
Start a new process with its own network namespace:
```bash
$ sudo unshare --net
```
See that this new network namespace is unconfigured:
```bash
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
```
---
class: extra-details
## Creating the `veth` interfaces
In another shell (on the host), create a `veth` pair:
```bash
$ sudo ip link add name in_host type veth peer name in_netns
```
Configure the host side (`in_host`):
```bash
$ sudo ip link set in_host up
$ sudo ip addr add 172.22.0.1/24 dev in_host
```
---
class: extra-details
## Moving the `veth` interface
*In the process created by `unshare`,* check the PID of our "network container":
```bash
# echo $$
533
```
*On the host*, move the other side (`in_netns`) to the network namespace:
```bash
$ sudo ip link set in_netns netns 533
```
(Make sure to update "533" with the actual PID obtained above!)
---
class: extra-details
## Basic network configuration
Let's set up `lo` (the loopback interface):
```bash
# ip link set lo up
```
Activate the `veth` interface and rename it to `eth0`:
```bash
# ip link set in_netns name eth0 up
```
---
class: extra-details
## Allocating IP address and default route
*In the process created by `unshare`,* configure the interface:
```bash
# ip addr add 172.22.0.2/24 dev eth0
# ip route add default via 172.22.0.1
```
(Make sure to update the IP addresses if necessary.)
Check that we can ping the host:
```bash
# ping 172.22.0.1
```
---
class: extra-details
## Reaching the outside world
This requires us to:
- enable forwarding on the host
- add a masquerading (SNAT) rule for traffic coming from the namespace
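On the host, that could look like this (a sketch, assuming the 172.22.0.0/24 subnet used earlier):
```bash
# Enable IPv4 forwarding
sudo sysctl -w net.ipv4.ip_forward=1
# Masquerade traffic coming from the namespace's subnet
sudo iptables -t nat -A POSTROUTING -s 172.22.0.0/24 -j MASQUERADE
```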
If Docker is running on the host, we can also add the `in_host` interface
to the Docker bridge, and configure the `in_netns` interface with an
IP address belonging to the subnet of the Docker bridge!
---
class: extra-details
## Cleaning up network namespaces
- Terminate the process created by `unshare` (with `exit` or `Ctrl-D`).
- Since this was the only process in the network namespace, it is destroyed.
- All the interfaces in the network namespace are destroyed.
- When a `veth` interface is destroyed, it also destroys the other half of the pair.
- So we don't have anything else to do to clean up!
---
## Docker options leveraging network namespaces
- `--net none` gives an empty network namespace to a container
(effectively isolating it completely from the network)
- `--net host` means "do not containerize the network"
(no network namespace is created; the container uses the host network stack)
- `--net container` means "reuse the network namespace of another container"
(as a result, both containers share the same interfaces, routes, etc.)
---
## Mnt namespace
- Processes can have their own root fs (à la chroot)
- Processes can also have "private" mounts; this allows:
- isolating `/tmp` (per user, per service...)
- masking `/proc`, `/sys` (for processes that don't need them)
- mounting remote filesystems or sensitive data,
<br/>making them visible only to allowed processes
- Mounts can be totally private, or shared
- For a long time, there was no easy way to "move" a mount to another namespace
- It's now possible; see [justincormack/addmount](https://github.com/justincormack/addmount) for a simple example
---
class: extra-details, deep-dive
## Setting up a private `/tmp`
Create a new mount namespace:
```bash
$ sudo unshare --mount
```
In that new namespace, mount a brand new `/tmp`:
```bash
# mount -t tmpfs none /tmp
```
Check the content of `/tmp` in the new namespace, and compare to the host.
The mount is automatically cleaned up when you exit the process.
---
## PID namespace
- Processes within a PID namespace only "see" processes
in the same PID namespace
- Each PID namespace has its own numbering (starting at 1)
- When PID 1 goes away, the whole namespace is killed
(when PID 1 goes away on a normal UNIX system, the kernel panics!)
- Those namespaces can be nested
- A process ends up having multiple PIDs (one per namespace in which it is nested)
---
class: extra-details, deep-dive
## PID namespace in action
Create a new PID namespace:
```bash
$ sudo unshare --pid --fork
```
(We need the `--fork` flag because the PID namespace is special.)
Check the process tree in the new namespace:
```bash
# ps faux
```
--
class: extra-details, deep-dive
🤔 Why do we see all the processes?!?
---
class: extra-details, deep-dive
## PID namespaces and `/proc`
- Tools like `ps` rely on the `/proc` pseudo-filesystem
- Our new namespace still has access to the original `/proc`
- Therefore, it still sees host processes
- But it cannot affect them
(try to `kill` a process: you will get `No such process`)
---
class: extra-details, deep-dive
## PID namespaces, take 2
- This can be solved by mounting `/proc` in the namespace
- The `unshare` utility provides a convenience flag, `--mount-proc`
- This flag will mount `/proc` in the namespace
- It will also unshare the mount namespace, so that this mount is local
Try it:
```bash
$ sudo unshare --pid --fork --mount-proc
# ps faux
```
---
class: extra-details
## OK, really, why do we need `--fork`?
*It is not necessary to remember all these details.
<br/>
This is just an illustration of the complexity of namespaces!*
The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary.
<br/>
A process calling `unshare` to create new namespaces is moved to the new namespaces...
<br/>
... Except for the PID namespace.
<br/>
(Because this would change the current PID of the process from X to 1.)
The processes created by the new binary are placed into the new PID namespace.
<br/>
The first one will be PID 1.
<br/>
If PID 1 exits, it is not possible to create additional processes in the namespace.
<br/>
(Attempting to do so will result in `ENOMEM`.)
Without the `--fork` flag, the first command that we execute will be PID 1 ...
<br/>
... And once it exits, we cannot create more processes in the namespace!
Check `man 2 unshare` and `man pid_namespaces` if you want more details.
---
## IPC namespace
--
- Does anybody know about IPC?
--
- Does anybody *care* about IPC?
--
- Allows a process (or group of processes) to have its own:
- IPC semaphores
- IPC message queues
- IPC shared memory
... without risk of conflict with other instances.
- Older versions of PostgreSQL cared about this.
*No demo for that one.*
---
## User namespace
- Allows mapping UID/GID; e.g.:
- UID 0→1999 in container C1 is mapped to UID 10000→11999 on host
- UID 0→1999 in container C2 is mapped to UID 12000→13999 on host
- etc.
- UID 0 in the container can still perform privileged operations in the container
(for instance: setting up network interfaces)
- But outside of the container, it is a non-privileged user
- It also means that the UID in containers becomes unimportant
(just use UID 0 in the container, since it gets squashed to a non-privileged user outside)
- Ultimately enables better privilege separation in container engines
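We can get a feel for this without any container engine (`unshare --map-root-user` maps our current UID to UID 0 inside the new user namespace):
```bash
# No sudo needed: on most distros, creating a user namespace is unprivileged
unshare --user --map-root-user sh -c 'id; cat /proc/self/uid_map'
```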
---
class: extra-details, deep-dive
## User namespace challenges
- UID needs to be mapped when passed between processes or kernel subsystems
- Filesystem permissions and file ownership are more complicated
.small[(e.g. when the same root filesystem is shared by multiple containers
running with different UIDs)]
- With the Docker Engine:
- some feature combinations are not allowed
<br/>
(e.g. user namespace + host network namespace sharing)
- user namespaces need to be enabled/disabled globally
<br/>
(when the daemon is started)
- container images are stored separately
<br/>
(so the first time you toggle user namespaces, you need to re-pull images)
*No demo for that one.*
---
## Time namespace
- Virtualize time
- Expose a per-namespace offset for certain clocks
(`CLOCK_MONOTONIC` and `CLOCK_BOOTTIME`; wall-clock time is *not* virtualized)
- Useful for e.g. simulation purposes, or suspend/restore of containers
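With a recent util-linux (2.36 or later) we can try it; a sketch (the offset is in seconds):
```bash
# Start a process whose boottime clock is shifted 10 days into the future
sudo unshare --time --fork --boottime 864000 uptime
```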
---
## Cgroup namespace
- Virtualize access to `/proc/<PID>/cgroup`
- Lets containerized processes view their relative cgroup tree
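A quick way to see the difference:
```bash
# Without a cgroup namespace: the full cgroup path is visible
cat /proc/self/cgroup
# With one: the process sees itself at the root of the hierarchy
sudo unshare --cgroup cat /proc/self/cgroup
```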
---
# Security features
- Namespaces and cgroups are not enough to ensure strong security
- We need extra mechanisms: capabilities, seccomp, LSMs
- These mechanisms were already used before containers to harden security
- They can be used together with containers
- Good container engines will automatically leverage these features
(so that you don't have to worry about it)
---
## Capabilities
- In traditional UNIX, many operations are possible if and only if UID=0 (root)
- Some of these operations are very powerful:
- changing file ownership, accessing all files ...
- Some of these operations deal with system configuration, but can be abused:
- setting up network interfaces, mounting filesystems ...
- Some of these operations are not very dangerous but are needed by servers:
- binding to a port below 1024.
- Capabilities are per-process flags to allow these operations individually
---
## Some capabilities
- `CAP_CHOWN`: arbitrarily change file ownership and permissions
- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions
- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.
- `CAP_NET_BIND_SERVICE`: bind a port below 1024
See `man capabilities` for the full list and details
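To inspect the capabilities of a running process (the bitmask passed to `--decode` is just an example value):
```bash
# Show the capability sets of the current shell (as bitmasks)
grep Cap /proc/$$/status
# Decode a bitmask into capability names
capsh --decode=00000000a80425fb
```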
---
## Using capabilities
- Container engines will typically drop all "dangerous" capabilities
- You can then re-enable capabilities on a per-container basis, as needed
- With the Docker engine: `docker run --cap-add ...`
- From the shell:
`capsh --drop=cap_net_admin --`
`capsh --drop=all --`
---
## File capabilities
- It is also possible to give capabilities to executable files
- This is comparable to the SUID bit, but with finer grain
(e.g., `setcap cap_net_raw+ep /bin/ping`)
- There are differences between *permitted* and *inheritable* capabilities...
🤔
---
class: extra-details
## Capability sets
- Permitted set (=what a process may use; a capability must be here to become effective)
- Effective set (=what the process can actually use right now)
- Inheritable set (=capabilities preserved across `execve` calls)
- Bounding set (=per-process ceiling on what can be acquired through `execve` / `capset`)
- Ambient set (=capabilities retained across `execve` for non-privileged programs)
- Files can have *permitted*, *effective*, *inheritable* capability sets
---
## More about capabilities
- Capabilities manpage:
https://man7.org/linux/man-pages/man7/capabilities.7.html
- Subtleties about `capsh`:
https://sites.google.com/site/fullycapable/why-didnt-that-work
---
## Seccomp
- Seccomp stands for "secure computing" mode.
- It achieves a high level of security by drastically restricting the available syscalls.
- Original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()`.
- The seccomp-bpf extension allows specifying custom filters with BPF rules.
- This allows filtering by syscall, and by parameter.
- BPF code can perform arbitrarily complex checks, quickly, and safely.
- Container engines take care of this so you don't have to.
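For instance, with the Docker Engine, the default seccomp profile can be replaced or disabled per container (shown here only for illustration):
```bash
# Run a container without seccomp filtering (not recommended!)
docker run --rm --security-opt seccomp=unconfined alpine true
# Or with a custom profile
docker run --rm --security-opt seccomp=./myprofile.json alpine true
```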
---
## Linux Security Modules
- The most popular ones are SELinux and AppArmor.
- Red Hat distros generally use SELinux.
- Debian distros (in particular, Ubuntu) generally use AppArmor.
- LSMs add a layer of access control to all process operations.
- Container engines take care of this so you don't have to.
???
:EN:Containers internals
:EN:- Control groups (cgroups)
:EN:- Linux kernel namespaces
:FR:Fonctionnement interne des conteneurs
:FR:- Les "control groups" (cgroups)
:FR:- Les namespaces du noyau Linux