mirror of
https://github.com/jpetazzo/container.training.git
synced 2026-05-11 11:26:41 +00:00
1279 lines
28 KiB
Markdown
# Deep dive into container internals
|
|
|
|
In this chapter, we will explain some of the fundamental building blocks of containers.
|
|
|
|
This will give you a solid foundation so you can:
|
|
|
|
- understand "what's going on" in complex situations,
|
|
|
|
- anticipate the behavior of containers (performance, security...) in new scenarios,
|
|
|
|
- implement your own container engine.
|
|
|
|
The last item should be done for educational purposes only!
|
|
|
|
---
|
|
|
|
## There is no container code in the Linux kernel
|
|
|
|
- If we search for "container" in the Linux kernel code, we find:
|
|
|
|
- generic code to manipulate data structures (like linked lists, etc.),
|
|
|
|
- unrelated concepts like "ACPI containers",
|
|
|
|
- *nothing* relevant to "our" containers!
|
|
|
|
- Containers are composed using multiple independent features.
|
|
|
|
- On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."
|
|
|
|
- Security also requires features like capabilities, seccomp, LSMs...
|
|
|
|
---
|
|
|
|
# Control groups
|
|
|
|
- Control groups provide resource *metering* and *limiting*.
|
|
|
|
- This covers:
|
|
|
|
- "classic" compute resources like memory, CPU, I/O
|
|
|
|
- system resources like number of processes (PID)
|
|
|
|
- "exotic" resources like GPU VRAM, huge pages, RDMA
|
|
|
|
- other things like device node access (`/dev`) and perf events
|
|
|
|
---
|
|
|
|
## Crowd control
|
|
|
|
- Control groups also allow grouping processes for special operations:
|
|
|
|
- freeze (conceptually similar to a "mass-SIGSTOP/SIGCONT")
|
|
|
|
- kill (safe mass-SIGKILL)
|
|
|
|
---
|
|
|
|
## Generalities
|
|
|
|
- Cgroups form a hierarchy (a tree)
|
|
|
|
- We can create nodes in that hierarchy
|
|
|
|
- We can associate limits to a node
|
|
|
|
- We can move a process (or multiple processes) to a leaf
|
|
|
|
- The process (or processes) will then respect these limits
|
|
|
|
- We can check the current usage of each node
|
|
|
|
- In other words: limits are optional (if we only want accounting)
|
|
|
|
- When a process is created, it is placed in its parent's groups
|
|
|
|
- The main interface is a pseudo-filesystem (typically mounted on `/sys/fs/cgroup`)
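
The operations listed above map to plain file operations on the pseudo-filesystem. Here is a minimal sketch, assuming cgroups v2; the helper names and the `CGROUP_ROOT` variable are made up for this example, and on a real cgroup filesystem these operations typically require root:

```bash
# Root of the cgroup hierarchy; override it to experiment on a scratch directory.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# Create a node in the hierarchy (a cgroup is just a directory).
cg_create() {
  mkdir -p "$CGROUP_ROOT/$1"
}

# Move a process into a node by writing its PID to cgroup.procs.
cg_add_pid() {
  echo "$2" > "$CGROUP_ROOT/$1/cgroup.procs"
}

# List the PIDs currently in a node.
cg_list_pids() {
  cat "$CGROUP_ROOT/$1/cgroup.procs"
}
```

For instance, in a root shell, `cg_create demo` followed by `cg_add_pid demo $$` would move the current shell to a (hypothetical) `demo` cgroup.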
|
|
|
|
---
|
|
|
|
## Example
|
|
|
|
.small[
|
|
```bash
|
|
$ tree /sys/fs/cgroup/ -d
|
|
/sys/fs/cgroup/
|
|
├── init.scope
|
|
├── machine.slice
|
|
├── system.slice
|
|
│ ├── avahi-daemon.service
|
|
│ ├── ...
|
|
│ ├── docker-de3ee38bc8d90b7da218523004cae504a2fa821224fd49f53521d862db583fef.scope
|
|
│ ├── docker-e9e55ba69f0a4639793464972a8645cdb23ae9f60567384479a175e3226776b4.scope
|
|
│ ├── docker.service
|
|
│ ├── docker.socket
|
|
│ ├── ...
|
|
│ └── wpa_supplicant.service
|
|
└── user.slice
|
|
└── user-1000.slice
|
|
├── session-1.scope
|
|
└── user@1000.service
|
|
├── app.slice
|
|
│ └── ...
|
|
├── init.scope
|
|
└── session.slice
|
|
└── ...
|
|
```
|
|
]
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Cgroups v1 vs v2
|
|
|
|
- Cgroups v1 were the original implementation
|
|
|
|
(back when Docker was created)
|
|
|
|
- Cgroups v2 are a huge refactor
|
|
|
|
(development started in Linux 3.10, released in 4.5.)
|
|
|
|
- Cgroups v2 have a number of differences:
|
|
|
|
- single hierarchy (instead of one tree per controller)
|
|
|
|
- processes can only be on leaf nodes (not inner nodes)
|
|
|
|
- and of course many improvements / refactorings
|
|
|
|
- Cgroups v2 should be the default on all modern distros!
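
To check which version a machine uses, one common trick is to look at the filesystem type mounted on `/sys/fs/cgroup`. This is a sketch, assuming GNU `stat`; the function name is made up for this example:

```bash
# Print "v2", "v1", or "unknown" depending on the filesystem type of the mount point.
# With cgroups v2, the type is cgroup2fs; with v1, it is usually a tmpfs
# holding one mount per controller.
cgroup_version() {
  local fstype
  fstype="$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null)"
  case "$fstype" in
    cgroup2fs) echo v2 ;;
    tmpfs)     echo v1 ;;
    *)         echo unknown ;;
  esac
}

cgroup_version
```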
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Example of cgroup v1 hierarchy
|
|
|
|
The numbers are PIDs.
|
|
|
|
The names are the names of our nodes (arbitrarily chosen).
|
|
|
|
.small[
|
|
```bash
|
|
cpu memory
|
|
├── batch ├── stateless
|
|
│ ├── cryptoscam │ ├── 25
|
|
│ │ └── 52 │ ├── 26
|
|
│ └── ffmpeg │ ├── 27
|
|
│ ├── 109 │ ├── 52
|
|
│ └── 88 │ ├── 109
|
|
└── realtime │ └── 88
|
|
├── nginx └── databases
|
|
│ ├── 25 ├── 1008
|
|
│ ├── 26 └── 524
|
|
│ └── 27
|
|
├── postgres
|
|
│ └── 524
|
|
└── redis
|
|
└── 1008
|
|
```
|
|
]
|
|
|
|
---
|
|
|
|
## CPU cgroup
|
|
|
|
- Keeps track of CPU time used by a group of processes
|
|
|
|
(this is easier and more accurate than `getrusage` and `/proc`)
|
|
|
|
- Allows setting relative weights used by the scheduler
|
|
|
|
- Allows setting maximum time usage per time period
|
|
|
|
(e.g. "50ms every 100ms", which would cap the group to 50% of one CPU core)
|
|
|
|
- Allows setting reservations and caps ("utilization clamping")
|
|
|
|
(particularly relevant for realtime processes)
|
|
|
|
---
|
|
|
|
## Checking current CPU limits
|
|
|
|
- Getting the cgroup for the current user session:
|
|
```bash
|
|
cat /proc/$$/cgroup
|
|
```
|
|
(on cgroups v2, the output should look like `0::/user.slice/...`)
|
|
|
|
- Checking the current CPU limit:
|
|
```bash
|
|
cat /sys/fs/cgroup/user.slice/.../cpu.max
|
|
```
|
|
(it should look like `max 100000`)
|
|
|
|
- `max` means unlimited; `100000` means "over a period of 100000 microseconds"
|
|
|
|
(unless specified, all cgroup time durations are in microseconds)
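
The two steps above can be chained so that we don't have to type the path by hand. This is a sketch, assuming cgroups v2 mounted on `/sys/fs/cgroup`; the function and variable names are made up for this example:

```bash
# Extract the cgroup v2 path from the content of /proc/<pid>/cgroup (read on stdin).
# The v2 entry has the form "0::<path>".
cgroup_v2_path() {
  sed -n 's/^0:://p'
}

CG_PATH="$(cgroup_v2_path < /proc/self/cgroup)"
CPU_MAX="/sys/fs/cgroup${CG_PATH}/cpu.max"

# The file is absent in the root cgroup, or if the cpu controller isn't enabled there.
if [ -r "$CPU_MAX" ]; then
  cat "$CPU_MAX"
else
  echo "no cpu.max for cgroup '${CG_PATH}'"
fi
```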
|
|
|
|
---
|
|
|
|
## Setting a CPU limit
|
|
|
|
- Run `top` in a terminal to view CPU usage
|
|
|
|
- In a separate terminal, burn CPU cycles with e.g.:
|
|
```bash
|
|
while : ; do : ; done
|
|
```
|
|
|
|
- Set a 50% CPU limit for that user or session:
|
|
```bash
|
|
echo 50000 > /sys/fs/cgroup/user.slice/.../cpu.max
|
|
```
|
|
|
|
- Notice that CPU usage goes down
|
|
|
|
(probably to *less* than 50% since this is a limit for the whole user/session!)
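
To make the arithmetic explicit: the limit is the quota divided by the period. A small helper (its name is made up for this example) can convert the two fields of `cpu.max` into a percentage of one CPU core:

```bash
# Convert the content of cpu.max ("<quota> <period>") into a percentage of one core.
cpu_max_percent() {
  local quota="$1" period="$2"
  if [ "$quota" = "max" ]; then
    echo unlimited
  else
    echo "$(( quota * 100 / period ))%"
  fi
}

cpu_max_percent 50000 100000    # 50%
cpu_max_percent 200000 100000   # 200%, i.e. two full cores
cpu_max_percent max 100000      # unlimited
```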
|
|
|
|
---
|
|
|
|
## Removing the CPU limit
|
|
|
|
- Remember to remove the limit when you're done:
|
|
```bash
|
|
echo max > /sys/fs/cgroup/user.slice/.../cpu.max
|
|
```
|
|
|
|
---
|
|
|
|
## Cpuset cgroup
|
|
|
|
- Pin groups to specific CPU(s)
|
|
|
|
- Features:
|
|
|
|
- limit apps to specific CPUs (`cpuset.cpus`)
|
|
|
|
- reserve CPUs for exclusive use (`cpuset.cpus.exclusive`)
|
|
|
|
- assign apps to specific NUMA memory nodes (`cpuset.mems`)
|
|
|
|
- Use-cases:
|
|
|
|
- dedicate CPUs to avoid performance loss due to cache flushes
|
|
|
|
- improve memory performance in NUMA systems
|
|
|
|
---
|
|
|
|
## Cpuset concepts
|
|
|
|
- `cpuset.cpus` / `cpuset.mems`
|
|
|
|
*express what we allow the cgroup to use (can be empty to allow everything)*
|
|
|
|
- `cpuset.cpus.effective` / `cpuset.mems.effective`
|
|
|
|
*express what the cgroup can actually use after accounting for other restrictions*
|
|
|
|
- `cpuset.cpus.exclusive` / `cpuset.cpus.partition`
|
|
|
|
*used to create "partitions" = sets of CPU(s) exclusively reserved for a cgroup*
|
|
|
|
---
|
|
|
|
## Memory cgroup: accounting
|
|
|
|
- Keeps track of pages used by each group:
|
|
|
|
- file (read/write/mmap from block devices)
|
|
- anonymous (stack, heap, anonymous mmap)
|
|
- active (recently accessed)
|
|
- inactive (candidate for eviction)
|
|
- ...many other categories!
|
|
|
|
- Each page is "charged" to a single group
|
|
|
|
(this can result in non-deterministic "charges" for shared pages, e.g. mapped files)
|
|
|
|
- To view all the counters kept by this cgroup:
|
|
|
|
```bash
|
|
$ cat /sys/fs/cgroup/memory.stat
|
|
```
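
Since `memory.stat` is a list of `key value` lines, a single counter is easy to extract. This is a sketch; the function name is made up for this example, and the second argument lets us point it at the `memory.stat` of any cgroup:

```bash
# Print the value of one counter (e.g. anon, file, inactive_file) from a memory.stat file.
memory_stat() {
  local key="$1" file="${2:-/sys/fs/cgroup/memory.stat}"
  awk -v key="$key" '$1 == key { print $2 }' "$file"
}

# Show anonymous and file-backed memory for the whole system, if the file is available.
if [ -r /sys/fs/cgroup/memory.stat ]; then
  memory_stat anon
  memory_stat file
fi
```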
|
|
|
|
---
|
|
|
|
## Memory cgroup: limits and reservations
|
|
|
|
- Cgroups v1 allowed setting soft and hard limits
|
|
|
|
(soft limits influenced reclaim but it wasn't straightforward to use)
|
|
|
|
- Cgroups v2 are way more sophisticated:
|
|
|
|
- hard limits (`.max`)
|
|
|
|
- thresholds triggering more evictions (`.high`)
|
|
|
|
- thresholds triggering fewer evictions (`.low`)
|
|
|
|
- reservations (`.min`)
|
|
|
|
- Also limits for swap and zswap
|
|
|
|
---
|
|
|
|
## Hard limits
|
|
|
|
- A cgroup can *never* exceed its hard limits
|
|
|
|
- When a cgroup tries to use more than the hard limit:
|
|
|
|
- the kernel tries to reclaim memory (buffers, mapped files...)
|
|
|
|
- when there is nothing to reclaim, the OOM killer is invoked
|
|
|
|
- There is a `memory.oom.group` flag to alter OOM behavior:
|
|
|
|
- `0` (default) = kill processes one by one
|
|
|
|
- `1` = consider the cgroup as a unit; OOM will kill it entirely
|
|
|
|
---
|
|
|
|
## Also...
|
|
|
|
- A `.peak` value is also exposed for each tracked amount
|
|
|
|
(memory, swap, zswap)
|
|
|
|
- Write an amount to `memory.reclaim` to trigger reclaim
|
|
|
|
(=ask the kernel to recover memory from the cgroup)
|
|
|
|
- Check memory stats per NUMA node (`memory.numa_stat`)
|
|
|
|
- And more!
|
|
|
|
---
|
|
|
|
## Block I/O cgroup
|
|
|
|
- Keep track of I/Os for each group:
|
|
|
|
- per block device
|
|
|
|
- read, write, and discard
|
|
|
|
- in bytes and in operations
|
|
|
|
- Set hard limits for each counter
|
|
|
|
- Set relative weights and latency targets
|
|
|
|
---
|
|
|
|
## `io.max`
|
|
|
|
- Enforce hard limits
|
|
|
|
(set max number of operations, of bytes read/written...)
|
|
|
|
- Each limit is per-device
|
|
|
|
- Doesn't offer performance guarantees
|
|
|
|
(once a device is saturated, performance will degrade for everyone)
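
Each line written to `io.max` names a device (as `major:minor`) followed by one or more `key=value` limits. A tiny sketch to build such a line; the helper name, the device `8:0`, and the values are made up for this example:

```bash
# Build a line for io.max: "<major>:<minor> <key>=<value> ...".
# Keys are rbps / wbps (bytes per second) and riops / wiops (operations per second).
io_max_line() {
  local dev="$1"
  shift
  echo "$dev $*"
}

# Hypothetical example: cap writes on device 8:0 to 1 MiB/s and 100 operations per second.
io_max_line 8:0 wbps=1048576 wiops=100
```

To apply the limit, the resulting line would be written (as root) to the `io.max` file of the target cgroup.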
|
|
|
|
---
|
|
|
|
## `io.cost.qos`
|
|
|
|
- Try to offer latency guarantees
|
|
|
|
- Define per-device thresholds to throttle operations
|
|
|
|
"if the 95th percentile latency of read operations on this device
|
|
is above 100ms...
|
|
|
|
...throttle operations on this device (queue them)"
|
|
|
|
- Can also define `io.weight` for relative priorities between cgroups
|
|
|
|
- Check [this document](https://facebookmicrosites.github.io/resctl-demo-website/docs/demo_docs/setting_benchmarks/iocost/) for some details and hints
|
|
|
|
---
|
|
|
|
## Network I/O
|
|
|
|
- Cgroups v1 had net_cls and net_prio controllers
|
|
|
|
- These have been deprecated in cgroups v2:
|
|
|
|
*There is no direct equivalent of the net_cls and net_prio
|
|
controllers from cgroups version 1. Instead, support has been
|
|
added to iptables(8) to allow eBPF filters that hook on cgroup v2
|
|
pathnames to make decisions about network traffic on a per-cgroup
|
|
basis.*
|
|
|
|
---
|
|
|
|
## Pid
|
|
|
|
- Limit (and count) number of processes in a cgroup
|
|
|
|
- Protects against e.g. fork bombs
|
|
|
|
---
|
|
|
|
## Devices
|
|
|
|
- We need to limit access to device nodes
|
|
|
|
- Containers should not be able to open e.g. disks and partitions directly
|
|
|
|
(/dev/sda\*, /dev/nvme\*...)
|
|
|
|
- However, some devices are expected to be available at all times:
|
|
|
|
/dev/tty, /dev/zero, /dev/null, /dev/random...
|
|
|
|
---
|
|
|
|
## Cgroups v1
|
|
|
|
- There used to be a special "devices" control group
|
|
|
|
- It made it easy to grant read/write/mknod permissions
|
|
|
|
(individually for each device and each container)
|
|
|
|
- Access could be granted/revoked/viewed through a pseudo-file:
|
|
```bash
|
|
echo 'c 1:3 mr' > /sys/fs/cgroup/.../devices.allow
|
|
```
|
|
|
|
- This file doesn't exist anymore in cgroups v2!
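
In the rule above, `c 1:3 mr` means "character device, major 1, minor 3 (that's `/dev/null`), allow mknod and read". As a sketch, such a rule can be derived from an existing device node; this assumes GNU `stat`, and the function name is made up for this example:

```bash
# Print a cgroups v1 "devices" rule for an existing device node:
# "<type> <major>:<minor> <access>", where access combines r, w, and m (mknod).
# GNU stat prints the major (%t) and minor (%T) numbers in hexadecimal.
device_rule() {
  local node="$1" access="${2:-rwm}" type
  if [ -c "$node" ]; then
    type=c
  elif [ -b "$node" ]; then
    type=b
  else
    echo "not a device node: $node" >&2
    return 1
  fi
  printf '%s %d:%d %s\n' "$type" "0x$(stat -c %t "$node")" "0x$(stat -c %T "$node")" "$access"
}

device_rule /dev/null mr    # c 1:3 mr
```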
|
|
|
|
---
|
|
|
|
## Cgroups v2
|
|
|
|
- Device access is controlled with eBPF programs
|
|
|
|
(there is a special program type, [`cgroup_device`][bpf-cgroup-device], for that purpose)
|
|
|
|
- This requires writing and compiling eBPF programs (😰)
|
|
|
|
- Viewing permissions requires disassembling eBPF programs (😱)
|
|
|
|
[bpf-cgroup-device]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_DEVICE/
|
|
|
|
---
|
|
|
|
## Viewing eBPF programs
|
|
|
|
- Install bpf tools (package name `bpftool` or `bpf`)
|
|
|
|
- View all eBPF programs attached to cgroups:
|
|
```bash
|
|
sudo bpftool cgroup tree
|
|
```
|
|
|
|
- View eBPF programs attached to a Docker container:
|
|
```bash
|
|
sudo bpftool cgroup list /sys/fs/cgroup/system.slice/docker-<CONTAINER_ID>.scope
|
|
```
|
|
|
|
- Disassemble an eBPF program:
|
|
```bash
|
|
sudo bpftool prog dump xlated id <ID>
|
|
```
|
|
|
|
- *Bonne chance* 😬
|
|
|
|
---
|
|
|
|
## Some interesting nodes
|
|
|
|
- `/dev/net/tun` (network interface manipulation)
|
|
|
|
- `/dev/fuse` (filesystems in user space)
|
|
|
|
- `/dev/kvm` (run VMs in containers)
|
|
|
|
- `/dev/dri` (GPU)
|
|
|
|
- `/dev/ttyUSB*`, `/dev/ttyACM*` (serial devices)
|
|
|
|
- `/dev/snd/*` (sound cards)
|
|
|
|
---
|
|
|
|
## And the exotic ones...
|
|
|
|
- `rdma`: remote memory access, infiniband
|
|
|
|
- `dmem`: device memory (VRAM), relatively new
|
|
|
|
(kernel 6.14, January 2025; only Intel and AMD GPU for now)
|
|
|
|
- `hugetlb`: huge pages
|
|
|
|
- `perf_event`: [performance profiling](https://perfwiki.github.io/main/)
|
|
|
|
- `misc`: generic cgroup for other discrete resources
|
|
|
|
(extension point to plug even more exotic resources)
|
|
|
|
---
|
|
|
|
# Namespaces
|
|
|
|
- Provide processes with their own view of the system
|
|
|
|
- Namespaces limit what you can see (and therefore, what you can use)
|
|
|
|
- These namespaces are available in modern kernels:
|
|
|
|
- pid
|
|
- net
|
|
- mnt
|
|
- uts
|
|
- ipc
|
|
- user
|
|
- time
|
|
- cgroup
|
|
|
|
(we are going to detail them individually)
|
|
|
|
- Each process belongs to one namespace of each type
|
|
|
|
---
|
|
|
|
## Namespaces are always active
|
|
|
|
- Namespaces exist even when you don't use containers
|
|
|
|
- This is a bit similar to the UID field in UNIX processes:
|
|
|
|
- all processes have the UID field, even if no user exists on the system
|
|
|
|
- the field always has a value / the value is always defined
|
|
<br/>
|
|
(i.e. any process running on the system has some UID)
|
|
|
|
- the value of the UID field is used when checking permissions
|
|
<br/>
|
|
(the UID field determines which resources the process can access)
|
|
|
|
- You can replace "UID field" with "namespace" above and it still works!
|
|
|
|
- In other words: even when you don't use containers,
|
|
<br/>there is one namespace of each type, containing all the processes on the system
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Manipulating namespaces
|
|
|
|
- Namespaces are created with two methods:
|
|
|
|
- the `clone()` system call (used when creating new threads and processes)
|
|
|
|
- the `unshare()` system call
|
|
|
|
- The Linux tool `unshare` allows doing that from a shell
|
|
|
|
- A new process can re-use none / all / some of the namespaces of its parent
|
|
|
|
- It is possible to "enter" a namespace with the `setns()` system call
|
|
|
|
- The Linux tool `nsenter` allows doing that from a shell
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Namespaces lifecycle
|
|
|
|
- When the last process of a namespace exits, the namespace is destroyed
|
|
|
|
- All the associated resources are then removed
|
|
|
|
- Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`.
|
|
|
|
```bash
|
|
ls -l /proc/self/ns
|
|
```
|
|
|
|
- It is possible to compare namespaces by checking these files
|
|
|
|
(this helps to answer the question, "are these two processes in the same namespace?")
|
|
|
|
- It is possible to preserve a namespace by bind-mounting its pseudo-file
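
Each pseudo-file is a symbolic link whose target looks like `uts:[4026531838]`; two processes are in the same namespace when the targets are identical. A sketch of that comparison (the function name is made up for this example):

```bash
# Succeed if two processes are in the same namespace of the given type
# (uts, net, mnt, pid, ipc, user, time, cgroup).
same_ns() {
  local a b
  a="$(readlink "/proc/$1/ns/$3")" || return 2
  b="$(readlink "/proc/$2/ns/$3")" || return 2
  [ "$a" = "$b" ]
}

# Reading the namespaces of processes owned by other users may require root.
if same_ns $$ $PPID uts; then
  echo "this shell and its parent share a UTS namespace"
fi
```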
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Namespaces can be used independently
|
|
|
|
- As mentioned in the previous slides:
|
|
|
|
*a new process can re-use none / all / some of the namespaces of its parent*
|
|
|
|
- It's possible to create e.g.:
|
|
|
|
- mount namespaces to have "private" `/tmp` for each user / app
|
|
|
|
- network namespaces to isolate apps or give them a special network access
|
|
|
|
- It's possible to use namespaces without cgroups
|
|
|
|
(and totally outside of container contexts)
|
|
|
|
---
|
|
|
|
## UTS namespace
|
|
|
|
- gethostname / sethostname
|
|
|
|
- Allows setting a custom hostname for a container
|
|
|
|
- That's (mostly) it!
|
|
|
|
- Also allows setting the NIS domain
|
|
|
|
(if you don't know what a NIS domain is, you don't have to worry about it!)
|
|
|
|
- If you're wondering: UTS = UNIX time sharing
|
|
|
|
- This namespace was named like this because of the `struct utsname`,
|
|
<br/>
|
|
which is commonly used to obtain the machine's hostname, architecture, etc.
|
|
|
|
(the more you know!)
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Creating our first namespace
|
|
|
|
Let's use `unshare` to create a new process that will have its own UTS namespace:
|
|
|
|
```bash
|
|
$ sudo unshare --uts
|
|
```
|
|
|
|
- We have to use `sudo` for most `unshare` operations
|
|
|
|
- We indicate that we want a new uts namespace, and nothing else
|
|
|
|
- If we don't specify a program to run, a `$SHELL` is started
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Demonstrating our uts namespace
|
|
|
|
In our new "container", check the hostname, change it, and check it:
|
|
|
|
```bash
|
|
# hostname
|
|
nodeX
|
|
# hostname tupperware
|
|
# hostname
|
|
tupperware
|
|
```
|
|
|
|
In another shell, check that the machine's hostname hasn't changed:
|
|
|
|
```bash
|
|
$ hostname
|
|
nodeX
|
|
```
|
|
|
|
Exit the "container" with `exit` or `Ctrl-D`.
|
|
|
|
---
|
|
|
|
## Net namespace overview
|
|
|
|
- Each network namespace has its own private network stack
|
|
|
|
- The network stack includes:
|
|
|
|
- network interfaces (including `lo`)
|
|
|
|
- routing table**s** (as in `ip rule` etc.)
|
|
|
|
- iptables chains and rules
|
|
|
|
- sockets (as seen by `ss`, `netstat`)
|
|
|
|
- You can move a network interface from a network namespace to another:
|
|
```bash
|
|
ip link set dev eth0 netns PID
|
|
```
|
|
|
|
---
|
|
|
|
## Net namespace typical use
|
|
|
|
- Each container is given its own network namespace
|
|
|
|
- For each network namespace (i.e. each container), a `veth` pair is created
|
|
|
|
(two `veth` interfaces act as if they were connected with a cross-over cable)
|
|
|
|
- One `veth` is moved to the container network namespace (and renamed `eth0`)
|
|
|
|
- The other `veth` is moved to a bridge on the host (e.g. the `docker0` bridge)
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Creating a network namespace
|
|
|
|
Start a new process with its own network namespace:
|
|
|
|
```bash
|
|
$ sudo unshare --net
|
|
```
|
|
|
|
See that this new network namespace is unconfigured:
|
|
|
|
```bash
|
|
# ping 1.1
|
|
connect: Network is unreachable
|
|
# ifconfig
|
|
# ip link ls
|
|
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
|
|
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
|
```
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Creating the `veth` interfaces
|
|
|
|
In another shell (on the host), create a `veth` pair:
|
|
|
|
```bash
|
|
$ sudo ip link add name in_host type veth peer name in_netns
|
|
```
|
|
|
|
Configure the host side (`in_host`):
|
|
|
|
```bash
|
|
$ sudo ip link set in_host up
|
|
$ sudo ip addr add 172.22.0.1/24 dev in_host
|
|
```
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Moving the `veth` interface
|
|
|
|
*In the process created by `unshare`,* check the PID of our "network container":
|
|
|
|
```bash
|
|
# echo $$
|
|
533
|
|
```
|
|
|
|
*On the host*, move the other side (`in_netns`) to the network namespace:
|
|
|
|
```bash
|
|
$ sudo ip link set in_netns netns 533
|
|
```
|
|
|
|
(Make sure to update "533" with the actual PID obtained above!)
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Basic network configuration
|
|
|
|
Let's set up `lo` (the loopback interface):
|
|
|
|
```bash
|
|
# ip link set lo up
|
|
```
|
|
|
|
Activate the `veth` interface and rename it to `eth0`:
|
|
|
|
```bash
|
|
# ip link set in_netns name eth0 up
|
|
```
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Allocating IP address and default route
|
|
|
|
*In the process created by `unshare`,* configure the interface:
|
|
|
|
```bash
|
|
# ip addr add 172.22.0.2/24 dev eth0
|
|
# ip route add default via 172.22.0.1
|
|
```
|
|
|
|
(Make sure to update the IP addresses if necessary.)
|
|
|
|
Check that we can ping the host:
|
|
|
|
```bash
|
|
# ping 172.22.0.1
|
|
```
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Reaching the outside world
|
|
|
|
This requires:
|
|
|
|
- enable forwarding on the host
|
|
|
|
- add a masquerading (SNAT) rule for traffic coming from the namespace
|
|
|
|
If Docker is running on the host, we can also add the `in_host` interface
|
|
to the Docker bridge, and configure the `in_netns` interface with an
|
|
IP address belonging to the subnet of the Docker bridge!
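
The two requirements listed above could look like the following sketch. It assumes the `172.22.0.0/24` subnet and the `in_host` interface used in the previous slides, and it only prints the commands unless `DRY_RUN=0` is set (the `run` and `enable_outbound` helpers are made up for this example):

```bash
# Print the commands by default; set DRY_RUN=0 (and run as root) to execute them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

enable_outbound() {
  local subnet="$1" host_if="$2"
  # Let the host route packets between interfaces.
  run sysctl -w net.ipv4.ip_forward=1
  # Rewrite the source address of traffic coming from the namespace's subnet
  # and leaving through any interface other than the veth itself.
  run iptables -t nat -A POSTROUTING -s "$subnet" ! -o "$host_if" -j MASQUERADE
}

enable_outbound 172.22.0.0/24 in_host
```

Depending on the host firewall, the `FORWARD` chain may also need to accept this traffic (for instance, when its default policy is `DROP`).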
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Cleaning up network namespaces
|
|
|
|
- Terminate the process created by `unshare` (with `exit` or `Ctrl-D`).
|
|
|
|
- Since this was the only process in the network namespace, it is destroyed.
|
|
|
|
- All the interfaces in the network namespace are destroyed.
|
|
|
|
- When a `veth` interface is destroyed, it also destroys the other half of the pair.
|
|
|
|
- So we don't have anything else to do to clean up!
|
|
|
|
---
|
|
|
|
## Docker options leveraging network namespaces
|
|
|
|
- `--net none` gives an empty network namespace to a container
|
|
|
|
(effectively isolating it completely from the network)
|
|
|
|
- `--net host` means "do not containerize the network"
|
|
|
|
(no network namespace is created; the container uses the host network stack)
|
|
|
|
- `--net container:<name_or_id>` means "reuse the network namespace of another container"
|
|
|
|
(as a result, both containers share the same interfaces, routes, etc.)
|
|
|
|
---
|
|
|
|
## Mnt namespace
|
|
|
|
- Processes can have their own root fs (à la chroot)
|
|
|
|
- Processes can also have "private" mounts; this allows:
|
|
|
|
- isolating `/tmp` (per user, per service...)
|
|
|
|
- masking `/proc`, `/sys` (for processes that don't need them)
|
|
|
|
- mounting remote filesystems or sensitive data,
|
|
<br/>while making them visible only to allowed processes
|
|
|
|
- Mounts can be totally private, or shared
|
|
|
|
- For a long time, there was no easy way to "move" a mount to another namespace
|
|
|
|
- It's now possible; see [justincormack/addmount](https://github.com/justincormack/addmount) for a simple example
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## Setting up a private `/tmp`
|
|
|
|
Create a new mount namespace:
|
|
|
|
```bash
|
|
$ sudo unshare --mount
|
|
```
|
|
|
|
In that new namespace, mount a brand new `/tmp`:
|
|
|
|
```bash
|
|
# mount -t tmpfs none /tmp
|
|
```
|
|
|
|
Check the content of `/tmp` in the new namespace, and compare to the host.
|
|
|
|
The mount is automatically cleaned up when you exit the process.
|
|
|
|
---
|
|
|
|
## PID namespace
|
|
|
|
- Processes within a PID namespace only "see" processes
|
|
in the same PID namespace
|
|
|
|
- Each PID namespace has its own numbering (starting at 1)
|
|
|
|
- When PID 1 goes away, the whole namespace is killed
|
|
|
|
(when PID 1 goes away on a normal UNIX system, the kernel panics!)
|
|
|
|
- Those namespaces can be nested
|
|
|
|
- A process ends up having multiple PIDs (one per namespace in which it is nested)
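
These multiple PIDs are visible in the `NSpid` field of `/proc/<pid>/status` (Linux 4.1 and later), which lists the PID of the process in each nested namespace, outermost first. A sketch to extract it (the function name is made up for this example):

```bash
# Read the content of /proc/<pid>/status on stdin and print the NSpid values,
# i.e. the PID of the process in each nested PID namespace, outermost first.
ns_pids() {
  awk '/^NSpid:/ { $1 = ""; sub(/^ +/, ""); print }'
}

# For a process that isn't in a nested PID namespace, this prints a single PID.
if [ -r "/proc/$$/status" ]; then
  ns_pids < "/proc/$$/status"
fi
```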
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## PID namespace in action
|
|
|
|
Create a new PID namespace:
|
|
|
|
```bash
|
|
$ sudo unshare --pid --fork
|
|
```
|
|
|
|
(We need the `--fork` flag because the PID namespace is special.)
|
|
|
|
Check the process tree in the new namespace:
|
|
|
|
```bash
|
|
# ps faux
|
|
```
|
|
|
|
--
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
🤔 Why do we see all the processes?!?
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## PID namespaces and `/proc`
|
|
|
|
- Tools like `ps` rely on the `/proc` pseudo-filesystem
|
|
|
|
- Our new namespace still has access to the original `/proc`
|
|
|
|
- Therefore, it still sees host processes
|
|
|
|
- But it cannot affect them
|
|
|
|
(try to `kill` a process: you will get `No such process`)
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## PID namespaces, take 2
|
|
|
|
- This can be solved by mounting `/proc` in the namespace
|
|
|
|
- The `unshare` utility provides a convenience flag, `--mount-proc`
|
|
|
|
- This flag will mount `/proc` in the namespace
|
|
|
|
- It will also unshare the mount namespace, so that this mount is local
|
|
|
|
Try it:
|
|
|
|
```bash
|
|
$ sudo unshare --pid --fork --mount-proc
|
|
# ps faux
|
|
```
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## OK, really, why do we need `--fork`?
|
|
|
|
*It is not necessary to remember all these details.
|
|
<br/>
|
|
This is just an illustration of the complexity of namespaces!*
|
|
|
|
The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary.
|
|
<br/>
|
|
A process calling `unshare` to create new namespaces is moved to the new namespaces...
|
|
<br/>
|
|
... Except for the PID namespace.
|
|
<br/>
|
|
(Because this would change the current PID of the process from X to 1.)
|
|
|
|
The processes created by the new binary are placed into the new PID namespace.
|
|
<br/>
|
|
The first one will be PID 1.
|
|
<br/>
|
|
If PID 1 exits, it is not possible to create additional processes in the namespace.
|
|
<br/>
|
|
(Attempting to do so will result in `ENOMEM`.)
|
|
|
|
Without the `--fork` flag, the first command that we execute will be PID 1 ...
|
|
<br/>
|
|
... And once it exits, we cannot create more processes in the namespace!
|
|
|
|
Check `man 2 unshare` and `man pid_namespaces` if you want more details.
|
|
|
|
---
|
|
|
|
## IPC namespace
|
|
|
|
--
|
|
|
|
- Does anybody know about IPC?
|
|
|
|
--
|
|
|
|
- Does anybody *care* about IPC?
|
|
|
|
--
|
|
|
|
- Allows a process (or group of processes) to have its own:
|
|
|
|
- IPC semaphores
|
|
- IPC message queues
|
|
- IPC shared memory
|
|
|
|
... without risk of conflict with other instances.
|
|
|
|
- Older versions of PostgreSQL cared about this.
|
|
|
|
*No demo for that one.*
|
|
|
|
---
|
|
|
|
## User namespace
|
|
|
|
- Allows mapping UID/GID; e.g.:
|
|
|
|
- UID 0→1999 in container C1 is mapped to UID 10000→11999 on host
|
|
- UID 0→1999 in container C2 is mapped to UID 12000→13999 on host
|
|
- etc.
|
|
|
|
- UID 0 in the container can still perform privileged operations in the container
|
|
|
|
(for instance: setting up network interfaces)
|
|
|
|
- But outside of the container, it is a non-privileged user
|
|
|
|
- It also means that the UID in containers becomes unimportant
|
|
|
|
(just use UID 0 in the container, since it gets squashed to a non-privileged user outside)
|
|
|
|
- Ultimately enables better privilege separation in container engines
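
The mapping itself is simple arithmetic. Each line of `/proc/<pid>/uid_map` has three fields: first UID inside the namespace, first UID outside, and number of UIDs in the range. A sketch reproducing the example above (the function name is made up for this example):

```bash
# Translate a UID inside a user namespace to the UID seen on the host,
# given one mapping in uid_map format: "<first inside> <first outside> <count>".
map_uid() {
  local uid="$1" inside="$2" outside="$3" count="$4"
  if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $(( inside + count )) ]; then
    echo $(( outside + uid - inside ))
  else
    echo unmapped
  fi
}

map_uid 0    0 10000 2000   # 10000 (root in container C1)
map_uid 1999 0 10000 2000   # 11999
map_uid 0    0 12000 2000   # 12000 (root in container C2)
map_uid 2000 0 10000 2000   # unmapped
```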
|
|
|
|
---
|
|
|
|
class: extra-details, deep-dive
|
|
|
|
## User namespace challenges
|
|
|
|
- UIDs need to be mapped when passed between processes or kernel subsystems
|
|
|
|
- Filesystem permissions and file ownership are more complicated
|
|
|
|
.small[(e.g. when the same root filesystem is shared by multiple containers
|
|
running with different UIDs)]
|
|
|
|
- With the Docker Engine:
|
|
|
|
- some feature combinations are not allowed
|
|
<br/>
|
|
(e.g. user namespace + host network namespace sharing)
|
|
|
|
- user namespaces need to be enabled/disabled globally
|
|
<br/>
|
|
(when the daemon is started)
|
|
|
|
- container images are stored separately
|
|
<br/>
|
|
(so the first time you toggle user namespaces, you need to re-pull images)
|
|
|
|
*No demo for that one.*
|
|
|
|
---
|
|
|
|
## Time namespace
|
|
|
|
- Virtualize time
|
|
|
|
- Expose a slower/faster clock to some processes
|
|
|
|
(for e.g. simulation purposes)
|
|
|
|
- Expose a clock offset to some processes
|
|
|
|
(simulation, suspend/restore...)
|
|
|
|
---
|
|
|
|
## Cgroup namespace
|
|
|
|
- Virtualize access to `/proc/<PID>/cgroup`
|
|
|
|
- Lets containerized processes view their relative cgroup tree
|
|
|
|
---
|
|
|
|
# Security features
|
|
|
|
- Namespaces and cgroups are not enough to ensure strong security
|
|
|
|
- We need extra mechanisms: capabilities, seccomp, LSMs
|
|
|
|
- These mechanisms were already used before containers to harden security
|
|
|
|
- They can be used together with containers
|
|
|
|
- Good container engines will automatically leverage these features.
|
|
|
|
(so that you don't have to worry about it)
|
|
|
|
---
|
|
|
|
## Capabilities
|
|
|
|
- In traditional UNIX, many operations are possible if and only if UID=0 (root)
|
|
|
|
- Some of these operations are very powerful:
|
|
|
|
- changing file ownership, accessing all files ...
|
|
|
|
- Some of these operations deal with system configuration, but can be abused:
|
|
|
|
- setting up network interfaces, mounting filesystems ...
|
|
|
|
- Some of these operations are not very dangerous but are needed by servers:
|
|
|
|
- binding to a port below 1024.
|
|
|
|
- Capabilities are per-process flags to allow these operations individually
|
|
|
|
---
|
|
|
|
## Some capabilities
|
|
|
|
- `CAP_CHOWN`: arbitrarily change file ownership (UID and GID)
|
|
|
|
- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions
|
|
|
|
- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.
|
|
|
|
- `CAP_NET_BIND_SERVICE`: bind a port below 1024
|
|
|
|
See `man capabilities` for the full list and details
|
|
|
|
---
|
|
|
|
## Using capabilities
|
|
|
|
- Container engines will typically drop all "dangerous" capabilities
|
|
|
|
- You can then re-enable capabilities on a per-container basis, as needed
|
|
|
|
- With the Docker engine: `docker run --cap-add ...`
|
|
|
|
- From the shell:
|
|
|
|
`capsh --drop=cap_net_admin --`
|
|
|
|
`capsh --drop=all --`
|
|
|
|
---
|
|
|
|
## File capabilities
|
|
|
|
- It is also possible to give capabilities to executable files
|
|
|
|
- This is comparable to the SUID bit, but with finer grain
|
|
|
|
(e.g., `setcap cap_net_raw+ep /bin/ping`)
|
|
|
|
- There are differences between *permitted* and *inheritable* capabilities...
|
|
|
|
🤔
|
|
|
|
---
|
|
|
|
class: extra-details
|
|
|
|
## Capability sets
|
|
|
|
- Permitted set (=what a process could use, provided the file has the cap)
|
|
|
|
- Effective set (=what a process can actually use)
|
|
|
|
- Inheritable set (=capabilities preserved across execve calls)
|
|
|
|
- Bounding set (=system-wide limit over what can be acquired through execve / capset)
|
|
|
|
- Ambient set (=capabilities retained across execve for non-privileged users)
|
|
|
|
- Files can have *permitted*, *effective*, *inheritable* capability sets
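
The sets of a process are exposed as hexadecimal bit masks in `/proc/<pid>/status` (fields `CapPrm`, `CapEff`, `CapInh`, `CapBnd`, `CapAmb`). Here is a sketch that tests one bit of the effective set; it requires bash, and the function name is made up for this example:

```bash
# Succeed if capability number $2 is set in the hexadecimal mask $1.
has_cap() {
  (( (16#$1 >> $2) & 1 ))
}

# Capability numbers, as defined in <linux/capability.h>.
CAP_NET_BIND_SERVICE=10
CAP_NET_ADMIN=12

# Effective set of the current shell (falls back to an empty set if unavailable).
mask="$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status" 2>/dev/null)"
mask="${mask:-0}"

if has_cap "$mask" "$CAP_NET_ADMIN"; then
  echo "CAP_NET_ADMIN is in the effective set"
else
  echo "CAP_NET_ADMIN is not in the effective set"
fi
```

Running this as a regular user, as root, and inside a container should give different results.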
|
|
|
|
---
|
|
|
|
## More about capabilities
|
|
|
|
- Capabilities manpage:
|
|
|
|
https://man7.org/linux/man-pages/man7/capabilities.7.html
|
|
|
|
- Subtleties about `capsh`:
|
|
|
|
https://sites.google.com/site/fullycapable/why-didnt-that-work
|
|
|
|
---
|
|
|
|
## Seccomp
|
|
|
|
- Seccomp stands for "secure computing".
|
|
|
|
- It achieves a high level of security by drastically restricting the available syscalls.
|
|
|
|
- Original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()`.
|
|
|
|
- The seccomp-bpf extension allows specifying custom filters with BPF rules.
|
|
|
|
- This allows filtering by syscall, and by parameter.
|
|
|
|
- BPF code can perform arbitrarily complex checks, quickly, and safely.
|
|
|
|
- Container engines take care of this so you don't have to.
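
To see whether a process runs under seccomp, check the `Seccomp:` field of `/proc/<pid>/status`. A sketch (the function name is made up for this example):

```bash
# Translate the "Seccomp:" field of /proc/<pid>/status into a name.
seccomp_mode() {
  case "$1" in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown ;;
  esac
}

mode="$(awk '/^Seccomp:/ { print $2 }' "/proc/$$/status" 2>/dev/null)"
echo "seccomp mode of this shell: $(seccomp_mode "$mode")"
```

In a container started by an engine that applies a default seccomp profile, we would expect to see `filter`.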
|
|
|
|
---
|
|
|
|
## Linux Security Modules
|
|
|
|
- The most popular ones are SELinux and AppArmor.
|
|
|
|
- Red Hat distros generally use SELinux.
|
|
|
|
- Debian-based distros (in particular, Ubuntu) generally use AppArmor.
|
|
|
|
- LSMs add a layer of access control to all process operations.
|
|
|
|
- Container engines take care of this so you don't have to.
|
|
|
|
???
|
|
|
|
:EN:Containers internals
|
|
:EN:- Control groups (cgroups)
|
|
:EN:- Linux kernel namespaces
|
|
:FR:Fonctionnement interne des conteneurs
|
|
:FR:- Les "control groups" (cgroups)
|
|
:FR:- Les namespaces du noyau Linux
|