# Deep dive into container internals

In this chapter, we will explain some of the fundamental building blocks of containers.

This will give you a solid foundation so you can:

- understand "what's going on" in complex situations,
- anticipate the behavior of containers (performance, security...) in new scenarios,
- implement your own container engine.

The last item should be done for educational purposes only!

---

## There is no container code in the Linux kernel

- If we search for "container" in the Linux kernel code, we find:

  - generic code to manipulate data structures (like linked lists, etc.),
  - unrelated concepts like "ACPI containers",
  - *nothing* relevant to "our" containers!

- Containers are composed using multiple independent features.

- On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."

- Security also requires features like capabilities, seccomp, LSMs...

---

# Namespaces

- Namespaces provide processes with their own view of the system.

- Namespaces limit what you can see (and therefore, what you can use).

- These namespaces are available in modern kernels:

  - pid
  - net
  - mnt
  - uts
  - ipc
  - user

  (We are going to detail them individually.)

- Each process belongs to one namespace of each type.

---

## Namespaces are always active

- Namespaces exist even when you don't use containers.

- This is a bit similar to the UID field in UNIX processes:

  - all processes have the UID field, even if no user exists on the system

  - the field always has a value / the value is always defined

    (i.e. any process running on the system has some UID)

  - the value of the UID field is used when checking permissions

    (the UID field determines which resources the process can access)

- You can replace "UID field" with "namespace" above and it still works!

- In other words: even when you don't use containers, there is one namespace of each type, containing all the processes on the system.

---

class: extra-details, deep-dive

## Manipulating namespaces

- Namespaces are created with two methods:

  - the `clone()` system call (used when creating new threads and processes),
  - the `unshare()` system call.

- The Linux tool `unshare` lets us do that from a shell.

- A new process can re-use none / all / some of the namespaces of its parent.

- It is possible to "enter" a namespace with the `setns()` system call.

- The Linux tool `nsenter` lets us do that from a shell.

---

class: extra-details, deep-dive

## Namespaces lifecycle

- When the last process of a namespace exits, the namespace is destroyed.

- All the associated resources are then removed.

- Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`.

  ```bash
  ls -l /proc/self/ns
  ```

- It is possible to compare namespaces by checking these files.

  (This helps to answer the question, "are these two processes in the same namespace?")

- It is possible to preserve a namespace by bind-mounting its pseudo-file.

---

class: extra-details, deep-dive

## Namespaces can be used independently

- As mentioned in the previous slides:

  *A new process can re-use none / all / some of the namespaces of its parent.*

- We are going to use that property in the examples in the next slides.

- We are going to present each type of namespace.

- For each type, we will provide an example using only that namespace.

---

## UTS namespace

- gethostname / sethostname

- Allows setting a custom hostname for a container.

- That's (mostly) it!

- Also allows setting the NIS domain.

  (If you don't know what a NIS domain is, you don't have to worry about it!)

- If you're wondering: UTS = UNIX time sharing.

- This namespace was named like this because of the `struct utsname`, which is commonly used to obtain the machine's hostname, architecture, etc.

  (The more you know!)

---

class: extra-details, deep-dive

## Creating our first namespace

Let's use `unshare` to create a new process that will have its own UTS namespace:

```bash
$ sudo unshare --uts
```

- We have to use `sudo` for most `unshare` operations.

- We indicate that we want a new uts namespace, and nothing else.

- If we don't specify a program to run, a `$SHELL` is started.

---

class: extra-details, deep-dive

## Demonstrating our uts namespace

In our new "container", check the hostname, change it, and check it:

```bash
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
```

In another shell, check that the machine's hostname hasn't changed:

```bash
$ hostname
nodeX
```

Exit the "container" with `exit` or `Ctrl-D`.

---

## Net namespace overview

- Each network namespace has its own private network stack.

- The network stack includes:

  - network interfaces (including `lo`),
  - routing table**s** (as in `ip rule` etc.),
  - iptables chains and rules,
  - sockets (as seen by `ss`, `netstat`).

- You can move a network interface from a network namespace to another:

  ```bash
  ip link set dev eth0 netns PID
  ```

---

## Net namespace typical use

- Each container is given its own network namespace.

- For each network namespace (i.e. each container), a `veth` pair is created.

  (Two `veth` interfaces act as if they were connected with a cross-over cable.)

- One `veth` is moved to the container network namespace (and renamed `eth0`).

- The other `veth` is moved to a bridge on the host (e.g. the `docker0` bridge).
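---

class: extra-details, deep-dive

## Checking namespace membership from a shell

The comparison trick mentioned earlier ("are these two processes in the same namespace?") can be sketched with a few commands. This is an illustrative sketch, not part of the official demos; it only inspects our own shell and a child process, and needs no privileges:

```shell
#!/bin/sh
# Each entry in /proc/<pid>/ns is a symlink whose target names the
# namespace type and its inode number, e.g. "uts:[4026531838]".
mine=$(readlink /proc/$$/ns/uts)

# A child created without unshare() inherits our namespaces,
# so its link resolves to the same inode.
child=$(sh -c 'readlink /proc/$$/ns/uts')

echo "parent: $mine"
echo "child:  $child"
[ "$mine" = "$child" ] && echo "same uts namespace"
```

Two processes are in the same namespace exactly when these links resolve to the same inode; running the child under `sudo unshare --uts` instead would make its link differ.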
---

class: extra-details

## Creating a network namespace

Start a new process with its own network namespace:

```bash
$ sudo unshare --net
```

See that this new network namespace is unconfigured:

```bash
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
```

---

class: extra-details

## Creating the `veth` interfaces

In another shell (on the host), create a `veth` pair:

```bash
$ sudo ip link add name in_host type veth peer name in_netns
```

Configure the host side (`in_host`):

```bash
$ sudo ip link set in_host master docker0 up
```

---

class: extra-details

## Moving the `veth` interface

*In the process created by `unshare`,* check the PID of our "network container":

```bash
# echo $$
533
```

*On the host*, move the other side (`in_netns`) to the network namespace:

```bash
$ sudo ip link set in_netns netns 533
```

(Make sure to update "533" with the actual PID obtained above!)

---

class: extra-details

## Basic network configuration

Let's set up `lo` (the loopback interface):

```bash
# ip link set lo up
```

Activate the `veth` interface and rename it to `eth0`:

```bash
# ip link set in_netns name eth0 up
```

---

class: extra-details

## Allocating IP address and default route

*On the host*, check the address of the Docker bridge:

```bash
$ ip addr ls dev docker0
```

(It could be something like `172.17.0.1`.)

Pick an IP address in the middle of the same subnet, e.g. `172.17.0.99`.

*In the process created by `unshare`,* configure the interface:

```bash
# ip addr add 172.17.0.99/24 dev eth0
# ip route add default via 172.17.0.1
```

(Make sure to update the IP addresses if necessary.)

---

class: extra-details

## Validating the setup

Check that we now have connectivity:

```bash
# ping 1.1
```

Note: we were able to take a shortcut, because Docker is running, and provides us with a `docker0` bridge and a valid `iptables` setup.
If Docker is not running, you will need to take care of this!

---

class: extra-details

## Cleaning up network namespaces

- Terminate the process created by `unshare` (with `exit` or `Ctrl-D`).

- Since this was the only process in the network namespace, it is destroyed.

- All the interfaces in the network namespace are destroyed.

- When a `veth` interface is destroyed, it also destroys the other half of the pair.

- So we don't have anything else to do to clean up!

---

## Other ways to use network namespaces

- `--net none` gives an empty network namespace to a container.

  (Effectively isolating it completely from the network.)

- `--net host` means "do not containerize the network".

  (No network namespace is created; the container uses the host network stack.)

- `--net container` means "reuse the network namespace of another container".

  (As a result, both containers share the same interfaces, routes, etc.)

---

## Mnt namespace

- Processes can have their own root fs (à la chroot).

- Processes can also have "private" mounts. This allows:

  - isolating `/tmp` (per user, per service...)
  - masking `/proc`, `/sys` (for processes that don't need them)
  - mounting remote filesystems or sensitive data, but making them visible only to allowed processes

- Mounts can be totally private, or shared.

- At this point, there is no easy way to pass along a mount from one namespace to another.

---

class: extra-details, deep-dive

## Setting up a private `/tmp`

Create a new mount namespace:

```bash
$ sudo unshare --mount
```

In that new namespace, mount a brand new `/tmp`:

```bash
# mount -t tmpfs none /tmp
```

Check the content of `/tmp` in the new namespace, and compare it to the host's.

The mount is automatically cleaned up when you exit the process.

---

## PID namespace

- Processes within a PID namespace only "see" processes in the same PID namespace.

- Each PID namespace has its own numbering (starting at 1).

- When PID 1 goes away, the whole namespace is killed.

  (When PID 1 goes away on a normal UNIX system, the kernel panics!)

- Those namespaces can be nested.

- A process ends up having multiple PIDs (one per namespace in which it is nested).

---

class: extra-details, deep-dive

## PID namespace in action

Create a new PID namespace:

```bash
$ sudo unshare --pid --fork
```

(We need the `--fork` flag because the PID namespace is special.)

Check the process tree in the new namespace:

```bash
# ps faux
```

--

class: extra-details, deep-dive

🤔 Why do we see all the processes?!?

---

class: extra-details, deep-dive

## PID namespaces and `/proc`

- Tools like `ps` rely on the `/proc` pseudo-filesystem.

- Our new namespace still has access to the original `/proc`.

- Therefore, it still sees host processes.

- But it cannot affect them.

  (Try to `kill` a process: you will get `No such process`.)

---

class: extra-details, deep-dive

## PID namespaces, take 2

- This can be solved by mounting `/proc` in the namespace.

- The `unshare` utility provides a convenience flag, `--mount-proc`.

- This flag will mount `/proc` in the namespace.

- It will also unshare the mount namespace, so that this mount is local.

Try it:

```bash
$ sudo unshare --pid --fork --mount-proc
# ps faux
```

---

class: extra-details

## OK, really, why do we need `--fork`?

*It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!* The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary.
A process calling `unshare` to create new namespaces is moved to the new namespaces...
... Except for the PID namespace.
(Because this would change the current PID of the process from X to 1.) The processes created by the new binary are placed into the new PID namespace.
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in `ENOMEM`.) Without the `--fork` flag, the first command that we execute will be PID 1 ...
... And once it exits, we cannot create more processes in the namespace!

Check `man 2 unshare` and `man pid_namespaces` if you want more details.

---

## IPC namespace

--

- Does anybody know about IPC?

--

- Does anybody *care* about IPC?

--

- Allows a process (or group of processes) to have its own:

  - IPC semaphores
  - IPC message queues
  - IPC shared memory

  ... without risk of conflict with other instances.

- Older versions of PostgreSQL cared about this.

*No demo for that one.*

---

## User namespace

- Allows mapping UID/GID; e.g.:

  - UID 0→1999 in container C1 is mapped to UID 10000→11999 on host
  - UID 0→1999 in container C2 is mapped to UID 12000→13999 on host
  - etc.

- UID 0 in the container can still perform privileged operations in the container.

  (For instance: setting up network interfaces.)

- But outside of the container, it is a non-privileged user.

- It also means that the UID in containers becomes unimportant.

  (Just use UID 0 in the container, since it gets squashed to a non-privileged user outside.)

- Ultimately enables better privilege separation in container engines.

---

class: extra-details, deep-dive

## User namespace challenges

- UIDs need to be mapped when passed between processes or kernel subsystems.

- Filesystem permissions and file ownership are more complicated.

  .small[(E.g. when the same root filesystem is shared by multiple containers running with different UIDs.)]

- With the Docker Engine:

  - some feature combinations are not allowed
(e.g. user namespace + host network namespace sharing) - user namespaces need to be enabled/disabled globally
(when the daemon is started) - container images are stored separately
(so the first time you toggle user namespaces, you need to re-pull images)

*No demo for that one.*

---

# Control groups

- Control groups provide resource *metering* and *limiting*.

- This covers a number of "usual suspects" like:

  - memory
  - CPU
  - block I/O
  - network (with cooperation from iptables/tc)

- And a few exotic ones:

  - huge pages (a special way to allocate memory)
  - RDMA (resources specific to InfiniBand / remote memory transfer)

---

## Crowd control

- Control groups also allow grouping processes for special operations:

  - freezer (conceptually similar to a "mass-SIGSTOP/SIGCONT")
  - perf_event (gather performance events on multiple processes)
  - cpuset (limit or pin processes to specific CPUs)

- There is a "pids" cgroup to limit the number of processes in a given group.

- There is also a "devices" cgroup to control access to device nodes.

  (i.e. everything in `/dev`.)

---

## Generalities

- Cgroups form a hierarchy (a tree).

- We can create nodes in that hierarchy.

- We can associate limits to a node.

- We can move a process (or multiple processes) to a node.

- The process (or processes) will then respect these limits.

- We can check the current usage of each node.

- In other words: limits are optional (if we only want accounting).

- When a process is created, it is placed in its parent's groups.

---

## Example

The numbers are PIDs.

The names are the names of our nodes (arbitrarily chosen).

.small[
```bash
cpu                      memory
├── batch                ├── stateless
│   ├── cryptoscam       │   ├── 25
│   │   └── 52           │   ├── 26
│   └── ffmpeg           │   ├── 27
│       ├── 109          │   ├── 52
│       └── 88           │   ├── 109
└── realtime             │   └── 88
    ├── nginx            └── databases
    │   ├── 25               ├── 1008
    │   ├── 26               └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
```
]

---

class: extra-details, deep-dive

## Cgroups v1 vs v2

- Cgroups v1 are available on all systems (and widely used).

- Cgroups v2 are a huge refactor.

  (Development started in Linux 3.10, released in 4.5.)
- Cgroups v2 have a number of differences:

  - single hierarchy (instead of one tree per controller),
  - processes can only be on leaf nodes (not inner nodes),
  - and of course many improvements / refactorings.

---

## Memory cgroup: accounting

- Keeps track of pages used by each group:

  - file (read/write/mmap from block devices),
  - anonymous (stack, heap, anonymous mmap),
  - active (recently accessed),
  - inactive (candidate for eviction).

- Each page is "charged" to a group.

- Pages can be shared across multiple groups.

  (Example: multiple processes reading from the same files.)

- To view all the counters kept by this cgroup:

  ```bash
  $ cat /sys/fs/cgroup/memory/memory.stat
  ```

---

## Memory cgroup: limits

- Each group can have (optional) hard and soft limits.

- Limits can be set for different kinds of memory:

  - physical memory,
  - kernel memory,
  - total memory (including swap).

---

## Soft limits and hard limits

- Soft limits are not enforced.

  (But they influence reclaim under memory pressure.)

- Hard limits *cannot* be exceeded:

  - if a group of processes exceeds a hard limit,
  - and if the kernel cannot reclaim any memory,
  - then the OOM (out-of-memory) killer is triggered,
  - and processes are killed until memory gets below the limit again.

---

class: extra-details, deep-dive

## Avoiding the OOM killer

- For some workloads (databases and stateful systems), killing processes because we run out of memory is not acceptable.

- The "oom-notifier" mechanism helps with that.

- When "oom-notifier" is enabled and a hard limit is exceeded:

  - all processes in the cgroup are frozen,
  - a notification is sent to user space (instead of killing processes),
  - user space can then raise limits, migrate containers, etc.,
  - once the memory usage is below the hard limit, the cgroup is unfrozen.

---

class: extra-details, deep-dive

## Overhead of the memory cgroup

- Each time a process grabs or releases a page, the kernel updates counters.

- This adds some overhead.
- Unfortunately, this cannot be enabled/disabled per process.

- It has to be done system-wide, at boot time.

- Also, when multiple groups use the same page:

  - only the first group gets "charged",
  - but if it stops using it, the "charge" is moved to another group.

---

class: extra-details, deep-dive

## Setting up a limit with the memory cgroup

Create a new memory cgroup:

```bash
$ CG=/sys/fs/cgroup/memory/onehundredmegs
$ sudo mkdir $CG
```

Limit it to approximately 100MB of memory usage:

```bash
$ sudo tee $CG/memory.memsw.limit_in_bytes <<< 100000000
```

Move the current process to that cgroup:

```bash
$ sudo tee $CG/tasks <<< $$
```

The current process *and all its future children* are now limited.

(Confused about `<<<`? Look at the next slide!)

---

class: extra-details, deep-dive

## What's `<<<`?

- This is a "here string". (It is a non-POSIX shell extension.)

- The following commands are equivalent:

  ```bash
  foo <<< hello
  ```

  ```bash
  echo hello | foo
  ```

  ```bash
  foo <<EOF
  hello
  EOF
  ```

The following commands, however, would be invalid:

```bash
sudo echo $$ > $CG/tasks
```

(The redirection is performed by our unprivileged shell, *before* `sudo` runs.)

```bash
sudo -i # (or su)
echo $$ > $CG/tasks
```

(In the new root shell, `$$` is the PID of that shell, not of our original process.)

---

class: extra-details, deep-dive

## Testing the memory limit

Start the Python interpreter:

```bash
$ python
Python 3.6.4 (default, Jan 5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

Allocate 80 megabytes:

```python
>>> s = "!" * 1000000 * 80
```

Add 20 megabytes more:

```python
>>> t = "!" * 1000000 * 20
Killed
```

---

## CPU cgroup

- Keeps track of CPU time used by a group of processes.

  (This is easier and more accurate than `getrusage` and `/proc`.)

- Keeps track of usage per CPU as well.

  (i.e., "this group of processes used X seconds of CPU0 and Y seconds of CPU1".)

- Allows setting relative weights used by the scheduler.

---

## Cpuset cgroup

- Pin groups to specific CPU(s).

- Use-case: reserve CPUs for specific apps.
- Warning: make sure that "default" processes aren't using all CPUs!

- CPU pinning can also avoid performance loss due to cache flushes.

- This is also relevant for NUMA systems.

- Provides extra dials and knobs.

  (Per zone memory pressure, process migration costs...)

---

## Blkio cgroup

- Keeps track of I/Os for each group:

  - per block device
  - read vs write
  - sync vs async

- Sets throttles (limits) for each group:

  - per block device
  - read vs write
  - ops vs bytes

- Sets relative weights for each group.

- Note: most writes go through the page cache.

  (So classic writes will appear to be unthrottled at first.)

---

## Net_cls and net_prio cgroup

- Only work for egress (outgoing) traffic.

- Automatically set traffic class or priority for traffic generated by processes in the group.

- Net_cls will assign traffic to a class.

- Classes have to be matched with tc or iptables, otherwise traffic just flows normally.

- Net_prio will assign traffic to a priority.

- Priorities are used by queuing disciplines.

---

## Devices cgroup

- Controls what the group can do on device nodes.

- Permissions include read/write/mknod.

- Typical use:

  - allow `/dev/{tty,zero,random,null}` ...
  - deny everything else

- A few interesting nodes:

  - `/dev/net/tun` (network interface manipulation)
  - `/dev/fuse` (filesystems in user space)
  - `/dev/kvm` (VMs in containers, yay inception!)
  - `/dev/dri` (GPU)

---

# Security features

- Namespaces and cgroups are not enough to ensure strong security.

- We need extra mechanisms: capabilities, seccomp, LSMs.

- These mechanisms were already used before containers to harden security.

- They can be used together with containers.

- Good container engines will automatically leverage these features.

  (So that you don't have to worry about it.)

---

## Capabilities

- In traditional UNIX, many operations are possible if and only if UID=0 (root).

- Some of these operations are very powerful:

  - changing file ownership, accessing all files ...

- Some of these operations deal with system configuration, but can be abused:

  - setting up network interfaces, mounting filesystems ...

- Some of these operations are not very dangerous but are needed by servers:

  - binding to a port below 1024.

- Capabilities are per-process flags to allow these operations individually.

---

## Some capabilities

- `CAP_CHOWN`: arbitrarily change file ownership and permissions.

- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions.

- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.
- `CAP_NET_BIND_SERVICE`: bind a port below 1024.

See `man capabilities` for the full list and details.

---

## Using capabilities

- Container engines will typically drop all "dangerous" capabilities.

- You can then re-enable capabilities on a per-container basis, as needed.

- With the Docker engine: `docker run --cap-add ...`

- If you write your own code to manage capabilities:

  - make sure that you understand what each capability does,
  - read about *ambient* capabilities as well.

---

## Seccomp

- Seccomp is secure computing.

- It achieves a high level of security by drastically restricting the available syscalls.

- The original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()`.

- The seccomp-bpf extension allows specifying custom filters with BPF rules.

- This allows filtering by syscall, and by parameter.

- BPF code can perform arbitrarily complex checks, quickly, and safely.

- Container engines take care of this so you don't have to.

---

## Linux Security Modules

- The most popular ones are SELinux and AppArmor.

- Red Hat distros generally use SELinux.

- Debian distros (in particular, Ubuntu) generally use AppArmor.

- LSMs add a layer of access control to all process operations.

- Container engines take care of this so you don't have to.
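---

class: extra-details

## Peeking at capabilities

As a small hands-on complement to the capabilities slides above, the capability sets of any process can be inspected through `/proc`. This is a sketch, not part of the official demos; it works unprivileged, and the `capsh` decoding step assumes the libcap tools are installed (they may not be):

```shell
#!/bin/sh
# Every process carries several capability sets (inheritable, permitted,
# effective, bounding, ambient), exposed as hex bitmasks in /proc.
grep ^Cap /proc/self/status
# For an unprivileged process, CapEff is typically all zeroes;
# for root, it typically has most bits set.

# If capsh (from libcap) is available, a bitmask can be decoded into
# names like cap_net_bind_service or cap_net_admin:
#   capsh --decode=00000000a80425fb
```

Comparing `CapEff` inside and outside a container is a quick way to see which capabilities your container engine dropped.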