Init systems and PID 1

In this chapter, we will consider:

What's an init system?

On UNIX, the "init system" (or "init" in short) is PID 1.
It is the first process started by the kernel when the system starts.
It has multiple responsibilities:
- start every other process on the machine,
- reap orphaned zombie processes.

class: extra-details

When a process exits (or "dies"), it becomes a "zombie".

(Zombie processes show up in ps or top with the status code Z.)
Its parent process must reap the zombie process.

(This is done by calling waitpid() to retrieve the process' exit status.)
When a process exits, if it has child processes, these processes are "orphaned."
They are then re-parented to PID 1, init.
Init therefore needs to take care of these orphaned processes when they exit.

It's often tempting to use an init system or a process manager.

(Examples: systemd, supervisord...)
Our containers are then called "system containers".

(By contrast with "application containers".)
"System containers" are similar to lightweight virtual machines.
They have multiple downsides:
- when starting multiple processes, their logs get mixed on stdout,
- if the application process dies, the container engine doesn't see it.
Overall, they make it harder to operate troubleshoot containerized apps.

Sometimes, it's convenient to run a real init system like systemd.

(Example: a CI system whose goal is precisely to test an init script or unit file.)
If we need to run multiple processes: can we use multiple containers?

(Example: this Compose file runs multiple processes together.)
When deploying with Kubernetes:
- a container belong to a pod,
- a pod can have multiple containers.

Our application runs as PID 1 in the container.
Our application may or may not be designed to reap zombie processes.
If our application uses subprocesses and doesn't reap them ...

... this can lead to PID exhaustion!

(Or, more realistically, to a confusing herd of zombie processes.)
How can we solve this?

Docker can automatically provide a minimal init process.
This is enabled with docker run --init ...
It uses a small init system (tini) as PID 1:
- it reaps zombies,
- it forwards signals,
- it exits when the child exits.
It is totally transparent to our application.
We should use it if our application creates subprocess but doesn't reap them.

class: extra-details

Kubernetes does not expose that --init option.
However, we can achieve the same result with Process Namespace Sharing.
When Process Namespace Sharing is enabled, PID 1 will be pause.
That pause process takes care of reaping zombies.
Process Namespace Sharing is available since Kubernetes 1.16.
If you're using an older version of Kubernetes ...

... you might have to add tini explicitly to your Docker image.