diff --git a/k8s/pod-disruption-budget.yaml b/k8s/pod-disruption-budget.yaml
new file mode 100644
index 0000000..8c82697
--- /dev/null
+++ b/k8s/pod-disruption-budget.yaml
@@ -0,0 +1,12 @@
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: my-pdb
+spec:
+  #minAvailable: 2
+  #minAvailable: 90%
+  maxUnavailable: 1
+  #maxUnavailable: 10%
+  selector:
+    matchLabels:
+      app: my-app
diff --git a/slides/k8s/disruptions.md b/slides/k8s/disruptions.md
new file mode 100644
index 0000000..bc56e29
--- /dev/null
+++ b/slides/k8s/disruptions.md
@@ -0,0 +1,511 @@
+# Disruptions
+
+In a perfect world...
+
+- hardware never fails
+
+- software never has bugs
+
+- ...and never needs to be updated
+
+- ...and uses a predictable amount of resources
+
+- ...and these resources are infinite anyways
+
+- network latency and packet loss are zero
+
+- humans never make mistakes
+
+--
+
+😬
+
+---
+
+## Disruptions
+
+In the real world...
+
+- hardware will fail randomly (without advance notice)
+
+- software has bugs
+
+- ...and we constantly add new features
+
+- ...and will sometimes use more resources than expected
+
+- ...and these resources are limited
+
+- network latency and packet loss are NOT zero
+
+- humans make mistakes (shutting down the wrong machine, the wrong app...)
+
+---
+
+## Disruptions
+
+- In Kubernetes, a "disruption" is something that stops the execution of a Pod
+
+- There are **voluntary** and **involuntary** disruptions
+
+  - voluntary = directly initiated by humans (including by mistake!)
+
+  - involuntary = everything else
+
+- In this section, we're going to see what they are and how to prevent them
+
+  (or at least, mitigate their effects)
+
+---
+
+## Node outage
+
+- Example: hardware failure (server or network), low-level error
+
+  (includes kernel bugs, issues affecting underlying hypervisors or infrastructure...)
+
+- **Involuntary** disruption (even if it results from human error!)
+
+- Consequence: all workloads on that node become unresponsive
+
+- Mitigations:
+
+  - scale workloads to at least 2 replicas (or more if quorum is needed)
+
+  - add anti-affinity scheduling constraints (to avoid having all pods on the same node; example a couple of slides down)
+
+---
+
+## Node outage play-by-play
+
+- Node goes down (or gets disconnected from the network)
+
+- Its lease (in Namespace `kube-node-lease`) doesn't get renewed
+
+- The controller manager detects that and marks the node as "unreachable"
+
+  (this adds both `NoSchedule` and `NoExecute` taints to the node)
+
+- Eventually, the `NoExecute` taint will evict the pods running on that node
+
+- This will trigger the creation of replacement pods by their owner controllers
+
+  (except for pods with a stable network identity, e.g. in a Stateful Set!)
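+
+---
+
+## Example: spreading replicas across nodes
+
+- A minimal sketch of these mitigations (at least 2 replicas + anti-affinity)
+
+  (the names, labels, and image are illustrative placeholders)
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: my-app
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: my-app
+  template:
+    metadata:
+      labels:
+        app: my-app
+    spec:
+      affinity:
+        podAntiAffinity:
+          # prefer to place replicas on different nodes
+          preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 100
+            podAffinityTerm:
+              labelSelector:
+                matchLabels:
+                  app: my-app
+              topologyKey: kubernetes.io/hostname
+      containers:
+      - name: web
+        image: nginx
+```
+
+- With `preferred...`, pods still get scheduled if they can't be spread; `required...` would refuse instead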
+
+---
+
+## Node outage notes
+
+- By default, pods will tolerate the `unreachable:NoExecute` taint for 5 minutes
+
+  (that toleration is automatically added by the `DefaultTolerationSeconds` admission controller)
+
+- Pods of a Stateful Set don't recover automatically:
+
+  - as long as the Pod exists, a replacement Pod can't be created
+
+  - the Pod will exist as long as its Node exists
+
+  - deleting the Node (manually or automatically) will allow the Pod to be recreated
+
+---
+
+## Memory/disk pressure
+
+- Example: available memory on a node goes below a specific threshold
+
+  (because a pod is using too much memory and no limit was set)
+
+- **Involuntary** disruption
+
+- Consequence: kubelet starts to *evict* some pods
+
+- Mitigations (see the example on the next slide):
+
+  - set *resource limits* on containers to prevent them from using too many resources
+
+  - set *resource requests* on containers to make sure they don't get evicted
+
+    (as long as they use less than what they requested)
+
+  - make sure that apps don't use more resources than what they've requested
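+
+---
+
+## Example: resource requests and limits
+
+- A sketch of a container spec with requests and limits
+
+  (the name, image, and values are illustrative; size them according to the actual workload)
+
+```yaml
+containers:
+- name: worker
+  image: my-worker-image   # placeholder image
+  resources:
+    requests:
+      # what the scheduler reserves for the container
+      memory: "256Mi"
+      cpu: "250m"
+    limits:
+      # above this, the container gets throttled (CPU) or OOM-killed (memory)
+      memory: "512Mi"
+      cpu: "1"
+```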
+
+---
+
+## Memory/disk pressure play-by-play
+
+- Memory leak in an application container, slowly causing very high memory usage
+
+- Overall free memory on the node goes below the *soft* or the *hard* threshold
+
+  (default hard threshold = 100Mi; default soft threshold = none)
+
+- When reaching the *soft* threshold:
+
+  - kubelet waits until the "eviction soft grace period" expires
+
+  - then (if resource usage is still above the threshold) it gracefully evicts pods
+
+- When reaching the *hard* threshold:
+
+  - kubelet immediately and forcefully evicts pods
+
+---
+
+## Which pods are evicted?
+
+- Kubelet only considers pods that are using *more* than what they requested
+
+  (and only for the resource that is under pressure, e.g. RAM or disk usage)
+
+- First, it sorts pods by *priority* (as set with the `priorityClassName` in the pod spec)
+
+- Then, by how much their resource usage exceeds their request
+
+  (again, for the resource that is under pressure)
+
+- It evicts pods until enough resources have been freed up
+
+---
+
+## Soft (graceful) vs hard (forceful) eviction
+
+- Soft eviction = graceful shutdown of the pod
+
+  (honors the pod's `terminationGracePeriodSeconds` timeout)
+
+- Hard eviction = immediate shutdown of the pod
+
+  (kills all containers immediately)
+
+---
+
+## Memory/disk pressure notes
+
+- If resource usage increases *very fast*, kubelet might not catch it fast enough
+
+- For memory: this will trigger the kernel out-of-memory killer
+
+  - containers killed by OOM are automatically restarted (no eviction)
+
+  - eviction might happen at a later point though (if memory usage stays high)
+
+- For disk: there is no "out-of-disk" killer, but writes will fail
+
+  - the `write` system call fails with `errno = ENOSPC` / `No space left on device`
+
+  - eviction typically happens shortly after (when kubelet catches up)
+
+---
+
+## Memory/disk pressure delays
+
+- By default, no soft threshold is defined
+
+- Defining it requires setting both the threshold and the grace period
+
+  (see the configuration example on the next slide)
+
+- Grace periods can be different for the different types of resources
+
+- When a node is under pressure, kubelet places a `NoSchedule` taint
+
+  (to avoid adding more pods while the node is under pressure)
+
+- Once the node is no longer under pressure, kubelet clears the taint
+
+  (after waiting an extra timeout, `evictionPressureTransitionPeriod`, 5 min by default)
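+
+---
+
+## Example: eviction thresholds
+
+- A sketch of the relevant kubelet configuration fields
+
+  (the threshold and grace period values are illustrative, not recommendations)
+
+```yaml
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+evictionHard:
+  memory.available: "100Mi"
+evictionSoft:
+  memory.available: "500Mi"
+evictionSoftGracePeriod:
+  memory.available: "1m30s"
+evictionPressureTransitionPeriod: "5m"
+```
+
+- This goes in the kubelet configuration file (the one passed to kubelet with the `--config` flag)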
+
+---
+
+## Accidental deletion
+
+- Example: developer deletes the wrong Deployment, the wrong Namespace...
+
+- **Voluntary** disruption
+
+  (from Kubernetes' perspective!)
+
+- Consequence: application is down
+
+- Mitigations:
+
+  - only deploy to production systems through e.g. gitops workflows
+
+  - enforce peer review of changes
+
+  - only give users limited (e.g. read-only) access to production systems
+
+  - use canary deployments (might not catch all mistakes though!)
+
+---
+
+## Bad code deployment
+
+- Example: critical bug introduced, application crashes immediately or is non-functional
+
+- **Voluntary** disruption
+
+  (again, from Kubernetes' perspective!)
+
+- Consequence: application is down
+
+- Mitigations:
+
+  - readiness probes can mitigate immediate crashes
+
+    (rolling update continues only when enough pods are ready; see the example on the next slide)
+
+  - delayed crashes will require a rollback
+
+    (manual intervention, or automated by a canary system)
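+
+---
+
+## Example: readiness probe
+
+- A sketch of a readiness probe on a container
+
+  (the image, endpoint, port, and timings are illustrative and depend on the app)
+
+```yaml
+containers:
+- name: web
+  image: my-app:1.2.3   # placeholder image
+  readinessProbe:
+    httpGet:
+      path: /healthz
+      port: 8080
+    initialDelaySeconds: 5
+    periodSeconds: 10
+    failureThreshold: 3
+```
+
+- During a rolling update, a pod failing this probe never becomes "ready", which stalls the rollout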
+
+---
+
+## Node shutdown
+
+- Example: scaling down a cluster to save money
+
+- **Voluntary** disruption
+
+- Consequence:
+
+  - all workloads running on that node are terminated
+
+  - this might disrupt workloads that have too many of their replicas on that node
+
+  - or workloads that should not be interrupted at all
+
+- Mitigations:
+
+  - terminate workloads one at a time, coordinating with users
+
+--
+
+🤔
+
+---
+
+## Node shutdown
+
+- Example: scaling down a cluster to save money
+
+- **Voluntary** disruption
+
+- Consequence:
+
+  - all workloads running on that node are terminated
+
+  - this might disrupt workloads that have too many of their replicas on that node
+
+  - or workloads that should not be interrupted at all
+
+- Mitigations:
+
+  - ~~terminate workloads one at a time, coordinating with users~~
+
+  - use Pod Disruption Budgets
+
+---
+
+## Pod Disruption Budgets
+
+- A PDB is a kind of *contract* between:
+
+  - "admins" = folks maintaining the cluster (e.g. adding/removing/updating nodes)
+
+  - "users" = folks deploying apps and workloads on the cluster
+
+- A PDB expresses something like:
+
+  *in that particular set of pods, do not "disrupt" more than X at a time*
+
+- Examples:
+
+  - in that set of frontend pods, do not disrupt more than 1 at a time
+
+  - in that set of worker pods, always have at least 10 ready
+
+    (do not disrupt them if it would bring down the number of ready pods below 10; manifest on the next slide)
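+
+---
+
+## Example: "always have at least 10 ready"
+
+- The second example from the previous slide could be written like this
+
+  (a sketch; the PDB name and the `app: worker` selector are illustrative)
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: worker-pdb
+spec:
+  minAvailable: 10
+  selector:
+    matchLabels:
+      app: worker
+```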
+
+---
+
+## PDB - user side
+
+- Cluster users create a PDB with a manifest like this one:
+
+```yaml
+@@INCLUDE[k8s/pod-disruption-budget.yaml]
+```
+
+- The PDB must indicate either `minAvailable` or `maxUnavailable`
+
+---
+
+## Rounding logic
+
+- Percentages are rounded **up**
+
+- When specifying `maxUnavailable` as a percentage, this can allow a higher effective percentage of disruption
+
+  (e.g. `maxUnavailable: 50%` with 3 pods can result in 2 pods being unavailable!)
+
+---
+
+## Unmanaged pods
+
+- Specifying `minAvailable: X` works all the time
+
+- Specifying `minAvailable: X%` or `maxUnavailable` requires *managed pods*
+
+  (pods that belong to a controller, e.g. Replica Set, Stateful Set...)
+
+- This is because the PDB controller needs to know the total number of pods
+
+  (given by the `replicas` field, not merely by counting pod objects)
+
+- The PDB controller will try to resolve the controller using the pod selector
+
+- If that fails, the PDB controller will emit warning events
+
+  (visible with `kubectl describe pdb ...`)
+
+---
+
+## Zero
+
+- `maxUnavailable: 0` means "do not disrupt my pods"
+
+- Same thing if `minAvailable` is greater than or equal to the number of pods
+
+- In that case, cluster admins are supposed to get in touch with cluster users
+
+- This will prevent fully automated operation
+
+  (and some admins' automated systems might not honor that request)
+
+---
+
+## PDB - admin side
+
+- As a cluster admin, we need to follow certain rules
+
+- Only shut down (or restart) a node when no pods are running on that node
+
+  (except system pods belonging to Daemon Sets)
+
+- To remove pods running on a node, we should use the *eviction API*
+
+  (which will check PDB constraints and honor them; see the example on the next slide)
+
+- To prevent new pods from being scheduled on a node, we can use a *taint*
+
+- These operations are streamlined by `kubectl drain`, which will:
+
+  - *cordon* the node (mark it unschedulable, which adds a `NoSchedule` taint)
+
+  - invoke the *eviction API* to remove pods while respecting their PDBs
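+
+---
+
+## Example: the eviction API
+
+- Evictions are requested by POSTing an `Eviction` object to the pod's `eviction` subresource
+
+  (a sketch; the pod name and namespace are illustrative — tools like `kubectl drain` do this for us)
+
+```yaml
+apiVersion: policy/v1
+kind: Eviction
+metadata:
+  name: my-app-12345
+  namespace: default
+```
+
+- If the eviction would violate a PDB, the API returns an error and the pod is not evicted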
+
+---
+
+## Theory vs practice
+
+- `kubectl drain` won't evict pods using `emptyDir` volumes
+
+  (unless the `--delete-emptydir-data` flag is passed as well)
+
+- Make sure that `emptyDir` volumes don't hold anything important
+
+  (they shouldn't, but... who knows!)
+
+- Kubernetes lacks a standard way for users to express:
+
+  *this `emptyDir` volume can/cannot be safely deleted*
+
+- If a PDB forbids an eviction, this requires manual coordination
+
+---
+
+class: extra-details
+
+## Unhealthy pod eviction policy
+
+- By default, unhealthy pods can only be evicted if the PDB allows it
+
+  (unhealthy = running, but not ready)
+
+- In many cases, unhealthy pods aren't serving traffic anyway, and can safely be removed
+
+- This behavior is enabled by setting the appropriate field in the PDB manifest:
+
+```yaml
+spec:
+  unhealthyPodEvictionPolicy: AlwaysAllow
+```
+
+---
+
+## Node upgrade
+
+- Example: upgrading kubelet or the Linux kernel on a node
+
+- **Voluntary** disruption
+
+- Consequence:
+
+  - all workloads running on that node are temporarily interrupted, and restarted
+
+  - this might disrupt these workloads
+
+- Mitigations:
+
+  - migrate workloads off the node first (as if we were shutting it down)
+
+---
+
+## Node upgrade notes
+
+- Is it necessary to drain a node before doing an upgrade?
+
+- From [the documentation][node-upgrade-docs]:
+
+  *Draining nodes before upgrading kubelet ensures that pods are re-admitted and containers are re-created, which may be necessary to resolve some security issues or other important bugs.*
+
+- It's *probably* safe to upgrade in-place for:
+
+  - kernel upgrades
+
+  - kubelet patch-level upgrades (1.X.Y → 1.X.Z)
+
+- It's *probably* better to drain the node for minor-version kubelet upgrades (1.X → 1.Y)
+
+- When in doubt, test extensively in staging environments!
+
+[node-upgrade-docs]: https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/#manual-deployments
+
+---
+
+## Manual rescheduling
+
+- Example: moving workloads around to accommodate noisy neighbors or other issues
+
+  (e.g. pod X is doing a lot of disk I/O and this is starving other pods)
+
+- **Voluntary** disruption
+
+- Consequence:
+
+  - the moved workloads are temporarily interrupted
+
+- Mitigations:
+
+  - define an appropriate number of replicas, declare PDBs
+
+  - use the [eviction API][eviction-API] to move workloads
+
+[eviction-API]: https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/
+
+???
+
+:EN:- Voluntary and involuntary disruptions
+:EN:- Pod Disruption Budgets
+:FR:- "Disruptions" volontaires et involontaires
+:FR:- Pod Disruption Budgets
\ No newline at end of file
diff --git a/slides/kube-selfpaced.yml b/slides/kube-selfpaced.yml
index c102f69..9422f23 100644
--- a/slides/kube-selfpaced.yml
+++ b/slides/kube-selfpaced.yml
@@ -129,6 +129,7 @@ content:
   - k8s/resource-limits.md
   - k8s/metrics-server.md
   - k8s/cluster-sizing.md
+  - k8s/disruptions.md
   - k8s/cluster-autoscaler.md
   - k8s/horizontal-pod-autoscaler.md
   - k8s/hpa-v2.md