Add content about disruptions and PDB

Jérôme Petazzoni
2023-11-30 15:36:32 -06:00
parent a9920e5cf0
commit ac203a128d
4 changed files with 526 additions and 0 deletions


k8s/pod-disruption-budget.yaml Normal file
@@ -0,0 +1,13 @@
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  #minAvailable: 2
  #minAvailable: 90%
  maxUnavailable: 1
  #maxUnavailable: 10%
  selector:
    matchLabels:
      app: my-app

slides/k8s/disruptions.md Normal file

@@ -0,0 +1,511 @@
# Disruptions
In a perfect world...
- hardware never fails
- software never has bugs
- ...and never needs to be updated
- ...and uses a predictable amount of resources
- ...and these resources are infinite anyways
- network latency and packet loss are zero
- humans never make mistakes
--
😬
---
## Disruptions
In the real world...
- hardware will fail randomly (without advance notice)
- software has bugs
- ...and we constantly add new features
- ...and will sometimes use more resources than expected
- ...and these resources are limited
- network latency and packet loss are NOT zero
- humans make mistakes (shutting down the wrong machine, the wrong app...)
---
## Disruptions
- In Kubernetes, a "disruption" is something that stops the execution of a Pod
- There are **voluntary** and **involuntary** disruptions
- voluntary = directly initiated by humans (including by mistake!)
- involuntary = everything else
- In this section, we're going to see what they are and how to prevent them
(or at least, mitigate their effects)
---
## Node outage
- Example: hardware failure (server or network), low-level error
(includes kernel bugs, issues affecting underlying hypervisors or infrastructure...)
- **Involuntary** disruption (even if it results from human error!)
- Consequence: all workloads on that node become unresponsive
- Mitigations:
- scale workloads to at least 2 replicas (or more if quorum is needed)
- add anti-affinity scheduling constraints (to avoid having all pods on the same node)
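- For example, a *preferred* anti-affinity constraint could look like this sketch (the `app: my-app` label is an assumption about how the pods are labeled):
```yaml
# Sketch: prefer not to schedule two "my-app" pods on the same node
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
```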
---
## Node outage play-by-play
- Node goes down (or disconnected from network)
- Its lease (in Namespace `kube-node-lease`) doesn't get renewed
- Controller manager detects that and marks the node as "unreachable"
(this adds both `NoSchedule` and `NoExecute` taints to the node)
- Eventually, the `NoExecute` taint will evict the pods running on that node
- This will trigger creation of replacement pods by owner controllers
(except for pods with a stable network identity, e.g. in a Stateful Set!)
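- For reference, the taints placed on an unreachable Node look like this (a sketch of the relevant part of the Node spec):
```yaml
# Taints added to the Node when it is marked "unreachable"
spec:
  taints:
  - key: node.kubernetes.io/unreachable
    effect: NoSchedule
  - key: node.kubernetes.io/unreachable
    effect: NoExecute
```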
---
## Node outage notes
- By default, pods will tolerate the `unreachable:NoExecute` taint for 5 minutes
(toleration automatically added by the `DefaultTolerationSeconds` admission controller)
- Pods of a Stateful Set don't recover automatically:
- as long as the Pod exists, a replacement Pod can't be created
- the Pod will exist as long as its Node exists
- deleting the Node (manually or automatically) allows the Pod to be recreated
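- The tolerations added by `DefaultTolerationSeconds` show up in the pod spec like this (300 seconds = the 5 minutes mentioned above):
```yaml
# Tolerations automatically added to pods (unless they define their own)
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```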
---
## Memory/disk pressure
- Example: available memory on a node goes below a specific threshold
(because a pod is using too much memory and no limit was set)
- **Involuntary** disruption
- Consequence: kubelet starts to *evict* some pods
- Mitigations:
- set *resource limits* on containers to prevent them from using too many resources (see the example below)
- set *resource requests* on containers to make sure they don't get evicted
<br/>
(as long as they use less than what they requested)
- make sure that apps don't use more resources than what they've requested
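- A sketch of per-container requests and limits (the container name, image, and values are arbitrary examples):
```yaml
# Sketch: requests and limits on a container (values are arbitrary)
containers:
- name: web
  image: nginx
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      memory: 512Mi
```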
---
## Memory/disk pressure play-by-play
- Memory leak in an application container, slowly causing very high memory usage
- Overall free memory on the node goes below the *soft* or the *hard* threshold
(default hard threshold = 100Mi; default soft threshold = none)
- When reaching the *soft* threshold:
- kubelet waits until the "eviction soft grace period" expires
- then (if resource usage is still above the threshold) it gracefully evicts pods
- When reaching the *hard* threshold:
- kubelet immediately and forcefully evicts pods
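- These thresholds live in the kubelet configuration; a minimal sketch (the values shown are examples, not recommendations):
```yaml
# Sketch: kubelet eviction thresholds (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m"
```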
---
## Which pods are evicted?
- Kubelet only considers pods that are using *more* than what they requested
(and only for the resource that is under pressure, e.g. RAM or disk usage)
- First, it sorts pods by *priority* (as set with the `priorityClassName` in the pod spec)
- Then, by how much their resource usage exceeds their request
(again, for the resource that is under pressure)
- It evicts pods until enough resources have been freed up
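- Pod priority comes from a PriorityClass object referenced by the pod; a sketch (name and value are arbitrary):
```yaml
# Sketch: a PriorityClass; pods reference it with "priorityClassName: important"
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
```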
---
## Soft (graceful) vs hard (forceful) eviction
- Soft eviction = graceful shutdown of the pod
(honors the pod's `terminationGracePeriodSeconds` timeout)
- Hard eviction = immediate shutdown of the pod
(kills all containers immediately)
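- The graceful shutdown timeout is a pod spec field (30 seconds is the default):
```yaml
# Honored by soft (graceful) eviction, not by hard (forceful) eviction
spec:
  terminationGracePeriodSeconds: 30
```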
---
## Memory/disk pressure notes
- If resource usage increases *very fast*, kubelet might not catch it fast enough
- For memory: this will trigger the kernel out-of-memory killer
- containers killed by OOM are automatically restarted (no eviction)
- eviction might happen at a later point though (if memory usage stays high)
- For disk: there is no "out-of-disk" killer, but writes will fail
- the `write` system call fails with `errno = ENOSPC` / `No space left on device`
- eviction typically happens shortly after (when kubelet catches up)
---
## Memory/disk pressure delays
- By default, no soft threshold is defined
- Defining it requires setting both the threshold and the grace period
- Grace periods can be different for the different types of resources
- When a node is under pressure, kubelet places a `NoSchedule` taint
(to avoid adding more pods while the node is under pressure)
- Once the node is no longer under pressure, kubelet clears the taint
(after waiting an extra timeout, `evictionPressureTransitionPeriod`, 5 min by default)
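- In the kubelet configuration, these settings look roughly like this (durations are examples; the transition period defaults to 5 minutes):
```yaml
# Sketch: per-resource soft grace periods and pressure transition period
evictionSoftGracePeriod:
  memory.available: "1m"
  nodefs.available: "2m"
evictionPressureTransitionPeriod: "5m"
```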
---
## Accidental deletion
- Example: developer deletes the wrong Deployment, the wrong Namespace...
- **Voluntary** disruption
(from Kubernetes' perspective!)
- Consequence: application is down
- Mitigations:
- only deploy to production systems through e.g. gitops workflows
- enforce peer review of changes
- only give users limited (e.g. read-only) access to production systems
- use canary deployments (might not catch all mistakes though!)
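- For instance, read-only access can be expressed with an RBAC role that only has read verbs (a sketch; the role name is arbitrary):
```yaml
# Sketch: a read-only ClusterRole (grant it with a RoleBinding / ClusterRoleBinding)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: read-only
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
```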
---
## Bad code deployment
- Example: critical bug introduced, application crashes immediately or is non-functional
- **Voluntary** disruption
(again, from Kubernetes' perspective!)
- Consequence: application is down
- Mitigations:
- readiness probes can mitigate immediate crashes
<br/>
(rolling update continues only when enough pods are ready)
- delayed crashes will require a rollback
<br/>
(manual intervention, or automated by a canary system)
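- A readiness probe sketch (the path and port are assumptions about the app):
```yaml
# Sketch: HTTP readiness probe on a container
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```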
---
## Node shutdown
- Example: scaling down a cluster to save money
- **Voluntary** disruption
- Consequence:
- all workloads running on that node are terminated
- this might disrupt workloads that have too many replicas on that node
- or workloads that should not be interrupted at all
- Mitigations:
- terminate workloads one at a time, coordinating with users
--
🤔
---
## Node shutdown
- Example: scaling down a cluster to save money
- **Voluntary** disruption
- Consequence:
- all workloads running on that node are terminated
- this might disrupt workloads that have too many replicas on that node
- or workloads that should not be interrupted at all
- Mitigations:
- ~~terminate workloads one at a time, coordinating with users~~
- use Pod Disruption Budgets
---
## Pod Disruption Budgets
- A PDB is a kind of *contract* between:
- "admins" = folks maintaining the cluster (e.g. adding/removing/updating nodes)
- "users" = folks deploying apps and workloads on the cluster
- A PDB expresses something like:
*in that particular set of pods, do not "disrupt" more than X at a time*
- Examples:
- in that set of frontend pods, do not disrupt more than 1 at a time
- in that set of worker pods, always have at least 10 ready
<br/>
(do not disrupt them if it would bring down the number of ready pods below 10)
---
## PDB - user side
- Cluster users create a PDB with a manifest like this one:
```yaml
@@INCLUDE[k8s/pod-disruption-budget.yaml]
```
- The PDB must indicate either `minAvailable` or `maxUnavailable`
---
## Rounding logic
- Percentages are rounded **up**
- When specifying `maxUnavailable` as a percentage, this can result in a higher effective percentage
(e.g. `maxUnavailable: 50%` with 3 pods can result in 2 pods being unavailable!)
---
## Unmanaged pods
- Specifying `minAvailable: X` works all the time
- Specifying `minAvailable: X%` or `maxUnavailable` requires *managed pods*
(pods that belong to a controller, e.g. Replica Set, Stateful Set...)
- This is because the PDB controller needs to know the total number of pods
(given by the `replicas` field, not merely by counting pod objects)
- The PDB controller will try to resolve the controller using the pod selector
- If that fails, the PDB controller will emit warning events
(visible with `kubectl describe pdb ...`)
---
## Zero
- `maxUnavailable: 0` means "do not disrupt my pods"
- Same thing if `minAvailable` is greater than or equal to the number of pods
- In that case, cluster admins are supposed to get in touch with cluster users
- This will prevent fully automated operation
(and some cluster admins' automated systems might not honor that request)
---
## PDB - admin side
- As cluster admins, we need to follow certain rules
- Only shut down (or restart) a node when no pods are running on that node
(except system pods belonging to Daemon Sets)
- To remove pods running on a node, we should use the *eviction API*
(which will check PDB constraints and honor them)
- To prevent new pods from being scheduled on a node, we can use a *taint*
- These operations are streamlined by `kubectl drain`, which will:
- *cordon* the node (add a `NoSchedule` taint)
- invoke the *eviction API* to remove pods while respecting their PDBs
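- Under the hood, an eviction is requested by POSTing an `Eviction` object to the pod's `eviction` subresource; a sketch (pod name and namespace are examples):
```yaml
# Sketch: Eviction object (POSTed to /api/v1/namespaces/default/pods/my-pod/eviction)
apiVersion: policy/v1
kind: Eviction
metadata:
  name: my-pod
  namespace: default
```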
---
## Theory vs practice
- `kubectl drain` won't evict pods using `emptyDir` volumes
(unless the `--delete-emptydir-data` flag is passed as well)
- Make sure that `emptyDir` volumes don't hold anything important
(they shouldn't, but... who knows!)
- Kubernetes lacks a standard way for users to express:
*this `emptyDir` volume can/cannot be safely deleted*
- If a PDB forbids an eviction, this requires manual coordination
---
class: extra-details
## Unhealthy pod eviction policy
- By default, unhealthy pods can only be evicted if PDB allows it
(unhealthy = running, but not ready)
- In many cases, unhealthy pods aren't healthy anyway, and can be removed
- This behavior is enabled by setting the appropriate field in the PDB manifest:
```yaml
spec:
unhealthyPodEvictionPolicy: AlwaysAllow
```
---
## Node upgrade
- Example: upgrading kubelet or the Linux kernel on a node
- **Voluntary** disruption
- Consequence:
- all workloads running on that node are temporarily interrupted, and restarted
- this might disrupt these workloads
- Mitigations:
- migrate workloads off the node first (as if we were shutting it down)
---
## Node upgrade notes
- Is it necessary to drain a node before doing an upgrade?
- From [the documentation][node-upgrade-docs]:
*Draining nodes before upgrading kubelet ensures that pods are re-admitted and containers are re-created, which may be necessary to resolve some security issues or other important bugs.*
- It's *probably* safe to upgrade in-place for:
- kernel upgrades
- kubelet patch-level upgrades (1.X.Y → 1.X.Z)
- It's *probably* better to drain the node for minor version kubelet upgrades (1.X → 1.Y)
- When in doubt, test extensively in staging environments!
[node-upgrade-docs]: https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/#manual-deployments
---
## Manual rescheduling
- Example: moving workloads around to accommodate noisy neighbors or other issues
(e.g. pod X is doing a lot of disk I/O and this is starving other pods)
- **Voluntary** disruption
- Consequence:
- the moved workloads are temporarily interrupted
- Mitigations:
- define an appropriate number of replicas, declare PDBs
- use the [eviction API][eviction-API] to move workloads
[eviction-API]: https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/
???
:EN:- Voluntary and involuntary disruptions
:EN:- Pod Disruption Budgets
:FR:- "Disruptions" volontaires et involontaires
:FR:- Pod Disruption Budgets


@@ -67,6 +67,7 @@ content:
- - k8s/resource-limits.md
- k8s/metrics-server.md
- k8s/cluster-sizing.md
- k8s/disruptions.md
- k8s/horizontal-pod-autoscaler.md
- - k8s/prometheus.md
#- k8s/prometheus-stack.md


@@ -131,6 +131,7 @@ content:
- k8s/resource-limits.md
- k8s/metrics-server.md
- k8s/cluster-sizing.md
- k8s/disruptions.md
- k8s/cluster-autoscaler.md
- k8s/horizontal-pod-autoscaler.md
- k8s/hpa-v2.md