github/container.training

mirror of https://github.com/jpetazzo/container.training.git synced 2026-05-19 23:36:33 +00:00

Jérôme Petazzoni d1047f950d 📃 Update resource limits to add ephemeral-storage

2023-11-29 14:23:24 -06:00

Resource Limits

We can attach resource indications to our pods

(or rather: to the containers in our pods)
We can specify limits and/or requests
We can specify quantities of CPU and/or memory and/or ephemeral storage

Requests vs limits

Requests are guaranteed reservations of resources
They are used for scheduling purposes
Kubelet will use cgroups to e.g. guarantee a minimum amount of CPU time
A container can use more than its requested resources
A container using less than what it requested should never be killed or throttled
A node cannot be overcommitted with requests

(the sum of all requests cannot be higher than resources available on the node)
A small amount of resources is set aside for system components

(this explains why there is a difference between "capacity" and "allocatable")
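The "sum of requests" rule can be illustrated with a minimal sketch. The function name, the node size, and the pod values below are hypothetical; CPU is in millicores and memory in MiB to keep the arithmetic simple. This is not the actual scheduler code, only the fit check described above.

```python
def fits_on_node(allocatable, scheduled_requests, new_request):
    """Return True if new_request fits next to the pods already scheduled.

    Only requests are considered; limits and actual usage are ignored.
    """
    for resource, wanted in new_request.items():
        already_requested = sum(r.get(resource, 0) for r in scheduled_requests)
        if already_requested + wanted > allocatable.get(resource, 0):
            return False
    return True


# Hypothetical node: "allocatable" is the capacity minus what is set aside
# for system components.
allocatable = {"cpu": 3800, "memory": 7000}
scheduled = [{"cpu": 2000, "memory": 4096}, {"cpu": 1000, "memory": 1024}]

print(fits_on_node(allocatable, scheduled, {"cpu": 500, "memory": 1024}))  # True
print(fits_on_node(allocatable, scheduled, {"cpu": 1000, "memory": 512}))  # False
```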

Requests vs limits

Limits are "hard limits" (a container cannot exceed its limits)
They aren't taken into account by the scheduler
A container exceeding its memory limit is killed instantly

(by the kernel out-of-memory killer)
A container exceeding its CPU limit is throttled
A container exceeding its disk limit is killed

(usually with a small delay, since this is checked periodically by kubelet)
On a given node, the sum of all limits can be higher than the node size

Compressible vs incompressible resources

CPU is a compressible resource
- it can be preempted immediately without adverse effect
- if we have N CPU and need 2N, we run at 50% speed
Memory is an incompressible resource
- it needs to be swapped out to be reclaimed; and this is costly
- if we have N GB RAM and need 2N, we might run at... 0.1% speed!
Disk is also an incompressible resource
- when the disk is full, writes will fail
- applications may or may not crash but persistent apps will be in trouble

Running low on CPU

Two ways for a container to "run low" on CPU:
- it's hitting its CPU limit
- all CPUs on the node are at 100% utilization
The app in the container will run slower

(compared to running without a limit, or if CPU cycles were available)
No other consequence

(but this could affect SLA/SLO for latency-sensitive applications!)

CPU limits implementation details

A container with a CPU limit will be "rationed" by the kernel
Every cfs_period_us, it will receive a CPU quota, like an "allowance"

(that interval defaults to 100ms)
Once it has used its quota, it will be stalled until the next period
This can easily result in throttling for bursty workloads

(see details on next slide)

A bursty example

Web service receives one request per minute
Each request takes 1 second of CPU
Average load: about 1.67% (1 second of CPU every 60 seconds)
Let's say we set a CPU limit of 10%
This means CPU quotas of 10ms every 100ms
Obtaining the quota for 1 second of CPU will take 10 seconds
Observed latency will be 10 seconds (... actually 9.9s) instead of 1 second

(real-life scenarios will of course be less extreme, but they do happen!)
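The arithmetic of this example can be checked with a short sketch. It assumes a single thread, a request that starts exactly at the beginning of a period with a full quota, and no other activity in the container; the function name is hypothetical.

```python
def throttled_latency_ms(cpu_needed_ms, quota_ms, period_ms=100):
    """Wall-clock time to consume cpu_needed_ms of CPU when at most
    quota_ms of CPU can be used during each period of period_ms."""
    if cpu_needed_ms <= 0:
        return 0
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # The work ends as soon as the quota of the last period is used up.
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder


print(throttled_latency_ms(1000, 10))   # 9910 (limit of 10%: about 9.9 seconds)
print(throttled_latency_ms(1000, 100))  # 1000 (quota equal to the period: 1 second)
```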

Multi-core scheduling details

Each core gets a small share of the container's CPU quota

(this avoids locking and contention on the "global" quota for the container)
By default, the kernel distributes that quota to CPUs in 5ms increments

(tunable with kernel.sched_cfs_bandwidth_slice_us)
If a containerized process (or thread) uses up its local CPU quota:

it gets more from the "global" container quota (if there's some left)
If it "yields" (e.g. sleeps for I/O) before using its local CPU quota:

the quota is soon returned to the "global" container quota, minus 1ms

Low quotas on machines with many cores

The local CPU quota is not immediately returned to the global quota
- this reduces locking and contention on the global quota
- but this can cause starvation when many threads/processes become runnable
That 1ms that "stays" on the local CPU quota is often useful
- if the thread/process becomes runnable, it can be scheduled immediately
- again, this reduces locking and contention on the global quota
- but if the thread/process doesn't become runnable, it is wasted!
- this can become a huge problem on machines with many cores
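To get a feeling for the orders of magnitude, here is a back-of-the-envelope sketch. It is not a model of the kernel scheduler: it only computes an upper bound on the quota that could stay "parked" on local CPUs during one period, assuming the behavior described above (1ms kept per CPU, and wasted if the thread doesn't become runnable again). The function name and the numbers are hypothetical.

```python
def worst_case_stranded_ms(quota_ms, cores_touched, kept_per_core_ms=1):
    """Upper bound on the quota left unused on local CPUs in one period."""
    return min(quota_ms, cores_touched * kept_per_core_ms)


# CPU limit of 100m = 10ms of quota per 100ms period.
print(worst_case_stranded_ms(10, 8))    # 8  (up to 80% of the quota)
# CPU limit of 2 = 200ms of quota per period, threads spread over 88 cores.
print(worst_case_stranded_ms(200, 88))  # 88 (up to 44% of the quota)
```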

CPU limits in a nutshell

Beware if you run small bursty workloads on machines with many cores!

("highly-threaded, user-interactive, non-cpu bound applications")
Check the nr_throttled and throttled_time metrics in cpu.stat
Possible solutions/workarounds:
- be generous with the limits
- make sure your kernel has the appropriate patch
- use static CPU manager policy

For more details, check this blog post or these ones (part 1, part 2).
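As a rough illustration of how these metrics can be used, the sketch below computes the fraction of periods during which a container was throttled. The sample content is made up, and the location of cpu.stat is deliberately not shown, since it depends on the cgroup version and driver. The nr_periods counter is assumed to be present in the file next to nr_throttled.

```python
def throttling_ratio(cpu_stat_text):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical content of a cpu.stat file.
sample = """nr_periods 1200
nr_throttled 300
throttled_time 4500000000
"""
print(throttling_ratio(sample))  # 0.25
```

A ratio that keeps growing over time suggests that the CPU limit is too low for the workload's bursts.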

Running low on memory

When the kernel runs low on memory, it starts to reclaim used memory
Option 1: free up some buffers and caches

(fastest option; might affect performance if cache memory runs very low)
Option 2: swap, i.e. write some memory of one process to disk, to give that memory to another process

(can have a huge negative impact on performance because disks are slow)
Option 3: terminate a process and reclaim all its memory

(OOM or Out Of Memory Killer on Linux)

Memory limits on Kubernetes

By default, Kubernetes does not support swap

(but support is being added progressively, thanks to KEP 2400)
If a container exceeds its memory limit, it gets killed immediately
If a node's memory usage gets too high, the kubelet will evict some pods

(we say that the node is "under pressure", more on that in a bit!)

Running low on disk

When the kubelet runs low on disk, it starts to reclaim disk space

(similarly to what the kernel does, but in different categories)
Option 1: garbage collect dead pods and containers

(no consequence, but their logs will be deleted)
Option 2: remove unused images

(no consequence, but these images will have to be repulled if we need them later)
Option 3: evict pods and remove them to reclaim their disk usage
Note: this only applies to ephemeral storage, not to e.g. Persistent Volumes!

Ephemeral storage?

This includes:
- the read-write layer of the container
  (any file creation/modification outside of its volumes)
- emptyDir volumes mounted in the container
- the container logs stored on the node
This does not include:
- the container image
- other types of volumes (e.g. Persistent Volumes, hostPath, or local volumes)

Disk limit enforcement

Disk usage is periodically measured by kubelet

(with something equivalent to du)
There can be a small delay before pod termination when disk limit is exceeded
It's also possible to enable filesystem project quotas

(e.g. with EXT4 or XFS)
Remember that container logs are also accounted for!

(container log rotation/retention is managed by kubelet)

`nodefs` and `imagefs`

nodefs is the main filesystem of the node

(holding, notably, emptyDir volumes and container logs)
Optionally, the container engine can be configured to use an imagefs
imagefs will store container images and container writable layers
When there is a separate imagefs, its disk usage is tracked independently
If imagefs usage gets too high, kubelet will remove old images first

(conversely, if nodefs usage gets too high, kubelet won't remove old images)

CPU and RAM reservation

Kubernetes passes resource requests and limits to the container engine
The container engine applies these requests and limits with specific mechanisms
Example: on Linux, this is typically done with control groups aka cgroups
Most systems use cgroups v1, but cgroups v2 are slowly being rolled out

(e.g. available in Ubuntu 22.04 LTS)
Cgroups v2 have new, interesting features for memory control:
- ability to set "minimum" memory amounts (to effectively reserve memory)
- better control on the amount of swap used by a container

What's the deal with swap?

With cgroups v1, it's not possible to disable swap for a cgroup

(the closest option is to reduce "swappiness")
It is possible with cgroups v2 (see the kernel docs and the fbtax docs)
Cgroups v2 aren't widely deployed yet
The architects of Kubernetes wanted to ensure that Guaranteed pods never swap
The simplest solution was to disable swap entirely
Kubelet will refuse to start if it detects that swap is enabled!

Alternative point of view

Swap enables paging¹ of anonymous² memory
Even when swap is disabled, Linux will still page memory for:
- executables, libraries
- mapped files
Disabling swap will reduce performance and available resources
For a good time, read kubernetes/kubernetes#53533
Also read this excellent blog post about swap

¹Paging: reading/writing memory pages from/to disk to reclaim physical memory

²Anonymous memory: memory that is not backed by files or blocks

Enabling swap anyway

If you don't care that pods are swapping, you can enable swap
You will need to add the flag --fail-swap-on=false to kubelet

(remember: it won't otherwise start if it detects that swap is enabled)

Pod quality of service

Each pod is assigned a QoS class (visible in status.qosClass).

If limits = requests:
- as long as the container uses less than the limit, it won't be affected
- if all containers in a pod have (limits=requests), QoS is considered "Guaranteed"
If requests < limits:
- as long as the container uses less than the request, it won't be affected
- otherwise, it might be killed/evicted if the node gets overloaded
- if at least one container has (requests<limits), QoS is considered "Burstable"
If a pod doesn't have any request nor limit, QoS is considered "BestEffort"
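The rules above can be summarized with a small sketch. It is a simplified model, not the Kubernetes implementation: containers are plain dictionaries, only CPU and memory are considered, and values are compared as written (so "100m" and "0.1" would be seen as different). The function name is hypothetical.

```python
def qos_class(containers):
    """containers: list of dicts with optional "requests" and "limits" keys."""
    any_value = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A request that is not set defaults to the limit.
        requests = {**limits, **container.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource in requests or resource in limits:
                any_value = True
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_value:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "10m", "memory": "100Mi"},
                  "limits": {"cpu": "100m", "memory": "100Mi"}}]))  # Burstable
print(qos_class([{}]))  # BestEffort
```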

Quality of service impact

When a node is overloaded, BestEffort pods are killed first
Then, Burstable pods that exceed their requests
Burstable and Guaranteed pods below their requests are never killed

(except if their node fails)
If we only use Guaranteed pods, no pod should ever be killed

(as long as they stay within their limits)

(Pod QoS is also explained in this page of the Kubernetes documentation and in this blog post.)

Specifying resources

Resource requests are expressed at the container level
CPU is expressed in "virtual CPUs"

(corresponding to the virtual CPUs offered by some cloud providers)
CPU can be expressed with a decimal value, or even a "milli" suffix

(so 100m = 0.1)
Memory and ephemeral disk storage are expressed in bytes
These can have k, M, G, T, Ki, Mi, Gi, Ti suffixes

(corresponding to 10^3, 10^6, 10^9, 10^12, 2^10, 2^20, 2^30, 2^40)
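To make the suffixes concrete, here is a minimal sketch of a quantity parser. It only covers the suffixes listed above plus the "milli" suffix; it is not the parser used by Kubernetes, and the function name is hypothetical. Exact fractions are used to avoid floating point surprises.

```python
from fractions import Fraction

SUFFIXES = {
    "m": Fraction(1, 1000),
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_quantity(text):
    """Convert a quantity like "100m", "10M" or "100Mi" to an exact number."""
    # Try the two-letter suffixes first so that "Mi" is not read as "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Fraction(text)


print(parse_quantity("100m") == parse_quantity("0.1"))  # True
print(parse_quantity("10M"))    # 10000000
print(parse_quantity("100Mi"))  # 104857600
```

Note that the suffixes are case-sensitive: "m" is "milli", while "M" is "mega".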

Specifying resources in practice

This is what the spec of a Pod with resources will look like:

containers:
- name: blue
  image: jpetazzo/color
  resources:
    limits:
      cpu: "100m"
      ephemeral-storage: 10M
      memory: "100Mi"
    requests:
      cpu: "10m"
      ephemeral-storage: 10M
      memory: "100Mi"

This set of resources makes sure that this service won't be killed (as long as it stays below 100 MiB of RAM), but allows its CPU usage to be throttled if necessary.

Default values

If we specify a limit without a request:

the request is set to the limit
If we specify a request without a limit:

there will be no limit

(which means that the limit will be the size of the node)
If we don't specify anything:

the request is zero and the limit is the size of the node

Unless there are default values defined for our namespace!

We need to specify resource values

If we do not set resource values at all:
- the limit is "the size of the node"
- the request is zero
This is generally not what we want
- a container without a limit can use up all the resources of a node
- if the request is zero, the scheduler can't make a smart placement decision
This is fine when learning/testing, absolutely not in production!

How should we set resources?

Option 1: manually, for each container
- simple, effective, but tedious
Option 2: automatically, with the Vertical Pod Autoscaler (VPA)
- relatively simple, very minimal involvement beyond initial setup
- not compatible with HPAv1, can disrupt long-running workloads (see limitations)
Option 3: semi-automatically, with tools like Robusta KRR
- good compromise between manual work and automation
Option 4: by creating LimitRanges in our Namespaces
- relatively simple, but "one-size-fits-all" approach might not always work

Defining min, max, and default resources

We can create LimitRange objects to indicate any combination of:
- min and/or max resources allowed per pod
- default resource limits
- default resource requests
- maximal burst ratio (limit/request)
LimitRange objects are namespaced
They apply to their namespace only

LimitRange example

apiVersion: v1
kind: LimitRange
metadata:
  name: my-very-detailed-limitrange
spec:
  limits:
  - type: Container
    min:
      cpu: "100m"
    max:
      cpu: "2000m"
      memory: "1Gi"
    default:
      cpu: "500m"
      memory: "250Mi"
    defaultRequest:
      cpu: "500m"

Example explanation

The YAML on the previous slide shows an example LimitRange object specifying very detailed limits on CPU usage, and providing defaults on RAM usage.

Note the type: Container line: LimitRange entries can also use type: Pod, to constrain the total resources of all the containers in a pod.

LimitRange details

LimitRange restrictions are enforced only when a Pod is created

(they don't apply retroactively)
They don't prevent creation of e.g. an invalid Deployment or DaemonSet

(but the pods will not be created as long as the LimitRange is in effect)
If there are multiple LimitRange restrictions, they all apply together

(which means that it's possible to specify conflicting LimitRanges,
preventing any Pod from being created)
If a LimitRange specifies a max for a resource but no default,
that max value becomes the default limit too
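The defaulting and validation steps can be sketched as follows. This is a simplified model of what happens when a container is checked against a single LimitRange entry of type Container, not the actual admission code. Values are plain numbers (CPU in millicores, memory in MiB), the burst ratio is ignored, and the function name is hypothetical.

```python
def apply_limit_range(container, limit_range):
    """Return (requests, limits) after defaulting, or raise ValueError."""
    requests = dict(container.get("requests", {}))
    limits = dict(container.get("limits", {}))
    minimum = limit_range.get("min", {})
    maximum = limit_range.get("max", {})
    # A limit set on the container without a request also sets the request.
    for resource, value in limits.items():
        requests.setdefault(resource, value)
    # If there is a max but no default, the max becomes the default limit.
    default_limits = {**maximum, **limit_range.get("default", {})}
    default_requests = {**default_limits, **limit_range.get("defaultRequest", {})}
    for resource, value in default_limits.items():
        limits.setdefault(resource, value)
    for resource, value in default_requests.items():
        requests.setdefault(resource, value)
    for resource, value in requests.items():
        if value < minimum.get(resource, value):
            raise ValueError(f"{resource} request is below the minimum")
        if resource in limits and value > limits[resource]:
            raise ValueError(f"{resource} request is above the limit")
    for resource, value in limits.items():
        if value > maximum.get(resource, value):
            raise ValueError(f"{resource} limit is above the maximum")
    return requests, limits


# Same values as the LimitRange example (memory expressed in MiB).
example = {
    "min": {"cpu": 100},
    "max": {"cpu": 2000, "memory": 1024},
    "default": {"cpu": 500, "memory": 250},
    "defaultRequest": {"cpu": 500},
}
print(apply_limit_range({}, example))
# ({'cpu': 500, 'memory': 250}, {'cpu': 500, 'memory': 250})
```

With this model, a container that requests 1000m of CPU without setting a limit is rejected, because the default limit (500m) ends up below its request.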

Namespace quotas

We can also set quotas per namespace
Quotas apply to the total usage in a namespace

(e.g. total CPU limits of all pods in a given namespace)
Quotas can apply to resource limits and/or requests

(like the CPU and memory limits that we saw earlier)
Quotas can also apply to other resources:
- "extended" resources (like GPUs)
- storage size
- number of objects (number of pods, services...)

Creating a quota for a namespace

Quotas are enforced by creating a ResourceQuota object
ResourceQuota objects are namespaced, and apply to their namespace only
We can have multiple ResourceQuota objects in the same namespace
The most restrictive values are used

Limiting total CPU/memory usage

The following YAML specifies an upper bound for limits and requests:

  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: a-little-bit-of-compute
  spec:
    hard:
      requests.cpu: "10"
      requests.memory: 10Gi
      limits.cpu: "20"
      limits.memory: 20Gi

These quotas will apply to the namespace where the ResourceQuota is created.
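The check performed when a new pod is created can be sketched like this. It is a simplified model with plain numbers (CPU in cores, memory in GiB) and a hypothetical function name. It also assumes that a pod is rejected when it doesn't specify a value for a resource covered by the quota.

```python
def within_quota(hard, used, new_pod):
    """Return True if new_pod can be admitted without exceeding the quota."""
    for key, maximum in hard.items():
        if key not in new_pod:
            # The quota tracks this resource, so the pod must declare it.
            return False
        if used.get(key, 0) + new_pod[key] > maximum:
            return False
    return True


# Same values as the ResourceQuota above.
hard = {"requests.cpu": 10, "requests.memory": 10,
        "limits.cpu": 20, "limits.memory": 20}
# Hypothetical current usage of the namespace.
used = {"requests.cpu": 9, "requests.memory": 4,
        "limits.cpu": 12, "limits.memory": 8}

small = {"requests.cpu": 1, "requests.memory": 1,
         "limits.cpu": 2, "limits.memory": 2}
large = {"requests.cpu": 2, "requests.memory": 1,
         "limits.cpu": 2, "limits.memory": 2}
print(within_quota(hard, used, small))  # True
print(within_quota(hard, used, large))  # False
```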

Limiting number of objects

The following YAML specifies how many objects of specific types can be created:

  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: quota-for-objects
  spec:
    hard:
      pods: 100
      services: 10
      secrets: 10
      configmaps: 10
      persistentvolumeclaims: 20
      services.nodeports: 0
      services.loadbalancers: 0
      count/roles.rbac.authorization.k8s.io: 10

(The count/ syntax allows limiting arbitrary objects, including CRDs.)

YAML vs CLI

Quotas can be created with a YAML definition
...Or with the kubectl create quota command

Example:

kubectl create quota my-resource-quota --hard=pods=300,limits.memory=300Gi

With both the YAML and CLI forms, the values are always under the hard section

(there is no soft quota)

Viewing current usage

When a ResourceQuota is created, we can see how much of it is used:

kubectl describe resourcequota my-resource-quota

Name:                            my-resource-quota
Namespace:                       default
Resource                         Used  Hard
--------                         ----  ----
pods                             12    100
services                         1     5
services.loadbalancers           0     0
services.nodeports               0     0

Advanced quotas and PriorityClass

Pods can have a priority
The priority is a number from 0 to 1000000000

(or even higher for system-defined priorities)
High number = high priority = "more important" Pod
Pods with a higher priority can preempt Pods with lower priority

(= low priority pods will be evicted if needed)
Useful when mixing workloads in resource-constrained environments

Setting the priority of a Pod

Create a PriorityClass

(or use an existing one)
When creating the Pod, set the field spec.priorityClassName
If the field is not set:
- if there is a PriorityClass with globalDefault, it is used
- otherwise, the default priority will be zero
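The resolution order described above can be written as a short sketch. The data structures are plain dictionaries rather than real API objects, and the function name and PriorityClass names are hypothetical.

```python
def effective_priority(pod_spec, priority_classes):
    """priority_classes maps a PriorityClass name to its definition."""
    name = pod_spec.get("priorityClassName")
    if name is not None:
        return priority_classes[name]["value"]
    for priority_class in priority_classes.values():
        if priority_class.get("globalDefault"):
            return priority_class["value"]
    return 0


classes = {
    "batch": {"value": 1000},
    "standard": {"value": 100000, "globalDefault": True},
}
print(effective_priority({"priorityClassName": "batch"}, classes))  # 1000
print(effective_priority({}, classes))                              # 100000
print(effective_priority({}, {"batch": {"value": 1000}}))           # 0
```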

PriorityClass and ResourceQuotas

A ResourceQuota can include a list of scopes or a scope selector
In that case, the quota will only apply to the scoped resources
Example: limit the resources allocated to "high priority" Pods
In that case, make sure that the quota is created in every Namespace

(or use admission configuration to enforce it)
See the resource quotas documentation for details

Limiting resources in practice

We have at least three mechanisms:
- requests and limits per Pod
- LimitRange per namespace
- ResourceQuota per namespace
Let's see one possible strategy to get started with resource limits

Set a LimitRange

In each namespace, create a LimitRange object
Set a small default CPU request and CPU limit

(e.g. "100m")
Set a default memory request and limit depending on your most common workload
- for Java, Ruby: start with "1G"
- for Go, Python, PHP, Node: start with "250M"
Set upper bounds slightly below your expected node size

(80-90% of your node size, with at least a 500M memory buffer)

Set a ResourceQuota

In each namespace, create a ResourceQuota object
Set generous CPU and memory limits

(e.g. half the cluster size if the cluster hosts multiple apps)
Set generous object limits
- these limits should not be here to constrain your users
- they should catch a runaway process creating many resources
- example: a custom controller creating many pods

Observe, refine, iterate

Observe the resource usage of your pods

(we will see how in the next chapter)
Adjust individual pod limits
If you see trends: adjust the LimitRange

(rather than adjusting every individual set of pod limits)
Observe the resource usage of your namespaces

(with kubectl describe resourcequota ...)
Rinse and repeat regularly

Underutilization

Remember: when assigning a pod to a node, the scheduler looks at requests

(not at current utilization on the node)
If pods request resources but don't use them, this can lead to underutilization

(because the scheduler will consider that the node is full and can't fit new pods)
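The gap between what is requested and what is used can be quantified with a sketch like the one below. All the numbers are hypothetical (CPU in millicores), and the function name is made up for this example.

```python
def node_summary(allocatable, pods):
    """Compare what pods requested with what they actually use on a node."""
    requested = sum(pod["request"] for pod in pods)
    used = sum(pod["usage"] for pod in pods)
    return {
        "requested_pct": 100 * requested / allocatable,
        "used_pct": 100 * used / allocatable,
    }


# Four pods requesting 1 CPU each, but using only 0.1 CPU each.
pods = [{"request": 1000, "usage": 100} for _ in range(4)]
print(node_summary(4000, pods))
# {'requested_pct': 100.0, 'used_pct': 10.0}
```

In this scenario, the node is full from the scheduler's point of view, even though 90% of its CPU is idle.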

Viewing a namespace's limits and quotas

kubectl describe namespace will display resource limits and quotas

Try it out:
```
kubectl describe namespace default
```
View limits and quotas for all namespaces:
```
kubectl describe namespace
```

Additional resources

A Practical Guide to Setting Kubernetes Requests and Limits
- explains what requests and limits are
- provides guidelines to set requests and limits
- gives PromQL expressions to compute good values
  (our app needs to be running for a while)
Kube Resource Report
- generates web reports on resource usage
nsinjector
- controller to automatically populate a Namespace when it is created

