Autoscaling with KEDA

- Cluster autoscaling = automatically add nodes when needed
- When needed = when Pods are Pending
- How do these Pods get created?
- When the Ollama Deployment is scaled up
  - ... manually (e.g. `kubectl scale`)
  - ... automatically (that's what we want to investigate now!)

Ways to implement autoscaling

- Custom code
  (e.g. a cron job checking some value every few minutes and scaling accordingly)
- Kubernetes Horizontal Pod Autoscaler v1
  (aka `kubectl autoscale`)
- Kubernetes Horizontal Pod Autoscaler v2 with custom metrics
  (e.g. with the Prometheus adapter)
- Kubernetes Horizontal Pod Autoscaler v2 with external metrics
  (e.g. with KEDA)

Custom code

- No, we're not going to do that!
- But it would be an interesting exercise in RBAC
  (granting the minimal set of permissions to the pod running our custom code)

HPAv1

Pros: very straightforward

Cons: can only scale on CPU utilization

How it works:

- periodically measures average CPU utilization across Pods
- if utilization is above/below the target (default: 80%), scales up/down

HPAv1 in practice

- Create the autoscaling policy:
  `kubectl autoscale deployment ollama --max=1000`
  (The `--max` flag is required; it's a safety limit.)
- Check it:
  `kubectl describe hpa`
- Send traffic, wait a bit: Pods should be created automatically
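
For reference, the command above is roughly equivalent to creating the following `autoscaling/v1` manifest (the 80% CPU target is the default applied when none is specified):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 1000                     # matches --max=1000; required safety limit
  targetCPUUtilizationPercentage: 80
```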

HPAv2 custom vs external

- Custom metrics = arbitrary metrics attached to Kubernetes objects
- External metrics = arbitrary metrics not related to Kubernetes objects

--

🤔

HPAv2 custom metrics

- Examples:
  - on Pods: CPU, RAM, network traffic...
  - on an Ingress: requests per second, HTTP status codes, request duration...
  - on some worker Deployment: number of tasks processed, task duration...

- Requires an adapter to:
  - expose the metrics through the Kubernetes aggregation layer
  - map the actual metrics source to Kubernetes objects

- Example: the Prometheus adapter
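
We won't set this up here, but for illustration, a custom-metrics HPA looks roughly like this (the `http_requests_per_second` metric name is hypothetical; it would have to be exposed on the Pods by the adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric served by the adapter
      target:
        type: AverageValue
        averageValue: "100"              # i.e. target ~100 requests/second per Pod
```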

HPAv2 custom metrics in practice

- We're not going to cover this here
  (too complex / not enough time!)
- If you want more details, check my other course material

HPAv2 external metrics

- Examples:
  - an arbitrary Prometheus query
  - an arbitrary SQL query
  - the number of messages in a queue
  - and many, many more

- Also requires an extra component to expose the metrics

- Example: KEDA (https://keda.sh/)
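
Under the hood, these providers surface `External` metrics to an `autoscaling/v2` HPA. Here is a sketch of what such an HPA looks like; the metric name and selector are illustrative (in our case, KEDA will create and manage a similar HPA for us):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-external              # illustrative; KEDA manages its own HPA objects
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: redis_list_length      # illustrative name exposed by an external metrics provider
        selector:
          matchLabels:
            list: cities
      target:
        type: AverageValue
        averageValue: "10"
```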

HPAv2 external metrics in practice

- We're going to install KEDA
- And set it up to autoscale depending on the number of messages in Redis

Installing KEDA

Multiple options (details in the documentation):

- YAML
- Operator Hub
- Helm chart 💡

```bash
helm upgrade --install --repo https://kedacore.github.io/charts \
    --namespace keda-system --create-namespace keda keda
```

Scaling according to Redis

- We need to create a KEDA Scaler
- This is done with a "ScaledObject" manifest
- Check the documentation of the Redis Lists Scaler for the available options
- Let's write that manifest!

keda-redis-scaler.yaml

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama
spec:
  scaleTargetRef:
    name: ollama
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379
      listName: cities
      listLength: "10"
```

Notes

- We need to update the `address` field with our namespace
  (unless we are running in the `default` namespace)
- Alternative: use `addressFromEnv` and set an env var in the Ollama Pods
  (see the sketch below)
- `listLength` gives the target ratio of messages / replicas
- In our example, KEDA will scale the Deployment to messages / 10 (rounded up!)
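
A minimal sketch of that alternative; `REDIS_ADDRESS` is an arbitrary variable name chosen for this example:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama
spec:
  scaleTargetRef:
    name: ollama
  triggers:
  - type: redis
    metadata:
      # Resolved from the environment of the target Deployment's containers.
      addressFromEnv: REDIS_ADDRESS
      listName: cities
      listLength: "10"
```

The Ollama Deployment would then define `REDIS_ADDRESS` (e.g. `redis.<our namespace>.svc:6379`) in its container spec.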

Trying it out

- Apply the ScaledObject manifest
- Start a Bento pipeline loading e.g. 100-1000 cities into Redis
  (100 on smaller clusters / slower CPUs, 1000 on bigger / faster ones; see the loader sketch below)
- Check Pod and Node resource usage
- What do we see?

--

🤩 The Deployment scaled up automatically!

--

🤔 But Pod resource usage remains very low (a few busy Pods, many idle)

--

💡 Bento doesn't submit enough requests in parallel!
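
For reference, the loader pipeline mentioned above could look roughly like this; the Bloblang mapping and Redis URL are placeholders and should match the actual dataset and namespace used:

```yaml
# bento-loader.yaml - hypothetical sketch of a pipeline filling the "cities" list
input:
  generate:
    count: 100                                        # or 1000 on bigger / faster clusters
    interval: ""                                      # emit as fast as possible, then stop
    mapping: 'root = "city-" + counter().string()'    # placeholder for real city names
output:
  redis_list:
    url: redis://redis.default.svc:6379               # adjust to your namespace
    key: cities
```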

Improving throughput

We're going to review multiple techniques:

- Increase parallelism inside the Bento pipeline
- Run multiple Bento consumers
- Couple consumers and processors more tightly

1️⃣ Increase pipeline parallelism

- Set `parallel` to `true` in the `http` processor
- Wrap the input in a `batched` input
  (otherwise, we don't have enough messages in flight)
- Increase the `http` timeout significantly (e.g. to 5 minutes)
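
A sketch of those three changes in the consumer configuration; the URLs, batching policy, and output are illustrative and should be adapted to the pipeline we already have:

```yaml
input:
  batched:                          # wrap the real input to get batches of in-flight messages
    child:
      redis_list:
        url: redis://redis.default.svc:6379    # adjust to your namespace
        key: cities
    policy:
      count: 50                     # illustrative batch size
      period: 1s
pipeline:
  processors:
    - http:
        url: http://ollama/api/generate        # illustrative; keep the pipeline's existing endpoint
        verb: POST
        timeout: 5m                 # much longer than the default: LLM requests are slow
        parallel: true              # send the messages of a batch in parallel
output:
  drop: {}                          # illustrative; keep the pipeline's existing output
```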
Results
🎉 More messages flow through the pipeline
🎉 Many requests happen in parallel
🤔 Average Pod and Node CPU utilization is higher, but not maxed out
🤔 HTTP queue size (measured with HAProxy metrics) is relatively high
🤔 Latency is higher too
Why?

Too many requests in parallel

- Earlier, we didn't have enough...
- ...now, we have too many!
- However, for a very big request queue, it still wouldn't be enough

💡 We currently have fixed parallelism. We need to make it dynamic!

2️⃣ Run multiple Bento consumers

- Restore the original Bento configuration
  (flip `parallel` back to `false`; remove the `batched` input)
- Run Bento in a Deployment
  (e.g. with the Bento Helm chart)
- Autoscale that Deployment like we autoscaled the Ollama Deployment
  (see the sketch below)
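
A sketch of that second ScaledObject, assuming the consumer Deployment is named `bento` (adjust the name and address to your setup):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: bento
spec:
  scaleTargetRef:
    name: bento                          # assumed name of the Bento consumer Deployment
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379    # adjust to your namespace
      listName: cities
      listLength: "10"
```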
Results
🤔🤔🤔 Pretty much the same as before!
(High throughput, high utilization but not maxed out, high latency...)
--
🤔🤔🤔 Why?

Unbalanced load balancing

- All our requests go through the `ollama` Service
- We're still using the default Kubernetes service proxy!
- It doesn't spread the requests properly across all the backends
3️⃣ Couple consumers and processors
What if:
--
instead of sending requests to a load balancer,
--
each queue consumer had its own Ollama instance?

Current architecture

```mermaid
flowchart LR
  subgraph P1["Pod"]
    H1["HAProxy"] --> O1["Ollama"]
  end
  subgraph P2["Pod"]
    H2["HAProxy"] --> O2["Ollama"]
  end
  subgraph P3["Pod"]
    H3["HAProxy"] --> O3["Ollama"]
  end
  Q["Queue<br/>(Redis)"] <--> C["Consumer<br/>(Bento)"] --> LB["Load Balancer<br/>(kube-proxy)"]
  LB --> H1 & H2 & H3
```

Proposed architecture

```mermaid
flowchart LR
  subgraph P1["Consumer Pod"]
    C1["Bento"] --> H1["HAProxy"] --> O1["Ollama"]
  end
  subgraph P2["Consumer Pod"]
    C2["Bento"] --> H2["HAProxy"] --> O2["Ollama"]
  end
  subgraph P3["Consumer Pod"]
    C3["Bento"] --> H3["HAProxy"] --> O3["Ollama"]
  end
  Queue["Queue"] <--> C1 & C2 & C3
```

🏗️ Let's build something!

- Let's implement that architecture!
- See the next slides for hints / getting started

Hints

We need to:

- Update the Bento consumer configuration to talk to localhost
- Store that configuration in a ConfigMap
- Add a Bento container to the Ollama Deployment
  (see the sketch below)
- Profit!
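
A hedged sketch of what that could look like; the HAProxy port, Bento image, and pipeline details are assumptions and should be adapted to the actual setup:

```yaml
# ConfigMap holding the per-Pod consumer pipeline.
apiVersion: v1
kind: ConfigMap
metadata:
  name: bento-consumer
data:
  bento.yaml: |
    input:
      redis_list:
        url: redis://redis.default.svc:6379          # adjust to your namespace
        key: cities
    pipeline:
      processors:
        - http:
            url: http://localhost:8000/api/generate  # assumed HAProxy port in the same Pod
            verb: POST
            timeout: 5m
    output:
      drop: {}                                       # illustrative; keep the real output
```

Then, in the Ollama Deployment's Pod template (`spec.template.spec`), add something like:

```yaml
# Additions only; the existing HAProxy and Ollama containers stay as they are.
containers:
- name: bento
  image: ghcr.io/warpstreamlabs/bento    # assumed image name
  args: ["-c", "/bento/bento.yaml"]
  volumeMounts:
  - name: bento-config
    mountPath: /bento
volumes:
- name: bento-config
  configMap:
    name: bento-consumer
```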
Results
🎉 Node and Pod utilization is maximized
🎉 HTTP queue size is bounded
🎉 Deployment autoscales up and down

⚠️ Scaling down

- Eventually, there are fewer messages in the queue
- The HPA scales down the Ollama Deployment
- This terminates some Ollama Pods

🤔 What happens if these Pods were processing requests?

--

- The requests might be lost!

Avoiding lost messages

Option 1:

- cleanly shut down the consumer
- make sure that Ollama can complete in-flight requests
  (by extending its grace period; see the sketch below)
- find a way to terminate Ollama when no more requests are in flight

Option 2:

- use message acknowledgement
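
A minimal sketch of the grace-period part of option 1, to merge into the Ollama Pod template; the 10-minute value is illustrative, and the preStop hook is a hypothetical placeholder for the "wait until idle" logic:

```yaml
# Under the Deployment's spec.template.spec:
terminationGracePeriodSeconds: 600        # longer than the slowest expected request
containers:
- name: ollama
  image: ollama/ollama
  lifecycle:
    preStop:
      exec:
        # Hypothetical hook: wait here until in-flight requests are done
        # (e.g. by polling HAProxy's stats), then exit so the Pod can terminate.
        command: ["sh", "-c", "sleep 30"]
```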