
Autoscaling with KEDA

  • Cluster autoscaling = automatically add nodes when needed

  • When needed = when Pods are Pending

  • How do these pods get created?

  • When the Ollama Deployment is scaled up

    • ... manually (e.g. kubectl scale)

    • ... automatically (that's what we want to investigate now!)


Ways to implement autoscaling

  • Custom code

    (e.g. crontab checking some value every few minutes and scaling accordingly)

  • Kubernetes Horizontal Pod Autoscaler v1

    (aka kubectl autoscale)

  • Kubernetes Horizontal Pod Autoscaler v2 with custom metrics

    (e.g. with Prometheus Adapter)

  • Kubernetes Horizontal Pod Autoscaler v2 with external metrics

    (e.g. with KEDA)


Custom code

  • No, we're not going to do that!

  • But this would be an interesting exercise in RBAC

    (granting the minimal set of permissions to the pod running our custom code)


HPAv1

Pros: very straightforward

Cons: can only scale on CPU utilization

How it works:

  • periodically measures average CPU utilization across pods

  • if utilization is above/below a target (default: 80%), scale up/down
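
    (For instance, with the standard HPA formula desiredReplicas = ceil(currentReplicas × currentUtilization / target): 4 replicas averaging 90% CPU with an 80% target gives ceil(4 × 90 / 80) = 5 replicas.)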


HPAv1 in practice

  • Create the autoscaling policy:

    kubectl autoscale deployment ollama --max=1000
    

    (The --max is required; it's a safety limit. See the equivalent manifest below.)

  • Check it:

    kubectl describe hpa
    
  • Send traffic, wait a bit: pods should be created automatically
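
For reference, the command above creates an HPA object roughly equivalent to this manifest (a sketch; the 80% value is the HPA controller's default CPU target, which the generated object may leave implicit):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  maxReplicas: 1000
  targetCPUUtilizationPercentage: 80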


HPAv2 custom vs external

  • Custom metrics = arbitrary metrics attached to Kubernetes objects

  • External metrics = arbitrary metrics not related to Kubernetes objects

--

🤔


HPAv2 custom metrics

  • Examples:

    • on Pods: CPU, RAM, network traffic...

    • on Ingress: requests per second, HTTP status codes, request duration...

    • on some worker Deployment: number of tasks processed, task duration...

  • Requires an adapter to:

    • expose the metrics through the Kubernetes aggregation layer

    • map the actual metrics source to Kubernetes objects

Example: the Prometheus adapter


HPAv2 custom metrics in practice

  • We're not going to cover this here

    (too complex / not enough time!)

  • If you want more details, check my other course material


HPAv2 external metrics

  • Examples:

    • arbitrary Prometheus query

    • arbitrary SQL query

    • number of messages in a queue

    • and many, many more

  • Also requires an extra component to expose the metrics

Example: KEDA (https://keda.sh/)


HPAv2 external metrics in practice

  • We're going to install KEDA

  • And set it up to autoscale depending on the number of messages in Redis


Installing KEDA

Multiple options (details in the documentation):

  • YAML

  • Operator Hub

  • Helm chart 💡

helm upgrade --install --repo https://kedacore.github.io/charts \
 --namespace keda-system --create-namespace keda keda
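
We can then check that the KEDA pods are up and running:

kubectl get pods --namespace keda-system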

Scaling according to Redis

  • We need to create a KEDA Scaler

  • This is done with a "ScaledObject" manifest

  • Here is the documentation for the Redis Lists Scaler

  • Let's write that manifest!


keda-redis-scaler.yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama
spec:
  scaleTargetRef:
    name: ollama
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379
      listName: cities
      listLength: "10"

Notes

  • We need to update the address field with our namespace

    (unless we are running in the default namespace)

  • Alternative: use addressFromEnv and set an env var in the Ollama pods

  • listLength gives the target ratio of messages / replicas

  • In our example, KEDA will scale the Deployment to messages / 10

    (rounded up!)
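
(For instance: 42 messages in the cities list → ceil(42 / 10) = 5 replicas.)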


Trying it out

  • Apply the ScaledObject manifest (kubectl apply -f keda-redis-scaler.yaml)

  • Start a Bento pipeline loading e.g. 100-1000 cities in Redis

    (100 on smaller clusters / slower CPUs, 1000 on bigger / faster ones)

  • Check pod and node resource usage

  • What do we see?

--

🤩 The Deployment scaled up automatically!

--

🤔 But Pod resource usage remains very low (a few busy pods, many idle)

--

💡 Bento doesn't submit enough requests in parallel!


Improving throughput

We're going to review multiple techniques:

  1. Increase parallelism inside the Bento pipeline.

  2. Run multiple Bento consumers.

  3. Couple consumers and processors more tightly.


1 Increase pipeline parallelism

  • Set parallel to true in the http processor

  • Wrap the input in a batched input

    (otherwise, we don't have enough messages in flight)

  • Increase the http timeout significantly (e.g. to 5 minutes; see the config sketch below)
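
Here is a sketch of what these changes could look like in the Bento configuration (the Redis address, list name, Ollama endpoint, and batch policy values are assumptions based on our setup):

input:
  batched:
    child:
      redis_list:
        url: redis://redis:6379   # assumed Redis Service address
        key: cities
    policy:
      count: 50    # keep plenty of messages in flight
      period: 1s
pipeline:
  processors:
    - http:
        url: http://ollama/api/generate   # assumed Ollama endpoint
        verb: POST
        parallel: true   # send the whole batch of requests in parallel
        timeout: 5m      # LLM requests can take a while
# (output section unchanged from our original pipeline)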


Results

🎉 More messages flow through the pipeline

🎉 Many requests happen in parallel

🤔 Average Pod and Node CPU utilization is higher, but not maxed out

🤔 HTTP queue size (measured with HAProxy metrics) is relatively high

🤔 Latency is higher too

Why?


Too many requests in parallel

  • Earlier, we didn't have enough...

  • ...now, we have too many!

  • However, for a very big request queue, it still wouldn't be enough

💡 We currently have a fixed parallelism. We need to make it dynamic!


2 Run multiple Bento consumers

  • Restore the original Bento configuration

    (flip parallel back to false; remove the batched input)

  • Run Bento in a Deployment

    (e.g. with the Bento Helm chart)

  • Autoscale that Deployment like we autoscaled the Ollama Deployment

    (see the ScaledObject example below)
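
This follows the same pattern as before (a sketch; it assumes that the chart created a Deployment named bento):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: bento
spec:
  scaleTargetRef:
    name: bento   # assumed Deployment name
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379   # update with your namespace
      listName: cities
      listLength: "10"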


Results

🤔🤔🤔 Pretty much the same as before!

(High throughput, high utilization but not maxed out, high latency...)

--

🤔🤔🤔 Why?


Unbalanced load balancing

  • All our requests go through the ollama Service

  • We're still using the default Kubernetes service proxy!

  • It doesn't spread the requests properly across all the backends

    (it balances connections, not individual requests; with a few long-lived keep-alive connections, some backends get most of the traffic while others sit idle)


3 Couple consumers and processors

What if:

--

instead of sending requests to a load balancer,

--

each queue consumer had its own Ollama instance?


Current architecture

flowchart LR
  subgraph P1["Pod"]
    H1["HAProxy"] --> O1["Ollama"]
  end
  subgraph P2["Pod"]
    H2["HAProxy"] --> O2["Ollama"]
  end
  subgraph P3["Pod"]
    H3["HAProxy"] --> O3["Ollama"]
  end
  Q["Queue
(Redis)"] <--> C["Consumer
(Bento)"] --> LB["Load Balancer
(kube-proxy)"] LB --> H1 & H2 & H3

Proposed architecture

flowchart LR
  subgraph P1["Consumer Pod"]
    C1["Bento"] --> H1["HAProxy"] --> O1["Ollama"]
  end
  subgraph P2["Consumer Pod"]
    C2["Bento"] --> H2["HAProxy"] --> O2["Ollama"]
  end
  subgraph P3["Consumer Pod"]
    C3["Bento"] --> H3["HAProxy"] --> O3["Ollama"]
  end
  Queue["Queue"] <--> C1 & C2 & C3

🏗️ Let's build something!

  • Let's implement that architecture!

  • See next slides for hints / getting started


Hints

We need to:

  • Update the Bento consumer configuration to talk to localhost

  • Store that configuration in a ConfigMap

  • Add a Bento container to the Ollama Deployment (see the sketch below)

  • Profit!
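
Here is a sketch of the ConfigMap mount and sidecar container (the ConfigMap name, mount path, and image location are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  template:
    spec:
      containers:
      # ...the existing Ollama (and HAProxy) containers stay as-is...
      - name: bento
        image: ghcr.io/warpstreamlabs/bento   # assumed image location
        args: ["-c", "/bento/config.yaml"]
        volumeMounts:
        - name: bento-config
          mountPath: /bento
      volumes:
      - name: bento-config
        configMap:
          name: bento-config   # holds the "talk to localhost" config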


Results

🎉 Node and Pod utilization is maximized

🎉 HTTP queue size is bounded

🎉 Deployment autoscales up and down


⚠️ Scaling down

  • Eventually, there are fewer messages in the queue

  • The HPA scales down the Ollama Deployment

  • This terminates some Ollama Pods

🤔 What happens if these Pods were processing requests?

--

  • The requests might be lost!

Avoiding lost messages

Option 1:

  • cleanly shut down the consumer

  • make sure that Ollama can complete in-flight requests

    (by extending its grace period; see the snippet below)

  • find a way to terminate Ollama when no more requests are in flight
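
A minimal sketch of the grace period part (600 seconds is an arbitrary value; it should comfortably exceed our longest request):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600   # default is 30 seconds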

Option 2:

  • use message acknowledgement

    (e.g. with Redis Streams, where messages stay pending until they are acknowledged)