Add MLops material for QCON SF 2024

Jérôme Petazzoni
2024-11-18 19:21:18 -06:00
parent 7305bcfe12
commit 0abc67e974
13 changed files with 2908 additions and 0 deletions

slides/k8s/bento-cnpg.md Normal file

@@ -0,0 +1,173 @@
# Bento & PostgreSQL
- Bento can also use SQL databases for input/output
- We're going to demonstrate that by writing to a PostgreSQL database
- That database will be deployed with the CloudNativePG operator
(https://cloudnative-pg.io/)
---
## CNPG in a nutshell
- Free, open source
- Originally created by [EDB] (EnterpriseDB, well-known PgSQL experts)
- Non-exhaustive list of features:
- provisioning of Postgres servers, replicas, bouncers
- automatic failover
- backups (full backups and WAL shipping)
- provisioning from scratch, from backups, PITR
- manual and automated switchover (e.g. for node maintenance)
- and many more!
[EDB]: https://www.enterprisedb.com/workload/kubernetes
---
## What we're going to do
1. Install CNPG.
2. Provision a Postgres cluster.
3. Configure Bento to write to that cluster.
4. Set up a Grafana dashboard to see the data.
---
## 1⃣ Installing CNPG
Many options available, see the [documentation][cnpg-install]:
- raw YAML manifests
- kubectl CNPG plugin (`kubectl cnpg install generate`)
- Helm chart
- OLM
[cnpg-install]: https://cloudnative-pg.io/documentation/1.24/installation_upgrade/
---
## 2⃣ Provisioning a Postgres cluster
Minimal manifest:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: db
spec:
storage:
size: 1Gi
```
---
class: extra-details
## For production...
We might also add:
- `spec.monitoring.enablePodMonitor: true`
- `spec.instances: 2`
- `resources.{requests,limits}.{cpu,memory}`
- `walStorage.size`
- `backup`
- `postgresql.parameters`
See [this manifest][cluster-maximal] for a detailed example.
[cluster-maximal]: https://github.com/jpetazzo/pozok/blob/main/cluster-maximal.yaml
---
## 3⃣ Configuring Bento to write to SQL
- We'll use the [`sql_insert`][sql-insert] output
- If our cluster is named `mydb`, there will be a Secret `mydb-app`
- This Secret will contain a `uri` field
- That field can be used as the `dsn` in the Bento configuration
- We will also need to create the table that we want to use
(see next slide for instructions)
[sql-insert]: https://warpstreamlabs.github.io/bento/docs/components/outputs/sql_insert
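A sketch of that `output` section, assuming the Secret's `uri` field is exposed to Bento as a (hypothetical) `PG_URI` environment variable, and that we only store the city name and population:
```yaml
output:
  sql_insert:
    driver: postgres
    dsn: "${PG_URI}"   # the `uri` field of the `mydb-app` Secret
    table: cities
    columns: [ city, population ]
    args_mapping: root = [ this.city, this.population.int64() ]
    init_statement: |
      CREATE TABLE IF NOT EXISTS cities (
        city varchar(100) NOT NULL,
        population integer
      );
```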
---
## Creating a table
- If we just want to store the city name and its population:
```sql
CREATE TABLE IF NOT EXISTS cities (
city varchar(100) NOT NULL,
population integer
);
```
- This statement can be executed:
- manually, by getting a `psql` shell with `kubectl cnpg psql mydb app`
- automatically, with Bento's `init_statement`
---
## 4⃣ Viewing the table in Grafana
- In Grafana, in the home menu on the left, click "Connections"
- Add a PostgreSQL data source
- Enter the host:port, database, user, password
- Then add a visualization using that data source
(it should be relatively self-explanatory!)
---
class: extra-details
## Automating it all
- Expose PostgreSQL credentials through environment variables
(in the Bento container)
- Use the `${...}` syntax in Bento to use these environment variables
- Export the Grafana dashboard to a JSON file
- Store the JSON file in a ConfigMap, with label `grafana_dashboard=1`
- Create that ConfigMap in the namespace where Grafana is running
- Similarly, data sources (like the Redis and the PostgreSQL one) can be defined in YAML
- And that YAML can be put in a ConfigMap with label `grafana_datasource=1`
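For example, a dashboard ConfigMap could look roughly like this (a sketch; the JSON is whatever we exported from Grafana, and the namespace is wherever Grafana runs):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cities-dashboard
  namespace: prom-system
  labels:
    grafana_dashboard: "1"
data:
  cities-dashboard.json: |
    {"title": "Cities", "panels": []}
```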

slides/k8s/bento-hpa.md Normal file

@@ -0,0 +1,450 @@
# Autoscaling with KEDA
- Cluster autoscaling = automatically add nodes *when needed*
- *When needed* = when Pods are `Pending`
- How do these pods get created?
- When the Ollama Deployment is scaled up
- ... manually (e.g. `kubectl scale`)
- ... automatically (that's what we want to investigate now!)
---
## Ways to implement autoscaling
- Custom code
(e.g. crontab checking some value every few minutes and scaling accordingly)
- Kubernetes Horizontal Pod Autoscaler v1
(aka `kubectl autoscale`)
- Kubernetes Horizontal Pod Autoscaler v2 with custom metrics
(e.g. with Prometheus Adapter)
- Kubernetes Horizontal Pod Autoscaler v2 with external metrics
(e.g. with KEDA)
---
## Custom code
- No, we're not going to do that!
- But this would be an interesting exercise in RBAC
(setting minimal amount of permissions for the pod running our custom code)
---
## HPAv1
Pros: very straightforward
Cons: can only scale on CPU utilization
How it works:
- periodically measures average CPU *utilization* across pods
- if utilization is above/below a target (default: 80%), scale up/down
---
## HPAv1 in practice
- Create the autoscaling policy:
```bash
kubectl autoscale deployment ollama --max=1000
```
(The `--max` is required; it's a safety limit.)
- Check it:
```bash
kubectl describe hpa
```
- Send traffic, wait a bit: pods should be created automatically
---
## HPAv2 custom vs external
- Custom metrics = arbitrary metrics attached to Kubernetes objects
- External metrics = arbitrary metrics not related to Kubernetes objects
--
🤔
---
## HPAv2 custom metrics
- Examples:
- on Pods: CPU, RAM, network traffic...
- on Ingress: requests per second, HTTP status codes, request duration...
- on some worker Deployment: number of tasks processed, task duration...
- Requires an *adapter* to:
- expose the metrics through the Kubernetes *aggregation layer*
- map the actual metrics source to Kubernetes objects
Example: the [Prometheus adapter][prometheus-adapter]
[prometheus-adapter]: https://github.com/kubernetes-sigs/prometheus-adapter
---
## HPAv2 custom metrics in practice
- We're not going to cover this here
(too complex / not enough time!)
- If you want more details, check [my other course material][hpav2slides]
[hpav2slides]: https://2024-10-enix.container.training/4.yml.html#toc-scaling-with-custom-metrics
---
## HPAv2 external metrics
- Examples:
- arbitrary Prometheus query
- arbitrary SQL query
- number of messages in a queue
- and [many, many more][keda-scalers]
- Also requires an extra component to expose the metrics
Example: [KEDA (https://keda.sh/)](https://keda.sh)
[keda-scalers]: https://keda.sh/docs/latest/scalers/
---
## HPAv2 external metrics in practice
- We're going to install KEDA
- And set it up to autoscale depending on the number of messages in Redis
---
## Installing KEDA
Multiple options (details in the [documentation][keda-deploy]):
- YAML
- Operator Hub
- Helm chart 💡
```bash
helm upgrade --install --repo https://kedacore.github.io/charts \
--namespace keda-system --create-namespace keda keda
```
[keda-deploy]: https://keda.sh/docs/latest/deploy/
---
## Scaling according to Redis
- We need to create a KEDA Scaler
- This is done with a "ScaledObject" manifest
- [Here is the documentation][keda-redis-lists] for the Redis Lists Scaler
- Let's write that manifest!
[keda-redis-lists]: https://keda.sh/docs/latest/scalers/redis-lists/
---
## `keda-redis-scaler.yaml`
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ollama
spec:
scaleTargetRef:
name: ollama
triggers:
- type: redis
metadata:
address: redis.`default`.svc:6379
listName: cities
listLength: "10"
```
---
## Notes
- We need to update the `address` field with our namespace
(unless we are running in the `default` namespace)
- Alternative: use `addressFromEnv` and set an env var in the Ollama pods
- `listLength` gives the target ratio of `messages / replicas`
- In our example, KEDA will scale the Deployment to `messages / 10`
(rounded up!)
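For instance, with `addressFromEnv`, the trigger could look like this (a sketch; it assumes a `REDIS_ADDRESS` environment variable defined in the Ollama Pod template):
```yaml
  triggers:
  - type: redis
    metadata:
      addressFromEnv: REDIS_ADDRESS
      listName: cities
      listLength: "10"
```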
---
## Trying it out
- Apply the ScaledObject manifest
- Start a Bento pipeline loading e.g. 100-1000 cities in Redis
(100 on smaller clusters / slower CPUs, 1000 on bigger / faster ones)
- Check pod and node resource usage
- What do we see?
--
🤩 The Deployment scaled up automatically!
--
🤔 But Pod resource usage remains very low (A few busy pods, many idle)
--
💡 Bento doesn't submit enough requests in parallel!
---
## Improving throughput
We're going to review multiple techniques:
1. Increase parallelism inside the Bento pipeline.
2. Run multiple Bento consumers.
3. Couple consumers and processors more tightly.
---
## 1⃣ Increase pipeline parallelism
- Set `parallel` to `true` in the `http` processor
- Wrap the input around a `batched` input
(otherwise, we don't have enough messages in flight)
- Increase `http` timeout significantly (e.g. to 5 minutes)
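Here is a sketch of those changes, assuming the `redis_list` input and the enrichment workflow from earlier (batch sizes and timeouts are arbitrary values):
```yaml
input:
  batched:
    child:
      redis_list:
        url: redis://redis:6379
        key: cities
    policy:
      count: 50     # keep up to 50 messages in flight per batch
      period: 10s   # ...or whatever arrived during the last 10 seconds
pipeline:
  processors:
    - branch:
        # request_map unchanged...
        processors:
          - http:
              url: http://ollama:11434/api/generate
              verb: POST
              parallel: true   # send the requests of a batch concurrently
              timeout: 5m      # completions can take a while under load
        # result_map unchanged...
```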
---
## Results
🎉 More messages flow through the pipeline
🎉 Many requests happen in parallel
🤔 Average Pod and Node CPU utilization is higher, but not maxed out
🤔 HTTP queue size (measured with HAProxy metrics) is relatively high
🤔 Latency is higher too
Why?
---
## Too many requests in parallel
- Earlier, we didn't have enough...
- ...Now, we have too many!
- However, for a very big request queue, it still wouldn't be enough
💡 We currently have a fixed parallelism. We need to make it dynamic!
---
## 2⃣ Run multiple Bento consumers
- Restore the original Bento configuration
(flip `parallel` back to `false`; remove the `batched` input)
- Run Bento in a Deployment
(e.g. with the [Bento Helm chart][bento-helm-chart])
- Autoscale that Deployment like we autoscaled the Ollama Deployment
[bento-helm-chart]: https://github.com/warpstreamlabs/bento-helm-chart
---
## Results
🤔🤔🤔 Pretty much the same as before!
(High throughput, high utilization but not maxed out, high latency...)
--
🤔🤔🤔 Why?
---
## Unbalanced load balancing
- All our requests go through the `ollama` Service
- We're still using the default Kubernetes service proxy!
- It doesn't spread the requests properly across all the backends
---
## 3⃣ Couple consumers and processors
What if:
--
instead of sending requests to a load balancer,
--
each queue consumer had its own Ollama instance?
---
## Current architecture
<pre class="mermaid">
flowchart LR
subgraph P1["Pod"]
H1["HAProxy"] --> O1["Ollama"]
end
subgraph P2["Pod"]
H2["HAProxy"] --> O2["Ollama"]
end
subgraph P3["Pod"]
H3["HAProxy"] --> O3["Ollama"]
end
Q["Queue<br/>(Redis)"] <--> C["Consumer<br/>(Bento)"] --> LB["Load Balancer<br/>(kube-proxy)"]
LB --> H1 & H2 & H3
</pre>
---
## Proposed architecture
<pre class="mermaid">
flowchart LR
subgraph P1["Consumer Pod"]
C1["Bento"] --> H1["HAProxy"] --> O1["Ollama"]
end
subgraph P2["Consumer Pod"]
C2["Bento"] --> H2["HAProxy"] --> O2["Ollama"]
end
subgraph P3["Consumer Pod"]
C3["Bento"] --> H3["HAProxy"] --> O3["Ollama"]
end
Queue["Queue"] <--> C1 & C2 & C3
</pre>
---
## 🏗️ Let's build something!
- Let's implement that architecture!
- See next slides for hints / getting started
---
## Hints
We need to:
- Update the Bento consumer configuration to talk to localhost
- Store that configuration in a ConfigMap
- Add a Bento container to the Ollama Deployment
- Profit!
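Putting these hints together, the extra pieces of the Ollama Deployment could look roughly like this (a sketch; image, ConfigMap name, and paths are illustrative):
```yaml
spec:
  template:
    spec:
      containers:
        # ...the existing ollama and haproxy containers stay as they are...
        - name: bento
          image: ghcr.io/warpstreamlabs/bento
          args: [ "-c", "/bento/consumer.yaml" ]
          volumeMounts:
            - name: bento-config
              mountPath: /bento
      volumes:
        - name: bento-config
          configMap:
            name: bento-consumer   # holds consumer.yaml, pointing at localhost
```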
---
## Results
🎉 Node and Pod utilization is maximized
🎉 HTTP queue size is bounded
🎉 Deployment autoscales up and down
---
## ⚠️ Scaling down
- Eventually, there are fewer messages in the queue
- The HPA scales down the Ollama Deployment
- This terminates some Ollama Pods
🤔 What happens if these Pods were processing requests?
--
- The requests might be lost!
---
## Avoiding lost messages
Option 1:
- cleanly shut down the consumer
- make sure that Ollama can complete in-flight requests
(by extending its grace period)
- find a way to terminate Ollama when no more requests are in flight
Option 2:
- use *message acknowledgement*
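For the "extend its grace period" part of option 1, a minimal sketch (the value is illustrative):
```yaml
spec:
  template:
    spec:
      # Give in-flight completions up to 10 minutes after the Pod is asked
      # to terminate (the default grace period is 30 seconds).
      terminationGracePeriodSeconds: 600
```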

slides/k8s/bento-intro.md Normal file

@@ -0,0 +1,628 @@
# Getting started with Bento
How can we move to a message queue architecture...
*...without rewriting a bunch of code?*
🤔
---
## Bento
https://bento.dev/
"Fancy stream processing made operationally mundane"
"Written in Go, deployed as a static binary, declarative configuration. Open source and cloud native as utter heck."
With ✨ amazing ✨ documentation 😍
---
class: extra-details
## Tiny bit of history
- Original project: Benthos
- May 30, 2024: [Redpanda acquires Benthos][redpanda-acquires-benthos]
- Benthos is now Redpanda Connect
- some parts have been relicensed as commercial products
- May 31, 2024: [Warpstream forks Benthos][warpstream-forks-benthos]
- that fork is named "Bento"
- it's fully open source
- We're going to use Bento here, but Redpanda Connect should work fine too!
[redpanda-acquires-benthos]: https://www.redpanda.com/press/redpanda-acquires-benthos
[warpstream-forks-benthos]: https://www.warpstream.com/blog/announcing-bento-the-open-source-fork-of-the-project-formerly-known-as-benthos
---
## Bento concepts
- Message stream processor
- Each pipeline is configured by a YAML configuration that defines:
- input (where do we get the messages?)
- pipeline (optional: how do we transform the messages?)
- output (where do we put the messages afterwards?)
- Once Bento is started, it runs the pipelines forever
(except for pipelines that have a logical end, e.g. reading from a file)
- Embedded language (Bloblang) to manipulate/transform messages
---
## Messages
- Typically JSON objects
(but raw strings are also possible)
- Nesting, arrays, etc. are OK
---
## Getting started with Bento
We're going to:
1. Import a bunch of cities from a CSV file into a Redis queue.
2. Read back these cities using a web server.
3. Use an "enrichment workflow" to query our LLM for each city.
---
## 1⃣ Importing cities
Let's break down the work:
- download the data set
- create the Bento configuration
- deploy Redis
- start Bento
---
## Downloading the data set
- Example database:
https://www.kaggle.com/datasets/juanmah/world-cities
- Let's download and uncompress the data set:
```bash
curl -fsSL https://www.kaggle.com/api/v1/datasets/download/juanmah/world-cities |
funzip > cities.csv
```
(Ignore the "length error", it's harmless!)
- Check the structure of the data set:
```bash
head cities.csv
```
---
## Creating the Bento configuration
- We need to find which `input` and `output` to use
- Check the list with `bento list` or the [documentation]
- Then run `bento create INPUTNAME/PROCESSORNAME/OUTPUTNAME`
- Generate a configuration file:
```bash
bento create csv//redis_list > csv2redis.yaml
```
- Edit that configuration file; look for the `(required)` parameters
(Everything else can go away!)
[documentation]: https://warpstreamlabs.github.io/bento/docs/components/inputs/about/
---
## Resulting configuration
If we trim all the default values, here is the result:
```yaml
input:
csv:
paths: ["cities.csv"]
output:
redis_list:
url: redis://redis:6379 # No default (required)
key: cities
```
We'll save that configuration as `csv2redis.yaml`.
---
## Deploying Redis
- Create a Deployment:
```bash
kubectl create deployment redis --image redis
```
- Expose it:
```bash
kubectl expose deployment redis --port 6379
```
---
## Starting Bento
Option 1: run it manually in a pod, to see what's going on.
```bash
bento --config csv2redis.yaml
```
Option 2: run it with e.g. the Bento Helm chart.
*We're not going to do that yet, since this particular pipeline has a logical end.*
*(The Helm chart is best suited to pipelines that run forever.)*
---
## Expected output
.small[
```
INFO Running main config from specified file @service=bento bento_version="" path=csv2redis.yaml
INFO Launching a Bento instance, use CTRL+C to close @service=bento
INFO Listening for HTTP requests at: http://0.0.0.0:4195 @service=bento
INFO Input type csv is now active @service=bento label="" path=root.input
INFO Output type redis_list is now active @service=bento label="" path=root.output
INFO Pipeline has terminated. Shutting down the service @service=bento
```
]
The pipeline should complete in just a few seconds.
---
## Checking what's in Redis
- Connect to our Redis instance:
```bash
redis-cli -h redis
```
- List keys:
```redis
KEYS *
```
- Check that the `cities` list has approx. 47000 elements:
```redis
LLEN cities
```
- Get the first element of the list:
```redis
LINDEX cities 0
```
---
## Fun with Bloblang
- Let's add a filter to keep only cities with a population above 10,000,000
- Add the following block to the Bento configuration:
```yaml
pipeline:
processors:
- switch:
- check: this.population == ""
processors:
- mapping: root = deleted()
- check: this.population.int64() < 10000000
processors:
- mapping: root = deleted()
```
(See the [docs][switch-docs] for details about the `switch` processor.)
[switch-docs]: https://warpstreamlabs.github.io/bento/docs/components/processors/switch/
---
## Testing our processor
- First, delete the existing `cities` list:
```bash
redis-cli -h redis DEL cities
```
- Then, run the Bento pipeline again:
```bash
bento --config csv2redis.yaml
```
(It should complain about a few cities where the population has a decimal point.)
- Check how many cities were loaded:
```bash
redis-cli -h redis LLEN cities
```
(There should be 47.)
---
## 2⃣ Consume the queue over HTTP
- We want to "get the next city" in the queue with a simple `curl`
- Our input will be `redis_list`
- Our output will be `http_server`
---
## Generate the Bento configuration
Option 1: `bento create redis_list//http_server`
Option 2: [read the docs][output-http-server]
[output-http-server]: https://warpstreamlabs.github.io/bento/docs/components/outputs/http_server
---
## 🙋 Choose your adventure
Do you want to try to write that configuration?
Or shall we see it right away?
--
⚠️ Spoilers on next slide!
---
## `redis2http.yaml`
```yaml
input:
redis_list:
url: redis://redis:6379
key: cities
output:
http_server:
path: /nextcity
```
This will set up an HTTP route to fetch *one* city.
It's also possible to batch, stream...
---
## Trying it out
- Run Bento with that configuration:
```bash
bento --config redis2http.yaml &
```
- Retrieve one city:
```bash
curl http://localhost:4195/nextcity
```
- Check what happens after we retrieve *all* the cities!
---
## 3⃣ Query our LLM for each city
- We want to ask our LLM who's the mayor of each of these cities
- We'll use a prompt that will usually ensure a short answer
(so that it's faster; we don't want to wait 30 seconds per city!)
- We'll test the prompt with the Ollama CLI
- Then we'll craft a proper HTTP API query
- Finally, we'll configure an [enrichment workflow][enrichment] in Bento
[enrichment]: https://warpstreamlabs.github.io/bento/cookbooks/enrichments/
---
## Test our prompt
Assuming that our earlier Ollama Deployment is still running:
```bash
kubectl exec deployment/ollama -- \
ollama run qwen2:1.5b "
Who is the mayor of San Francisco?
Just give the name by itself on a single line.
If you don't know, don't say anything.
"
```
---
## Turn the prompt into an HTTP API query
Note: to install `http` in an Alpine container, run `apk add httpie`.
```bash
http http://ollama.default:11434/api/generate \
model=qwen2:1.5b stream:=false prompt="
Who is the mayor of Paris?
Just give the name by itself on a single line.
If you don't know, don't say anything.
"
```
We get a JSON payload, and we want to use the `response` field.
---
## Configure an enrichment workflow
The [documentation][enrichment] is really good!
We need to set up:
- a `branch` processor
- a `request_map` to transform the city into an Ollama request
- an `http` processor to submit the request to Ollama
- a `result_map` to transform the Ollama response
[enrichment]: https://warpstreamlabs.github.io/bento/cookbooks/enrichments/
---
## Without the `branch` processor
<pre class="mermaid">
flowchart LR
CITY["
city: Paris
country: France
population: 1106000
iso2: FR
...
"]
REQ["
model: qwen2:1.5b
stream: false
prompt: Who is the mayor of Paris?
"]
REP["
response: Anne Hidalgo
eval_count: ...
prompt_eval_count: ...
(other ollama fields)
"]
CITY@{ shape: card}
REQ@{ shape: card}
REP@{ shape: card}
style CITY text-align: left
style REQ text-align: left
style REP text-align: left
mapping@{ shape: diam }
http["http processor"]@{ shape: diam }
CITY --> mapping --> REQ --> http --> REP
</pre>
- We transform the `city` into an Ollama request
- The `http` processor submits the request to Ollama
- The final output is the Ollama response
---
## With the `branch` processor
<pre class="mermaid">
flowchart LR
CITY["
city: Paris
country: France
population: 1106000
iso2: FR
...
"]
REQ["
model: qwen2:1.5b
stream: false
prompt: Who is the mayor of Paris?
"]
REP["
response: Anne Hidalgo
eval_count: ...
prompt_eval_count: ...
(other ollama fields)
"]
OUT["
city: Paris
country: France
population: 1106000
iso2: FR
...
mayor: Anne Hidalgo
"]
CITY@{ shape: card}
REQ@{ shape: card}
REP@{ shape: card}
OUT@{ shape: card}
style CITY text-align: left
style REQ text-align: left
style REP text-align: left
style OUT text-align: left
branch@{ shape: diam }
request_map@{ shape: diam }
result_map@{ shape: diam }
http["http processor"]@{ shape: diam }
CITY --> branch
branch --> result_map
branch --> request_map
request_map --> REQ
REQ --> http
http --> REP
REP --> result_map
result_map --> OUT
</pre>
- The `branch` processor allows to do the processing "on the side"
- `request_map` and `result_map` transform the message before/after processing
- Then, the result is combined with the original message (the `city`)
---
```yaml
input:
csv:
paths: ["cities.csv"]
pipeline:
processors:
- branch:
request_map: |
root.model = "qwen2:1.5b"
root.stream = false
root.prompt = (
"Who is the mayor of %s? ".format(this.city) +
"Just give the name by itself on a single line. " +
"If you don't know, don't say anything."
)
processors:
- http:
url: http://ollama:11434/api/generate
verb: POST
result_map: |
root.mayor = this.response
```
---
## Trying it out
- Save the YAML on the previous page into a configuration file
- Run Bento with that configuration file
- What happens?
--
🤔 We're seeing errors due to timeouts
```
ERRO HTTP request to 'http://ollama...' failed: http://ollama...:
Post "http://ollama...": context deadline exceeded
(Client.Timeout exceeded while awaiting headers)
```
---
## 🙋 Choose your adventure
How should we address errors?
- Option 1: increase the timeout in the [http][doc-http] processor
- Option 2: use a [retry][doc-retry] processor in the pipeline
- Option 3: use a [reject_errored][doc-reject] output
[doc-http]: https://warpstreamlabs.github.io/bento/docs/components/processors/http/
[doc-retry]: https://warpstreamlabs.github.io/bento/docs/components/processors/retry
[doc-reject]: https://warpstreamlabs.github.io/bento/docs/components/outputs/reject_errored
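For reference, here is a sketch combining options 1 and 3 (the output destination is an assumption; adjust it to wherever you store your results):
```yaml
pipeline:
  processors:
    - branch:
        # request_map unchanged...
        processors:
          - http:
              url: http://ollama:11434/api/generate
              verb: POST
              timeout: 5m   # option 1: allow slow completions
        # result_map unchanged...
output:
  # option 3: messages that still carry an error are rejected (nacked)
  # instead of being written to the output
  reject_errored:
    redis_list:
      url: redis://redis:6379
      key: mayors
```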
---
## 🏗️ Let's build something!
- We want to process 1000 cities with our LLM
(guessing who the mayor is, or something similar)
- Store the output wherever we want
(Redis, CSV file, JSONL files...)
- Deal correctly with errors
(we'll check that there are, indeed, 1000 cities in the output)
- Scale out to process faster
(scale ollama to e.g. 10 replicas, enable parallelism in Bento)
---
class: title
🍱 Lunch time! 🍱
---
## What happened?
- If your Ollama pods have *resource requests*:
→ your cluster may have auto-scaled
- If your Ollama pods don't have *resource requests*:
→ you probably have a bunch of container restarts, due to out-of-memory errors
🤔 What's that about?

slides/k8s/bento-rmq.md Normal file

@@ -0,0 +1,250 @@
# Bento & RabbitMQ
- In some of the previous runs, messages were dropped
(we start with 1000 messages in `cities` and have e.g. 955 in `mayors`)
- This is caused by various errors during processing
(e.g. too many timeouts; Bento being shutdown halfway through...)
- ...And by the fact that we are using a Redis queue
(which doesn't offer delivery guarantees or acknowledgements)
- Can we get something better?
---
## The problem
- Some inputs (like `redis_list`) don't support *acknowledgements*
- When a message is pulled from the queue, it is deleted immediately
- If the message is lost for any reason, it is lost permanently
---
## The solution
- Some inputs (like `amqp_0_9`) support acknowledgements
- When a message is pulled from the queue:
- it is not visible anymore to other consumers
- it needs to be explicitly acknowledged
- The acknowledgement is done by Bento when the message reaches the output
- The acknowledgement deletes the message
- No acknowledgement after a while? Consumer crashes/disconnects?
Message gets requeued automatically!
---
## `amqp_0_9`
- Protocol used by RabbitMQ
- Very simplified behavior:
- messages are published to an [*exchange*][amqp-exchanges]
- messages have a *routing key*
- the exchange routes the message to one (or zero or more) queues
<br/>(possibly using the routing key or message headers to decide which queue(s))
- [*consumers*][amqp-consumers] subscribe to queues to receive messages
[amqp-exchanges]: https://www.rabbitmq.com/tutorials/amqp-concepts#exchanges
[amqp-consumers]: https://www.rabbitmq.com/tutorials/amqp-concepts#consumers
---
## Using the default exchange
- There is a default exchange (called `""` - empty string)
- The routing key indicates the name of the queue to deliver to
- The queue needs to exist (we need to create it beforehand)
---
class: extra-details
## Defining custom exchanges
- Create an exchange
- exchange types: direct, fanout, topic, headers
- durability: persisted to disk to survive server restart or not?
- Create a binding
- which exchange?
- which routing key? (for direct exchanges)
- which queue?
---
## RabbitMQ on Kubernetes
- RabbitMQ can be deployed on Kubernetes:
- directly (creating e.g. a StatefulSet)
- with the RabbitMQ operator
- We're going to do the latter!
- The operator includes the "topology operator"
(to configure queues, exchanges, and bindings through custom resources)
---
## Installing the RabbitMQ operator
- Let's install it with this Helm chart:
```bash
helm upgrade --install --repo https://charts.bitnami.com/bitnami \
--namespace rabbitmq-system --create-namespace \
rabbitmq-cluster-operator rabbitmq-cluster-operator
```
---
## Deploying a simple RabbitMQ cluster
- Let's use the YAML manifests in that directory:
https://github.com/jpetazzo/beyond-load-balancers/tree/main/rabbitmq
- This creates:
- a `RabbitmqCluster` called `mq`
- a `Secret` called `mq-default-user` containing access credentials
- a durable `Queue` named `q1`
(We can ignore the `Exchange` and the `Binding`, we won't use them.)
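With the topology operator, declaring a durable queue like `q1` looks roughly like this (a sketch):
```yaml
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  name: q1        # name of the queue in RabbitMQ
  durable: true   # survive broker restarts
  rabbitmqClusterReference:
    name: mq      # the RabbitmqCluster created above
```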
---
## 🏗️ Let's build something!
Let's replace the `cities` Redis list with our RabbitMQ queue.
(See next slide for steps and hints!)
---
## Steps
1. Edit the Bento configuration for our "CSV importer".
(replace the `redis_list` output with `amqp_0_9`)
2. Run that pipeline and confirm that messages show up in RabbitMQ.
3. Edit the Bento configuration for the Ollama consumer.
(replace the `redis_list` input with `amqp_0_9`)
4. Trigger a scale up of the Ollama consumer.
5. Update the KEDA Scaler to use RabbitMQ instead of Redis.
---
## 1⃣ Sending messages to RabbitMQ
- Edit our Bento configuration (the one feeding the CSV file to Redis)
- We want the following `output` section:
```yaml
output:
amqp_0_9:
exchange: ""
key: q1
mandatory: true
urls:
- "${AMQP_URL}"
```
- Then set the `AMQP_URL` environment variable, using the `connection_string` key of the Secret `mq-default-user`
💡 Yes, we can directly use environment variables in Bento configuration!
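If Bento runs in a Pod, one way to set that variable is a `valueFrom` reference in the container spec (a sketch):
```yaml
env:
  - name: AMQP_URL
    valueFrom:
      secretKeyRef:
        name: mq-default-user
        key: connection_string
```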
---
## 2⃣ Testing our AMQP output
- Run the Bento pipeline
- To check that our messages made it:
```bash
kubectl exec mq-server-0 -- rabbitmqctl list_queues
```
- We can also use Prometheus metrics, e.g. `rabbitmq_queue_messages`
---
## 3⃣ Receiving messages from RabbitMQ
- Edit our other Bento configuration (the one in the Ollama consumer Pod)
- We want the following `input` section:
```yaml
input:
amqp_0_9:
urls:
- `amqp://...:5672/`
queue: q1
```
---
## 4⃣ Triggering Ollama scale up
- If the autoscaler is configured to scale to zero, disable it
(easiest solution: delete the ScaledObject)
- Then manually scale the Deployment to e.g. 4 Pods
- Check that messages are processed and show up in the output
(it should still be a Redis list at this point)
---
## 5⃣ Autoscaling on RabbitMQ
- We need to update our ScaledObject
- Check the [RabbitMQ Queue Scaler][keda-rabbitmq]
- Multiple ways to pass the AMQP URL:
- hardcode it (easier solution for testing!)
- use `...fromEnv` and set environment variables in target pod
- create and use a TriggerAuthentication
💡 Since we have the AMQP URL in a Secret, TriggerAuthentication works great!
[keda-rabbitmq]: https://keda.sh/docs/latest/scalers/rabbitmq-queue/
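A sketch of the TriggerAuthentication route (resource names are illustrative):
```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: mq-auth
spec:
  secretTargetRef:
    - parameter: host
      name: mq-default-user
      key: connection_string
```
The ScaledObject's trigger would then reference it:
```yaml
    - type: rabbitmq
      metadata:
        queueName: q1
        mode: QueueLength
        value: "10"
      authenticationRef:
        name: mq-auth
```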

slides/k8s/handson-mlops.md Normal file

@@ -0,0 +1,132 @@
class: title
*Tell me and I forget.*
<br/>
*Teach me and I remember.*
<br/>
*Involve me and I learn.*
Misattributed to Benjamin Franklin
[(Probably inspired by Chinese Confucian philosopher Xunzi)](https://www.barrypopik.com/index.php/new_york_city/entry/tell_me_and_i_forget_teach_me_and_i_may_remember_involve_me_and_i_will_lear/)
---
## Hands-on sections
- There will be *a lot* of examples and demos
- If you are attending a live workshop:
- follow along the demos, ask questions at any time
- if you can, try to run some of the examples and demos in your environment
- if things are going too fast, ask the trainer to slow down :)
- If you are watching a recording or only reading the slides:
- it is **strongly** recommended to run **all** the examples and demos
- take advantage of the fact that you can pause at any time
---
class: in-person
## Where are we going to run our containers?
---
class: in-person, pic
![You get a cluster](images/you-get-a-cluster.jpg)
---
## If you're attending a live training or workshop
- Each person gets a private lab environment
- Your lab environments will be available for the duration of the workshop
(check with your instructor to know exactly when they'll be shut down)
- Note that for budget reasons¹, your environment will be fairly modest
- scenario 1: 4 nodes with 2 cores and 4 GB RAM ; no cluster autoscaling
- scenario 2: 1 node with 4 cores and 8 GB RAM ; cluster autoscaling
.footnote[¹That cloud thing is mighty expensive, yo]
---
## Running your own lab environment
- If you are following a self-paced course...
- Or watching a replay of a recorded course...
- ...You will need to set up a local environment for the labs
*or*
- If you want to use a specific cloud provider...
- Or want to see these concepts "at scale"...
- ...You can set up your own clusters with whatever capacity suits you
---
## Deploying your own Kubernetes cluster
- You need cloud provider credentials for this
- Option 1: use the cloud provider CLI, web UI, ...
- Option 2: use [one of these Terraform configurations][one-kubernetes]
(set `cluster_name`, `node_size`, `max_nodes_per_pool`, `location`, and GO!)
[one-kubernetes]: https://github.com/jpetazzo/container.training/tree/main/prepare-labs/terraform/one-kubernetes
---
## Deploying your own Kubernetes cluster.red[**s**]
- If you want to deliver your own training or workshop:
- deployment scripts are available in the [prepare-labs] directory
- you can use them to automatically deploy many lab environments
- they support many different infrastructure providers
- they can deploy dozens (even hundreds) of clusters at a time
[prepare-labs]: https://github.com/jpetazzo/container.training/tree/main/prepare-labs
---
class: in-person
## Why don't we run containers locally?
- Installing this stuff can be hard on some machines
(32-bit CPU or OS... Laptops without administrator access... etc.)
- *"The whole team downloaded all these container images from the WiFi!
<br/>... and it went great!"* (Literally no-one ever)
- All you need is a computer (or even a phone or tablet!), with:
- an Internet connection
- a web browser
- an SSH client
- Some of the demos require multiple nodes to demonstrate scaling

slides/k8s/helmfile.md Normal file

@@ -0,0 +1,165 @@
# Managing our stack with `helmfile`
- We've installed a few things with Helm
- And others with raw YAML manifests
- Perhaps you've used Kustomize sometimes
- How can we automate all this? Make it reproducible?
---
## Requirements
- We want something that is *idempotent*
= running it 1, 2, 3 times, should only install the stack once
- We want something that handles updates
= modifying / reconfiguring without restarting from scratch
- We want something that is configurable
= with e.g. configuration files, environment variables...
- We want something that can handle *partial removals*
= ability to remove one element without affecting the rest
- Inspiration: Terraform, Docker Compose...
---
## Shell scripts?
✅ Idempotent, thanks to `kubectl apply -f`, `helm upgrade --install`
✅ Handles updates (edit script, re-run)
✅ Configurable
❌ Partial removals
If we remove an element from our script, it won't be uninstalled automatically.
---
## Umbrella chart?
Helm chart with dependencies on other charts.
✅ Idempotent
✅ Handles updates
✅ Configurable (with Helm values: YAML files and `--set`)
✅ Partial removals
❌ Complex (requires to learn advanced Helm features)
❌ Requires everything to be a Helm chart (adds (lots of) boilerplate)
---
## Helmfile
https://github.com/helmfile/helmfile
✅ Idempotent
✅ Handles updates
✅ Configurable (with values files, environment variables, and more)
✅ Partial removals
✅ Fairly easy to get started
🐙 Sometimes feels like summoning unspeakable powers / staring down the abyss
---
## What `helmfile` can install
- Helm charts from remote Helm repositories
- Helm charts from remote git repositories
- Helm charts from local directories
- Kustomizations
- Directories with raw YAML manifests
---
## How `helmfile` works
- Everything is defined in a main `helmfile.yaml`
- That file defines:
- `repositories` (remote Helm repositories)
- `releases` (things to install: Charts, YAML...)
- `environments` (optional: to specialize prod vs staging vs ...)
- Helm-style values files can be loaded in `environments`
- These values can then be used in the rest of the Helmfile
- Examples: [install essentials on a cluster][helmfile-ex-1], [run a Bento stack][helmfile-ex-2]
[helmfile-ex-1]: https://github.com/jpetazzo/beyond-load-balancers/blob/main/helmfile.yaml
[helmfile-ex-2]: https://github.com/jpetazzo/beyond-load-balancers/blob/main/bento/helmfile.yaml
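A minimal `helmfile.yaml` could look like this (a sketch; names, namespaces, and paths are illustrative):
```yaml
repositories:
  - name: kedacore
    url: https://kedacore.github.io/charts

releases:
  - name: keda
    namespace: keda-system
    chart: kedacore/keda
  - name: bento-pipeline
    namespace: default
    chart: ./bento          # local directory (chart or raw manifests)
    values:
      - bento-values.yaml
```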
---
## `helmfile` commands
- `helmfile init` (optional; downloads plugins if needed)
- `helmfile apply` (updates all releases that have changed)
- `helmfile sync` (updates all releases even if they haven't changed)
- `helmfile destroy` (guess!)
---
## Helmfile tips
As seen in [this example](https://github.com/jpetazzo/beyond-load-balancers/blob/main/bento/helmfile.yaml#L21):
- variables can be used to simplify the file
- configuration values and secrets can be loaded from external sources
(Kubernetes Secrets, Vault... See [vals] for details)
- current namespace isn't exposed by default
- there's often more than one way to do it!
(this particular section could be improved by using Bento `${...}`)
[vals]: https://github.com/helmfile/vals
---
## 🏗️ Let's build something!
- Write a helmfile (or two) to set up today's entire stack on a brand new cluster!
- Suggestion:
- one helmfile for singleton, cluster-wide components
<br/>
(All our operators: Prometheus, Grafana, KEDA, CNPG, RabbitMQ Operator)
- one helmfile for the application stack
<br/>
(Bento, PostgreSQL cluster, RabbitMQ)


@@ -0,0 +1,53 @@
## What we will / won't cover
- Kubernetes provides low-level building blocks (pods, deployments, services...)
- There are many high-level frameworks out there for serverless, AI...:
[Knative](https://knative.dev/docs/),
[KubeAI](https://www.kubeai.org/),
[Kueue](https://kueue.sigs.k8s.io/)...
- We're going to sit somewhere in the middle:
reimplement some of the features of these high-level frameworks, in a flexible way
- This workshop will (hopefully!) give you a better eye to evaluate these frameworks, too
- We won't showcase GPUs today for budget reasons
(giving everyone a few GPU nodes would be prohibitive, sorry!)
---
## A word about our demo app
- We'll use Ollama with a relatively small LLM
(qwen2:1.5b)
- We'll use it to generate very short completions
(a few seconds of CPU)
- All the challenges that we will address are also visible on longer requests
(in fact, they are even more visible on longer requests!)
- We're sticking to short requests to save time and cover a lot of ground today
(but feel free to use more expensive prompts if you'd like!)
---
## Tiny bit of backstory...
The original prompt that we used when building the first version of this content was:
```
If you go to {city}, I suggest that you
```
This would typically take 10-30 seconds - and with much bigger Kubernetes nodes.
Today, we suggest using a prompt that generates shorter answers!

slides/k8s/ollama-intro.md Normal file

@@ -0,0 +1,321 @@
# Ollama in a nutshell
https://ollama.dev
"Get up and running with large language models"
"Docker, but for LLMs"
- Server to host (run) LLMs
- Controlled with CLI or API
- Download a model with `ollama pull`
- Run inference with `ollama run`
---
## Quick demo
⚠️ **Important note 1:** the commands in this section aren't meant
to be executed on your Kubernetes clusters. They are meant to
be executed on a local machine, and they assume that Ollama is
installed and running. If you don't have Ollama on your local
machine, it's OK to skip these demos!
⚠️ **Important note 2:** the models used by Ollama are fairly big
(1.5 GB for the one used here; up to 10s or 100s of GB for bigger
models). We do not recommend downloading them on conference WiFi.
Assuming Ollama is installed and running:
```
ollama run qwen2:1.5b "What's the solution to global warming?"
```
We're going to use that model because it's relatively small.
Many others are available (see https://ollama.dev/search).
---
## Other useful commands
- Start an interactive chat session:
```bash
ollama run qwen2:1.5b
```
- Pull a model (or check for updates):
```bash
ollama pull qwen2:1.5b
```
- See information on a model:
```bash
ollama show qwen2:1.5b
```
---
## Models on disk, in memory
- See models available on disk:
```bash
ollama list
```
- See models loaded in memory:
```bash
ollama ps
```
- Unload a model:
```bash
ollama stop qwen2:1.5b
```
Models are automatically unloaded after 5 minutes (by default).
Ollama loads models in RAM, and in VRAM if it detects a supported GPU.
---
# Ollama on Kubernetes
Let's run Ollama on our Kubernetes cluster!
- Option 1: `kubectl run`
- Option 2: create a Deployment and a Service
- Option 3: use a Helm chart
---
## 1⃣ `kubectl run`
Note: the `ollama/ollama` image is quite big (~2 GB transfer, ~4 GB on disk).
```bash
kubectl run ollama --image ollama/ollama
```
Wait for the pod to be up and running:
```bash
kubectl wait pod ollama --for=condition=Ready
```
(If that command times out, try again and/or specify a higher timeout.)
```bash
kubectl exec ollama -- ollama run qwen2:1.5b "What's Bach's best piece?"
```
Shutdown the pod:
```bash
kubectl delete pod ollama
```
---
## 2⃣ Deployment + Service
Create the Deployment:
```bash
kubectl create deployment ollama --image ollama/ollama
```
Create the Service:
```bash
kubectl create service clusterip ollama --tcp 11434
```
Wait for the Service Endpoints to be available:
```bash
kubectl wait endpoints ollama --for=jsonpath={..ip}
```
---
## By the way... Why port 11434?
| 1 | 1 | 4 | 3 | 4 |
|---|---|---|---|---|
| L | L | A | M | A |
---
## Connecting to the Service
Let's use the `/api/generate` endpoint:
```bash
kubectl run httpclient --rm -it --image alpine/httpie -- --ignore-stdin \
http://ollama:11434/api/generate \
model=qwen2:1.5b prompt="Write a limerick about Kubernetes"
```
(See [Ollama API docs](https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion) for details.)
--
🤔 We get an error: the model needs to be downloaded first.
💡 When we used the `ollama run` CLI command earlier, it did it automatically for us.
---
## Pulling the model
Method 1:
```bash
kubectl exec deployment/ollama -- ollama pull qwen2:1.5b
```
Method 2:
```bash
kubectl run httpclient --rm -it --image alpine/httpie -- --ignore-stdin \
http://ollama:11434/api/pull \
name=qwen2:1.5b
```
---
## Houston, we (are going to) have a problem...
- This works when there is only one pod
- What happens if we scale up the Deployment?
- We need to pull the model on every pod
- How should we do that?
---
## Potential solutions
- Bake the model in the image
🙅 Personal opinion: this is a bad idea (image size, maintenance...)
- Directly send a "pull" command to each pod, individually
🙁 Hackish, not great
- Use a Kubernetes lifecycle hook
💡 That works!
- Use a sidecar container to pull the model
🤔 Doable, but more work than the lifecycle hook
---
## 🙋 Choose your adventure
Should we add that lifecycle hook?
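If we do, a minimal sketch could look like this, added to the Ollama container in the Deployment (the wait loop is an assumption, since the hook runs while the server is still starting up):
```yaml
      containers:
        - name: ollama
          image: ollama/ollama
          lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    # Wait for the Ollama API to come up, then pull the model.
                    until ollama list; do sleep 1; done
                    ollama pull qwen2:1.5b
```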
---
## 3⃣ Helm chart
- Let's check [ArtifactHub] for an Ollama Helm chart
- The most popular (as of November 2024) is [this one, by OTWLD][ollama-chart]
- ~~It has pockets~~
- It can pre-pull models! 🎉
[ArtifactHub]: https://artifacthub.io
[ollama-chart]: https://artifacthub.io/packages/helm/ollama-helm/ollama
---
## Installing the Helm chart
Traditional method:
```bash
helm repo add ollama https://otwld.github.io/ollama-helm/
helm install ollama ollama/ollama --set ollama.models={qwen2:1.5b}
```
Idempotent¹, single-command method:
```bash
helm upgrade --install --repo https://otwld.github.io/ollama-helm/ \
ollama ollama --set ollama.models={qwen2:1.5b}
```
.footnote[¹Idempotent: which can be executed multiple times without adverse effect.]
---
## Testing the Helm installation
Just like before:
```bash
kubectl run httpclient --rm -it --image alpine/httpie -- --ignore-stdin \
http://ollama:11434/api/generate \
model=qwen2:1.5b prompt="Write a limerick about YAML" stream:=false
```
And while we're here, check resource usage:
```bash
kubectl exec deployment/ollama -ti -- top
```
There should be two processes:
- `ollama` itself, relatively small (~100 MB)
- the LLM subprocess, relatively big (~1.4 GB for qwen2:1.5b)
---
## Sending some load
We're going to use `hey`:
```bash
kubectl run hey --rm -it --image nixery.dev/hey -- \
hey -c 10 -n 10 -t 60 -m POST \
-d '{"model": "qwen2:1.5b", "prompt": "vi or emacs?"}' \
http://ollama:11434/api/generate
```
Some explanations:
- `nixery.dev` = automatically generates images with [Nixery]
- `-c` = concurrent requests
- `-n` = total number of requests
- `-t` = timeout in seconds
This is probably going to take (literally) a minute.
[Nixery]: https://nixery.dev/
---
## Performance analysis
- Let's start an interactive container with `hey`
(e.g., use the `alpine` image, then `apk add hey`)
- Try 10 requests, with a concurrency of 1/2/4
- Meanwhile, check the logs of the `ollama` pod
- Some results (your results may vary depending on CPU, random seed...):
- 1 = 0.08 reqs/s, average latency: 12s
- 2 = 0.10 reqs/s, average latency: 18s
- 4 = 0.12 reqs/s, average latency: 28s
- Higher concurrency = slightly higher throughput, much higher latency
🤔 We need metrics!


@@ -0,0 +1,273 @@
# Adding metrics
We want multiple kinds of metrics:
- instantaneous pod and node resource usage
- historical resource usage (=graphs)
- request duration
---
## 1⃣ Instantaneous resource usage
- We're going to use metrics-server
- Check if it's already installed:
```bash
kubectl top nodes
```
- If we see a list of nodes, with CPU and RAM usage:
*great, metrics-server is installed!*
- If we see `error: Metrics API not available`:
*metrics-server isn't installed, so we'll install it!*
---
## Installing metrics-server
- In a lot of places, this is done with a little bit of custom YAML
(derived from the [official installation instructions](https://github.com/kubernetes-sigs/metrics-server#installation))
- We can also use a Helm chart:
```bash
helm upgrade --install metrics-server metrics-server \
--create-namespace --namespace metrics-server \
--repo https://kubernetes-sigs.github.io/metrics-server/ \
--set args={--kubelet-insecure-tls=true}
```
- The `args` flag specified above should be sufficient on most clusters
- After a minute, `kubectl top nodes` should show resource usage
---
## 2⃣ Historical resource usage
- We're going to use Prometheus (specifically: kube-prometheus-stack)
- This is a Helm chart bundling:
- Prometheus
- multiple exporters (node, kube-state-metrics...)
- Grafana
- a handful of Grafana dashboards
- Open Source
- Commercial alternatives: Datadog, New Relic...
---
## Installing kube-prometheus-stack
We're going to expose both Prometheus and Grafana with a NodePort:
```bash
helm upgrade --install --repo https://prometheus-community.github.io/helm-charts \
promstack kube-prometheus-stack \
--namespace prom-system --create-namespace \
--set prometheus.service.type=NodePort \
--set grafana.service.type=NodePort \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
#
```
This chart installation can take a while (up to a couple of minutes).
---
class: extra-details
## `...NilUsesHelmValues=false` ???
- kube-prometheus-stack uses the "Prometheus Operator"
- To configure "scrape targets", we create PodMonitor or ServiceMonitor resources
- By default, the Prometheus Operator will only look at \*Monitors with the right labels
- Our extra options mean "use all the Monitors that you will find!"
---
## Connecting to Grafana
Check the NodePort allocated to Grafana:
```bash
kubectl get service promstack-grafana --namespace prom-system
```
Get the public address of one of our nodes:
```bash
kubectl get nodes -o wide
```
Connect to the public address of a node, on the node port.
The default login and password are `admin` / `prom-operator`.
Check the dashboard "Kubernetes / Compute Resources / Namespace (Pods)".
Select a namespace and see the CPU and RAM usage for the pods in that namespace.
---
## 3⃣ Request duration
- Unfortunately, as of November 2024, ollama doesn't expose metrics
(there is ongoing discussion about it: [issue 3144][3144], [PR 6537][6537])
- There are some [garbage AI-generated blog posts claiming otherwise][garbage]
(but it's AI-generated, so it bears no connection to truth whatsoever)
- So, what can we do?
[3144]: https://github.com/ollama/ollama/issues/3144#issuecomment-2153184254
[6537]: https://github.com/ollama/ollama/pull/6537
[garbage]: https://www.arsturn.com/blog/setting-up-ollama-prometheus-metrics
---
## HAProxy to the rescue
- HAProxy is a proxy that can handle TCP, HTTP, and more
- It can expose detailed Prometheus metrics about HTTP requests
- The plan: add a sidecar HAProxy to each Ollama container
- For that, we need to give up on the Ollama Helm chart
(and go back to basic manifests)
---
## 🙋 Choose your adventure
Do we want to...
- write all the corresponding manifests?
- look at pre-written manifests and explain how they work?
- apply the manifests and carry on?
---
## 🏗️ Let's build something!
- If you have created Deployments / Services: clean them up first!
- Deploy Ollama with a sidecar HAProxy (sample configuration on next slide)
- Run a short benchmark campaign
(e.g. scale to 4 pods, try 4/8/16 parallel requests, 2 minutes each)
- Check live resource usage with `kubectl top nodes` / `kubectl top pods`
- Check historical usage with the Grafana dashboards
(for HAProxy metrics, you can use [Grafana dashboard 12693, HAProxy 2 Full][grafana-12693])
- If you don't want to write the manifests, you can use [these ones][ollama-yaml]
[grafana-12693]: https://grafana.com/grafana/dashboards/12693-haproxy-2-full/
[ollama-yaml]: https://github.com/jpetazzo/beyond-load-balancers/tree/main/ollama
---
```
global
#log stdout format raw local0
#daemon
maxconn 32
defaults
#log global
timeout client 1h
timeout connect 1h
timeout server 1h
mode http
`option abortonclose`
frontend metrics
bind :9000
http-request use-service prometheus-exporter
frontend ollama_frontend
bind :8000
default_backend ollama_backend
`maxconn 16`
backend ollama_backend
server ollama_server localhost:11434 check
```
---
class: extra-details
## ⚠️ Connection queues
- HAProxy will happily queue *many* connections
- If a client sends a request, then disconnects:
- the request stays in the queue
- the request gets processed by the backend
- eventually, when the backend starts sending the reply, the connection is closed
- This can result in a backlog of queries that takes a long time to clear
- To avoid that: `option abortonclose` (see [HAProxy docs for details][abortonclose])
- Note that the issue is less severe when replies are streamed
[abortonclose]: https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4-option%20abortonclose
---
class: extra-details
## Ad-hoc HAProxy dashboard
- To consolidate all frontend and backend queues on a single graph:
- query: `haproxy_frontend_current_sessions`
- legend: `{{namespace}}/{{pod}}/{{proxy}}`
- options, "Color scheme", select "Classic palette (by series name)"
---
## What do we see?
- Imperfect load balancing
- Some backends receive more requests than others
- Sometimes, some backends are idle while others are busy
- However, CPU utilization on the node is maxed out
- This is because our node is oversubscribed
- This is because we didn't specify resource requests/limits (yet)
(we'll do that later!)

slides/k8s/ollama-reqlim.md Normal file

@@ -0,0 +1,155 @@
## Setting resource requests and limits
- Thanks to *requests*:
- our pods will have resources *reserved* for them
- we won't pack too many pods on a single node
- cluster autoscaling will trigger when needed (if possible!)
- Thanks to *limits*:
- our pods won't use more than a given amount of resources
- they won't use up all the available resources on the node
- behavior will be more consistent between loaded and unloaded state
---
## Memory
- Personal advice: set request and limit to the same value
- Check current or historical usage and add a bit of padding
(the more historical data we have, the less padding we need)
- Consider 10% padding for "dataless" pods, more for pods with data
(so that the pod has "reserves" for page cache usage)
⚠️ Pods hitting their memory limit will be **killed!**
---
## CPU
- It's not necessary to set requests and limits to the same value
(this would cause a lot of waste for idle workloads)
- Let's see a few possible strategies!
---
## CPU for mostly idle pods
E.g.: web services, workers handling very few requests...
- Set the limit to at least one whole core
(to avoid throttling, especially on bursty workloads)
- Requests can be very low (e.g. 0.1 core)
⚠️ If requests are too low and the node is very loaded,
the pod will slow down significantly!
(Because CPU cycles are allocated proportionally to CPU requests.)
---
## Inelastic CPU-hungry pods
- Pods with a fixed number of threads:
*set requests and limits to that number of threads*
- Pods where a specific level of performance needs to be guaranteed:
*set requests and limits to the number of cores providing that performance*
⚠️ If you set limits to higher levels, performance will be unpredictable!
(You'll get good performance when the node has extra cycles.)
---
## Elastic CPU-hungry pods
- Pods that could potentially use all the cores
(e.g. machine learning training and inference, depending on the models)
- Decide how many pods per node you want to pack
- Set CPU requests as a fraction of the number of cores of the nodes
(minus some padding)
- Example:
- nodes with 32 cores
- we want 4 pods per node
- CPU request: 7.5 cores
- Set limits to a higher level (up to node size)
---
## In practice
- Check memory usage of our Ollama pods:
```bash
kubectl top pods
```
(Or even better, look at historical usage in Prometheus or Grafana!)
- Check how many cores we have on our nodes:
```bash
kubectl get nodes -o json | jq .items[].status.capacity.cpu
kubectl get nodes -o custom-columns=NAME:metadata.name,CPU:status.capacity.cpu
```
- Let's decide that we want two Ollama pods per node
- What requests/limits should we set?
---
## Setting resources for Ollama
- Assumptions:
- we want two pods per node
- each pod uses ~1500MiB RAM
- nodes have 4 cores
- We'll set memory requests and limits to 2G
- We'll set CPU requests to 1.5 (4 cores / 2 pods, minus padding)
- We'll set CPU limits to twice the requests
```bash
kubectl set resources deployment ollama \
--requests=cpu=1.5,memory=2G \
--limits=cpu=3,memory=2G
```
⚠️ If you have an HAProxy side car, this will set its resources too!
---
## Results
- After setting these resource requests, we should see cluster autoscaling
- If not: scale up the Ollama Deployment to at least 3 replicas
- Check cluster autoscaler status with:
```bash
kubectl describe configmap --namespace kube-system cluster-autoscaler-status
```


@@ -0,0 +1,210 @@
# Message Queue Architecture
There are (at least) three ways to distribute load:
- load balancers
- batch jobs
- message queues
Let's do a quick review of their pros/cons!
---
## 1⃣ Load balancers
<pre class="mermaid">
flowchart TD
Client["Client"] ---> LB["Load balancer"]
LB ---> B1["Backend"] & B2["Backend"] & B3["Backend"]
</pre>
---
## Load balancers
- Latency: ~milliseconds (network latency)
- Overhead: very low (one extra network hop, one log message?)
- Great for short requests (a few milliseconds to a minute)
- Supported out of the box by the Kubernetes Service Proxy
(by default, this is `kube-proxy`)
- Suboptimal resource utilization due to imperfect balancing
(especially when there are multiple load balancers)
---
## 2⃣ Batch jobs
<pre class="mermaid">
flowchart TD
subgraph K["Kubernetes Control Plane"]
J1["Job"]@{ shape: card}
J2["Job"]@{ shape: card}
J3["..."]@{ shape: text}
J4["Job"]@{ shape: card}
end
C["Client"] ---> K
K <---> N1["Node"] & N2["Node"] & N3["Node"]
</pre>
---
## Batch jobs
- Latency: a few seconds (many Kubernetes controllers involved)
- Overhead: significant due to all the moving pieces involved
(job controller, scheduler, kubelet; many writes to etcd and logs)
- Great for long requests (a few minutes to a few days)
- Supported out of the box by Kubernetes
(`kubectl create job hello --image alpine -- sleep 60`)
- Asynchronous processing requires some refactoring
(we don't get the response immediately)
---
## 3⃣ Message queues
<pre class="mermaid">
flowchart TD
subgraph Q["Message queue"]
M1["Message"]@{ shape: card}
M2["Message"]@{ shape: card}
M3["..."]@{ shape: text}
M4["Message"]@{ shape: card}
end
C["Client"] ---> Q
Q <---> W1["Worker"] & W2["Worker"] & W3["Worker"]
</pre>
---
## Message queues
- Latency: a few milliseconds to a few seconds
- Overhead: intermediate
(very low with e.g. Redis, higher with e.g. Kafka)
- Great for all except very short requests
- Requires additional setup
- Asynchronous processing requires some refactoring
---
## Dealing with errors
- Load balancers
- errors reported immediately (client must retry)
- some load balancers can retry automatically
- Batch jobs
- Kubernetes retries automatically
- after `backoffLimit` retries, Job is marked as failed
- Message queues
- some queues have a concept of "acknowledgement"
- some queues have a concept of "dead letter queue"
- some extra work is required
---
## Some queue brokers
- Redis (with e.g. RPUSH, BLPOP)
*light, fast, easy to set up... no durability guarantee, no acknowledgement, no dead letter queue*
- Kafka
*heavy, complex to set up... strong delivery guarantees, full featured*
- RabbitMQ
*somewhat in-between Redis and Kafka*
- SQL databases
*often requires polling, which adds extra latency; not as scalable as a "true" broker*
---
## More queue brokers
Many cloud providers offer hosted message queues (e.g.: Amazon SQS).
These are usually great options, with some drawbacks:
- vendor lock-in
- setting up extra environments (testing, staging...) can be more complex
(Setting up a singleton environment is usually very easy, thanks to web UI, CLI, etc.; setting up extra environments and assigning the right permissions with e.g. IaC is usually significantly more complex.)
---
## Implementing a message queue
1. Pick a broker
2. Deploy the broker
3. Set up the queue
4. Refactor our code
---
## Code refactoring (client)
Before:
```python
response = http.POST("http://api", payload=Request(...))
```
After:
```python
client = queue.connect(...)
client.publish(message=Request(...))
```
Note: we don't get the response right away (if at all)!
---
## Code refactoring (server)
Before:
```python
server = http.server(request_handler=handler)
server.listen("80")
server.run()
```
After:
```python
client = queue.connect(...)
while True:
message = client.consume()
response = handler(message)
# Write the response somewhere
```

slides/mlops.yml Normal file

@@ -0,0 +1,44 @@
title: |
Asynchronous Architecture Patterns To Scale ML and Other High Latency Workloads on Kubernetes
#chat: "[Slack](https://dockercommunity.slack.com/messages/C7GKACWDV)"
#chat: "[Gitter](https://gitter.im/jpetazzo/workshop-yyyymmdd-city)"
chat: "In person!"
gitrepo: github.com/jpetazzo/container.training
slides: https://FIXME.container.training/
#slidenumberprefix: "#SomeHashTag &mdash; "
exclude:
- self-paced
content:
- shared/title.md
- logistics.md
- shared/about-slides.md
#- shared/chat-room-im.md
#- shared/chat-room-slack.md
#- shared/chat-room-zoom-meeting.md
#- shared/chat-room-zoom-webinar.md
- k8s/prereqs-advanced.md
- k8s/handson-mlops.md
- shared/connecting.md
- k8s/mlops-headsup.md
- shared/toc.md
-
- k8s/ollama-intro.md
- k8s/ollama-metrics.md
- k8s/queue-architecture.md
- k8s/bento-intro.md
-
- k8s/resource-limits.md
- k8s/cluster-autoscaler.md
- k8s/ollama-reqlim.md
- k8s/bento-hpa.md
- k8s/bento-rmq.md
- k8s/bento-cnpg.md
- k8s/helmfile.md
- shared/thankyou.md
- shared/contact.md

slides/shared/contact.md Normal file

@@ -0,0 +1,54 @@
<table>
<tr>
<td style="vertical-align: sub; background: initial;">
<pre style="padding: 40px; font-size: 16px; line-height: 18px;">
█▀▀▀▀▀█ ▀▀▀█▄▀ ▀▄ ▀▄ ▀▄ ▄█▀ ▄ █▀▀▀▀▀█
█ ███ █ ▀▄█ ▀▀▄█ ▄▀▀ ██▄▄ █ ███ █
█ ▀▀▀ █ ▄▀█▀ █▀▀▀█ ▄█▀▄███ ▄ █ ▀▀▀ █
▀▀▀▀▀▀▀ █▄▀ █▄█ ▀ █ █ ▀▄█▄▀ █ ▀▀▀▀▀▀▀
▀▀ █▀▄▀ ▀▄ ▀▀█▄▄█▄▄ ▄▄▄ █▀ ▀▄▄ ▄▀
▄█▄▀▄▀▀██▀ ▀▀██▄█ ▀▀▄█ ██▀ █▄█▀█▀▀
▄ ▄▀▀ ▀ ▀█▀ ▄█▄▀▄▀ ▀ █ █ █▄▄▀▀▀▀▄█▄█▀
█ ▀▀█▄▀▀█▀█ ▄▀ ▀▀ █▀▄ ▀▄ ██▄▀ ▄█ ▄▀█
█▄▀▀▀ ▀▀ ███▀█▀▄ ▄▄█ ██ █▀▄▀▄ █▀▀▀
▄ █▀▄▀ ▄▀ ▄▀▄ ██ ▀▀█ ▄█ █▀▀▄█▀ ▄ █
█▀▀▄▄ ▀ ▀ ▀▀█ ▀▀▀ ▀▀ █▀██▄▀▀▀███▄█▀
█▀█▀▄█▀██ ██ ▀ █▄█▀ ▀ ██▀ ██▄ █▄█▄▄█
█▀█▀▄▄▀▀▀▄▀▄▀ ▄█ ▄▀█ ▄▀▄ █▄ ▀▀▄█▄▄▀
█▀█▄█ ▀ ▀▀▄█▀ █▄▀ █ ▄ ▄▀▄█ █▄▄█▄▄▀█
▀ ▀▀ ▀▀█▄ ▀ ▀ ▄▄███▄ ▄ █▀▀▀█▀██
█▀▀▀▀▀█ ▀██ █ █▀▀ ▀█▀██▄█▀▄█ ▀ █▄ ▄▀
█ ███ █ █▄██▀ ▀▄▀▀▄█▀ ▄▄▀██▀▀▀█▀▀ ▄ ▀
█ ▀▀▀ █ ▄█▀▀▀▀▄▀▄▄█ ▄▀█▀▄ ▀ ▀█ █▄█
▀▀▀▀▀▀▀ ▀▀ ▀▀ ▀ ▀ ▀ ▀ ▀ ▀ ▀ ▀
</pre>
.center[
👆
Please fill this [feedback form](https://docs.google.com/forms/d/e/1FAIpQLScYloWur4uVhKgVNIdUrfHZ8pk_mBmPcQwmbhjK2FlR9KWDCA/viewform).
Thank you! 🫶
]
</td>
<td style="vertical-align: sub; background: initial;">
Contact information:
📛 Jérôme Petazzoni
<br/>
📩 jerome.petazzoni@gmail.com
<br/>
🔗 https://linkedin.com/in/jpetazzo
<br/>
🦣 https://hachyderm.io/@jpetazzo
I can teach custom courses!<br/>
→ Docker, Kubernetes, MLOps<br/>
→ from intro level to "black belt"<br/>
→ on site or remotely<br/>
Reach out if you're interested!
</td>
</tr>
</table>