Add MLops material for QCON SF 2024

Jérôme Petazzoni
2024-11-18 19:21:18 -06:00
parent 7305bcfe12
commit 0abc67e974
13 changed files with 2908 additions and 0 deletions

slides/k8s/bento-cnpg.md Normal file

@@ -0,0 +1,173 @@
# Bento & PostgreSQL
- Bento can also use SQL databases for input/output
- We're going to demonstrate that by writing to a PostgreSQL database
- That database will be deployed with the CloudNativePG operator
(https://cloudnative-pg.io/)
---
## CNPG in a nutshell
- Free, open source
- Originally created by [EDB] (EnterpriseDB, well-known PgSQL experts)
- Non-exhaustive list of features:
- provisioning of Postgres servers, replicas, bouncers
- automatic failover
- backups (full backups and WAL shipping)
- provisioning from scratch, from backups, PITR
- manual and automated switchover (e.g. for node maintenance)
- and many more!
[EDB]: https://www.enterprisedb.com/workload/kubernetes
---
## What we're going to do
1. Install CNPG.
2. Provision a Postgres cluster.
3. Configure Bento to write to that cluster.
4. Set up a Grafana dashboard to see the data.
---
## 1⃣ Installing CNPG
Many options available, see the [documentation][cnpg-install]:
- raw YAML manifests
- kubectl CNPG plugin (`kubectl cnpg install generate`)
- Helm chart
- OLM
[cnpg-install]: https://cloudnative-pg.io/documentation/1.24/installation_upgrade/
---
## 2⃣ Provisioning a Postgres cluster
Minimal manifest:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: db
spec:
storage:
size: 1Gi
```
---
class: extra-details
## For production...
We might also add:
- `spec.monitoring.enablePodMonitor: true`
- `spec.instances: 2`
- `resources.{requests,limits}.{cpu,memory}`
- `walStorage.size`
- `backup`
- `postgresql.parameters`
See [this manifest][cluster-maximal] for a detailed example.
[cluster-maximal]: https://github.com/jpetazzo/pozok/blob/main/cluster-maximal.yaml
---
## 3⃣ Configuring Bento to write to SQL
- We'll use the [`sql_insert`][sql-insert] output
- If our cluster is named `mydb`, there will be a Secret `mydb-app`
- This Secret will contain a `uri` field
- That field can be used as the `dsn` in the Bento configuration
- We will also need to create the table that we want to use
(see next slide for instructions)
[sql-insert]: https://warpstreamlabs.github.io/bento/docs/components/outputs/sql_insert
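A sketch of that `output` section, assuming the Secret's `uri` field is exposed to Bento as a (hypothetical) `PG_URI` environment variable, and that we only store the city name and population:
```yaml
output:
  sql_insert:
    driver: postgres
    dsn: "${PG_URI}"   # the `uri` field of the `mydb-app` Secret
    table: cities
    columns: [ city, population ]
    args_mapping: root = [ this.city, this.population.int64() ]
    init_statement: |
      CREATE TABLE IF NOT EXISTS cities (
        city varchar(100) NOT NULL,
        population integer
      );
```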
---
## Creating a table
- If we just want to store the city name and its population:
```sql
CREATE TABLE IF NOT EXISTS cities (
city varchar(100) NOT NULL,
population integer
);
```
- This statement can be executed:
- manually, by getting a `psql` shell with `kubectl cnpg psql mydb app`
- automatically, with Bento's `init_statement`
---
## 4⃣ Viewing the table in Grafana
- In Grafana, in the home menu on the left, click "Connections"
- Add a PostgreSQL data source
- Enter the host:port, database, user, password
- Then add a visualization using that data source
(it should be relatively self-explanatory!)
---
class: extra-details
## Automating it all
- Expose PostgreSQL credentials through environment variables
(in the Bento container)
- Use the `${...}` syntax in Bento to use these environment variables
- Export the Grafana dashboard to a JSON file
- Store the JSON file in a ConfigMap, with label `grafana_dashboard=1`
- Create that ConfigMap in the namespace where Grafana is running
- Similarly, data sources (like the Redis and the PostgreSQL one) can be defined in YAML
- And that YAML can be put in a ConfigMap with label `grafana_datasource=1`
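For example, a dashboard ConfigMap could look roughly like this (a sketch; the JSON is whatever we exported from Grafana, and the namespace is wherever Grafana runs):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cities-dashboard
  namespace: prom-system
  labels:
    grafana_dashboard: "1"
data:
  cities-dashboard.json: |
    {"title": "Cities", "panels": []}
```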

slides/k8s/bento-hpa.md Normal file

@@ -0,0 +1,450 @@
# Autoscaling with KEDA
- Cluster autoscaling = automatically add nodes *when needed*
- *When needed* = when Pods are `Pending`
- How do these pods get created?
- When the Ollama Deployment is scaled up
- ... manually (e.g. `kubectl scale`)
- ... automatically (that's what we want to investigate now!)
---
## Ways to implement autoscaling
- Custom code
(e.g. crontab checking some value every few minutes and scaling accordingly)
- Kubernetes Horizontal Pod Autoscaler v1
(aka `kubectl autoscale`)
- Kubernetes Horizontal Pod Autoscaler v2 with custom metrics
(e.g. with Prometheus Adapter)
- Kubernetes Horizontal Pod Autoscaler v2 with external metrics
(e.g. with KEDA)
---
## Custom code
- No, we're not going to do that!
- But this would be an interesting exercise in RBAC
(setting minimal amount of permissions for the pod running our custom code)
---
## HPAv1
Pros: very straightforward
Cons: can only scale on CPU utilization
How it works:
- periodically measures average CPU *utilization* across pods
- if utilization is above/below a target (default: 80%), scale up/down
---
## HPAv1 in practice
- Create the autoscaling policy:
```bash
kubectl autoscale deployment ollama --max=1000
```
(The `--max` is required; it's a safety limit.)
- Check it:
```bash
kubectl describe hpa
```
- Send traffic, wait a bit: pods should be created automatically
---
## HPAv2 custom vs external
- Custom metrics = arbitrary metrics attached to Kubernetes objects
- External metrics = arbitrary metrics not related to Kubernetes objects
--
🤔
---
## HPAv2 custom metrics
- Examples:
- on Pods: CPU, RAM, network traffic...
- on Ingress: requests per second, HTTP status codes, request duration...
- on some worker Deployment: number of tasks processed, task duration...
- Requires an *adapter* to:
- expose the metrics through the Kubernetes *aggregation layer*
- map the actual metrics source to Kubernetes objects
Example: the [Prometheus adapter][prometheus-adapter]
[prometheus-adapter]: https://github.com/kubernetes-sigs/prometheus-adapter
---
## HPAv2 custom metrics in practice
- We're not going to cover this here
(too complex / not enough time!)
- If you want more details, check [my other course material][hpav2slides]
[hpav2slides]: https://2024-10-enix.container.training/4.yml.html#toc-scaling-with-custom-metrics
---
## HPAv2 external metrics
- Examples:
- arbitrary Prometheus query
- arbitrary SQL query
- number of messages in a queue
- and [many, many more][keda-scalers]
- Also requires an extra component to expose the metrics
Example: [KEDA (https://keda.sh/)](https://keda.sh)
[keda-scalers]: https://keda.sh/docs/latest/scalers/
---
## HPAv2 external metrics in practice
- We're going to install KEDA
- And set it up to autoscale depending on the number of messages in Redis
---
## Installing KEDA
Multiple options (details in the [documentation][keda-deploy]):
- YAML
- Operator Hub
- Helm chart 💡
```bash
helm upgrade --install --repo https://kedacore.github.io/charts \
--namespace keda-system --create-namespace keda keda
```
[keda-deploy]: https://keda.sh/docs/latest/deploy/
---
## Scaling according to Redis
- We need to create a KEDA Scaler
- This is done with a "ScaledObject" manifest
- [Here is the documentation][keda-redis-lists] for the Redis Lists Scaler
- Let's write that manifest!
[keda-redis-lists]: https://keda.sh/docs/latest/scalers/redis-lists/
---
## `keda-redis-scaler.yaml`
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ollama
spec:
scaleTargetRef:
name: ollama
triggers:
- type: redis
metadata:
address: redis.`default`.svc:6379
listName: cities
listLength: "10"
```
---
## Notes
- We need to update the `address` field with our namespace
(unless we are running in the `default` namespace)
- Alternative: use `addressFromEnv` and set an env var in the Ollama pods
- `listLength` gives the target ratio of `messages / replicas`
- In our example, KEDA will scale the Deployment to `messages / 10`
(rounded up!)
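For instance, with `addressFromEnv`, the trigger could look like this (a sketch; it assumes a `REDIS_ADDRESS` environment variable defined in the Ollama Pod template):
```yaml
  triggers:
  - type: redis
    metadata:
      addressFromEnv: REDIS_ADDRESS
      listName: cities
      listLength: "10"
```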
---
## Trying it out
- Apply the ScaledObject manifest
- Start a Bento pipeline loading e.g. 100-1000 cities in Redis
(100 on smaller clusters / slower CPUs, 1000 on bigger / faster ones)
- Check pod and node resource usage
- What do we see?
--
🤩 The Deployment scaled up automatically!
--
🤔 But Pod resource usage remains very low (A few busy pods, many idle)
--
💡 Bento doesn't submit enough requests in parallel!
---
## Improving throughput
We're going to review multiple techniques:
1. Increase parallelism inside the Bento pipeline.
2. Run multiple Bento consumers.
3. Couple consumers and processors more tightly.
---
## 1⃣ Increase pipeline parallelism
- Set `parallel` to `true` in the `http` processor
- Wrap the input around a `batched` input
(otherwise, we don't have enough messages in flight)
- Increase `http` timeout significantly (e.g. to 5 minutes)
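Here is a sketch of those changes, assuming the `redis_list` input and the enrichment workflow from earlier (batch sizes and timeouts are arbitrary values):
```yaml
input:
  batched:
    child:
      redis_list:
        url: redis://redis:6379
        key: cities
    policy:
      count: 50     # keep up to 50 messages in flight per batch
      period: 10s   # ...or whatever arrived during the last 10 seconds
pipeline:
  processors:
    - branch:
        # request_map unchanged...
        processors:
          - http:
              url: http://ollama:11434/api/generate
              verb: POST
              parallel: true   # send the requests of a batch concurrently
              timeout: 5m      # completions can take a while under load
        # result_map unchanged...
```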
---
## Results
🎉 More messages flow through the pipeline
🎉 Many requests happen in parallel
🤔 Average Pod and Node CPU utilization is higher, but not maxed out
🤔 HTTP queue size (measured with HAProxy metrics) is relatively high
🤔 Latency is higher too
Why?
---
## Too many requests in parallel
- Earlier, we didn't have enough...
- ...Now, we have too many!
- However, for a very big request queue, it still wouldn't be enough
💡 We currently have a fixed parallelism. We need to make it dynamic!
---
## 2⃣ Run multiple Bento consumers
- Restore the original Bento configuration
(flip `parallel` back to `false`; remove the `batched` input)
- Run Bento in a Deployment
(e.g. with the [Bento Helm chart][bento-helm-chart])
- Autoscale that Deployment like we autoscaled the Ollama Deployment
[bento-helm-chart]: https://github.com/warpstreamlabs/bento-helm-chart
---
## Results
🤔🤔🤔 Pretty much the same as before!
(High throughput, high utilization but not maxed out, high latency...)
--
🤔🤔🤔 Why?
---
## Unbalanced load balancing
- All our requests go through the `ollama` Service
- We're still using the default Kubernetes service proxy!
- It doesn't spread the requests properly across all the backends
---
## 3⃣ Couple consumers and processors
What if:
--
instead of sending requests to a load balancer,
--
each queue consumer had its own Ollama instance?
---
## Current architecture
<pre class="mermaid">
flowchart LR
subgraph P1["Pod"]
H1["HAProxy"] --> O1["Ollama"]
end
subgraph P2["Pod"]
H2["HAProxy"] --> O2["Ollama"]
end
subgraph P3["Pod"]
H3["HAProxy"] --> O3["Ollama"]
end
Q["Queue<br/>(Redis)"] <--> C["Consumer<br/>(Bento)"] --> LB["Load Balancer<br/>(kube-proxy)"]
LB --> H1 & H2 & H3
</pre>
---
## Proposed architecture
<pre class="mermaid">
flowchart LR
subgraph P1["Consumer Pod"]
C1["Bento"] --> H1["HAProxy"] --> O1["Ollama"]
end
subgraph P2["Consumer Pod"]
C2["Bento"] --> H2["HAProxy"] --> O2["Ollama"]
end
subgraph P3["Consumer Pod"]
C3["Bento"] --> H3["HAProxy"] --> O3["Ollama"]
end
Queue["Queue"] <--> C1 & C2 & C3
</pre>
---
## 🏗️ Let's build something!
- Let's implement that architecture!
- See next slides for hints / getting started
---
## Hints
We need to:
- Update the Bento consumer configuration to talk to localhost
- Store that configuration in a ConfigMap
- Add a Bento container to the Ollama Deployment
- Profit!
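Putting these hints together, the extra pieces of the Ollama Deployment could look roughly like this (a sketch; image, ConfigMap name, and paths are illustrative):
```yaml
spec:
  template:
    spec:
      containers:
        # ...the existing ollama and haproxy containers stay as they are...
        - name: bento
          image: ghcr.io/warpstreamlabs/bento
          args: [ "-c", "/bento/consumer.yaml" ]
          volumeMounts:
            - name: bento-config
              mountPath: /bento
      volumes:
        - name: bento-config
          configMap:
            name: bento-consumer   # holds consumer.yaml, pointing at localhost
```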
---
## Results
🎉 Node and Pod utilization is maximized
🎉 HTTP queue size is bounded
🎉 Deployment autoscales up and down
---
## ⚠️ Scaling down
- Eventually, there are fewer messages in the queue
- The HPA scales down the Ollama Deployment
- This terminates some Ollama Pods
🤔 What happens if these Pods were processing requests?
--
- The requests might be lost!
---
## Avoiding lost messages
Option 1:
- cleanly shut down the consumer
- make sure that Ollama can complete in-flight requests
(by extending its grace period)
- find a way to terminate Ollama when no more requests are in flight
Option 2:
- use *message acknowledgement*
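For the "extend its grace period" part of option 1, a minimal sketch (the value is illustrative):
```yaml
spec:
  template:
    spec:
      # Give in-flight completions up to 10 minutes after the Pod is asked
      # to terminate (the default grace period is 30 seconds).
      terminationGracePeriodSeconds: 600
```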

slides/k8s/bento-intro.md Normal file

@@ -0,0 +1,628 @@
# Getting started with Bento
How can we move to a message queue architecture...
*...without rewriting a bunch of code?*
🤔
---
## Bento
https://bento.dev/
"Fancy stream processing made operationally mundane"
"Written in Go, deployed as a static binary, declarative configuration. Open source and cloud native as utter heck."
With ✨ amazing ✨ documentation 😍
---
class: extra-details
## Tiny bit of history
- Original project: Benthos
- May 30, 2024: [Redpanda acquires Benthos][redpanda-acquires-benthos]
- Benthos is now Redpanda Connect
- some parts have been relicensed as commercial products
- May 31, 2024: [Warpstream forks Benthos][warpstream-forks-benthos]
- that fork is named "Bento"
- it's fully open source
- We're going to use Bento here, but Redpanda Connect should work fine too!
[redpanda-acquires-benthos]: https://www.redpanda.com/press/redpanda-acquires-benthos
[warpstream-forks-benthos]: https://www.warpstream.com/blog/announcing-bento-the-open-source-fork-of-the-project-formerly-known-as-benthos
---
## Bento concepts
- Message stream processor
- Each pipeline is configured by a YAML configuration that defines:
- input (where do we get the messages?)
- pipeline (optional: how do we transform the messages?)
- output (where do we put the messages afterwards?)
- Once Bento is started, it runs the pipelines forever
(except for pipelines that have a logical end, e.g. reading from a file)
- Embedded language (Bloblang) to manipulate/transform messages
---
## Messages
- Typically JSON objects
(but raw strings are also possible)
- Nesting, arrays, etc. are OK
---
## Getting started with Bento
We're going to:
1. Import a bunch of cities from a CSV file into a Redis queue.
2. Read back these cities using a web server.
3. Use an "enrichment workflow" to query our LLM for each city.
---
## 1⃣ Importing cities
Let's break down the work:
- download the data set
- create the Bento configuration
- deploy Redis
- start Bento
---
## Downloading the data set
- Example database:
https://www.kaggle.com/datasets/juanmah/world-cities
- Let's download and uncompress the data set:
```bash
curl -fsSL https://www.kaggle.com/api/v1/datasets/download/juanmah/world-cities |
funzip > cities.csv
```
(Ignore the "length error", it's harmless!)
- Check the structure of the data set:
```bash
head cities.csv
```
---
## Creating the Bento configuration
- We need to find which `input` and `output` to use
- Check the list with `bento list` or the [documentation]
- Then run `bento create INPUTNAME/PROCESSORNAME/OUTPUTNAME`
- Generate a configuration file:
```bash
bento create csv//redis_list > csv2redis.yaml
```
- Edit that configuration file; look for the `(required)` parameters
(Everything else can go away!)
[documentation]: https://warpstreamlabs.github.io/bento/docs/components/inputs/about/
---
## Resulting configuration
If we trim all the default values, here is the result:
```yaml
input:
csv:
paths: ["cities.csv"]
output:
redis_list:
url: redis://redis:6379 # No default (required)
key: cities
```
We'll save that configuration as `csv2redis.yaml`.
---
## Deploying Redis
- Create a Deployment:
```bash
kubectl create deployment redis --image redis
```
- Expose it:
```bash
kubectl expose deployment redis --port 6379
```
---
## Starting Bento
Option 1: run it manually in a pod, to see what's going on.
```bash
bento --config csv2redis.yaml
```
Option 2: run it with e.g. the Bento Helm chart.
*We're not going to do that yet, since this particular pipeline has a logical end.*
*(The Helm chart is best suited to pipelines that run forever.)*
---
## Expected output
.small[
```
INFO Running main config from specified file @service=bento bento_version="" path=csv2redis.yaml
INFO Launching a Bento instance, use CTRL+C to close @service=bento
INFO Listening for HTTP requests at: http://0.0.0.0:4195 @service=bento
INFO Input type csv is now active @service=bento label="" path=root.input
INFO Output type redis_list is now active @service=bento label="" path=root.output
INFO Pipeline has terminated. Shutting down the service @service=bento
```
]
The pipeline should complete in just a few seconds.
---
## Checking what's in Redis
- Connect to our Redis instance:
```bash
redis-cli -h redis
```
- List keys:
```redis
KEYS *
```
- Check that the `cities` list has approx. 47000 elements:
```redis
LLEN cities
```
- Get the first element of the list:
```redis
LINDEX cities 0
```
---
## Fun with Bloblang
- Let's add a filter to keep only cities with a population above 10,000,000
- Add the following block to the Bento configuration:
```yaml
pipeline:
processors:
- switch:
- check: this.population == ""
processors:
- mapping: root = deleted()
- check: this.population.int64() < 10000000
processors:
- mapping: root = deleted()
```
(See the [docs][switch-docs] for details about the `switch` processor.)
[switch-docs]: https://warpstreamlabs.github.io/bento/docs/components/processors/switch/
---
## Testing our processor
- First, delete the existing `cities` list:
```bash
redis-cli -h redis DEL cities
```
- Then, run the Bento pipeline again:
```bash
bento --config csv2redis.yaml
```
(It should complain about a few cities where the population has a decimal point.)
- Check how many cities were loaded:
```bash
redis-cli -h redis LLEN cities
```
(There should be 47.)
---
## 2⃣ Consume the queue over HTTP
- We want to "get the next city" in the queue with a simple `curl`
- Our input will be `redis_list`
- Our output will be `http_server`
---
## Generate the Bento configuration
Option 1: `bento create redis_list//http_server`
Option 2: [read the docs][output-http-server]
[output-http-server]: https://warpstreamlabs.github.io/bento/docs/components/outputs/http_server
---
## 🙋 Choose your adventure
Do you want to try to write that configuration?
Or shall we see it right away?
--
⚠️ Spoilers on next slide!
---
## `redis2http.yaml`
```yaml
input:
redis_list:
url: redis://redis:6379
key: cities
output:
http_server:
path: /nextcity
```
This will set up an HTTP route to fetch *one* city.
It's also possible to batch, stream...
---
## Trying it out
- Run Bento with that configuration:
```bash
bento --config redis2http.yaml &
```
- Retrieve one city:
```bash
curl http://localhost:4195/nextcity
```
- Check what happens after we retrieve *all* the cities!
---
## 3⃣ Query our LLM for each city
- We want to ask our LLM who's the mayor of each of these cities
- We'll use a prompt that will usually ensure a short answer
(so that it's faster; we don't want to wait 30 seconds per city!)
- We'll test the prompt with the Ollama CLI
- Then we'll craft a proper HTTP API query
- Finally, we'll configure an [enrichment workflow][enrichment] in Bento
[enrichment]: https://warpstreamlabs.github.io/bento/cookbooks/enrichments/
---
## Test our prompt
Assuming that our earlier Ollama Deployment is still running:
```bash
kubectl exec deployment/ollama -- \
ollama run qwen2:1.5b "
Who is the mayor of San Francisco?
Just give the name by itself on a single line.
If you don't know, don't say anything.
"
```
---
## Turn the prompt into an HTTP API query
Note: to install `http` in an Alpine container, run `apk add httpie`.
```bash
http http://ollama.default:11434/api/generate \
model=qwen2:1.5b stream:=false prompt="
Who is the mayor of Paris?
Just give the name by itself on a single line.
If you don't know, don't say anything.
"
```
We get a JSON payload, and we want to use the `response` field.
---
## Configure an enrichment workflow
The [documentation][enrichment] is really good!
We need to set up:
- a `branch` processor
- a `request_map` to transform the city into an Ollama request
- an `http` processor to submit the request to Ollama
- a `result_map` to transform the Ollama response
[enrichment]: https://warpstreamlabs.github.io/bento/cookbooks/enrichments/
---
## Without the `branch` processor
<pre class="mermaid">
flowchart LR
CITY["
city: Paris
country: France
population: 1106000
iso2: FR
...
"]
REQ["
model: qwen2:1.5b
stream: false
prompt: Who is the mayor of Paris?
"]
REP["
response: Anne Hidalgo
eval_count: ...
prompt_eval_count: ...
(other ollama fields)
"]
CITY@{ shape: card}
REQ@{ shape: card}
REP@{ shape: card}
style CITY text-align: left
style REQ text-align: left
style REP text-align: left
mapping@{ shape: diam }
http["http processor"]@{ shape: diam }
CITY --> mapping --> REQ --> http --> REP
</pre>
- We transform the `city` into an Ollama request
- The `http` processor submits the request to Ollama
- The final output is the Ollama response
---
## With the `branch` processor
<pre class="mermaid">
flowchart LR
CITY["
city: Paris
country: France
population: 1106000
iso2: FR
...
"]
REQ["
model: qwen2:1.5b
stream: false
prompt: Who is the mayor of Paris?
"]
REP["
response: Anne Hidalgo
eval_count: ...
prompt_eval_count: ...
(other ollama fields)
"]
OUT["
city: Paris
country: France
population: 1106000
iso2: FR
...
mayor: Anne Hidalgo
"]
CITY@{ shape: card}
REQ@{ shape: card}
REP@{ shape: card}
OUT@{ shape: card}
style CITY text-align: left
style REQ text-align: left
style REP text-align: left
style OUT text-align: left
branch@{ shape: diam }
request_map@{ shape: diam }
result_map@{ shape: diam }
http["http processor"]@{ shape: diam }
CITY --> branch
branch --> result_map
branch --> request_map
request_map --> REQ
REQ --> http
http --> REP
REP --> result_map
result_map --> OUT
</pre>
- The `branch` processor allows to do the processing "on the side"
- `request_map` and `result_map` transform the message before/after processing
- Then, the result is combined with the original message (the `city`)
---
```yaml
input:
csv:
paths: ["cities.csv"]
pipeline:
processors:
- branch:
request_map: |
root.model = "qwen2:1.5b"
root.stream = false
root.prompt = (
"Who is the mayor of %s? ".format(this.city) +
"Just give the name by itself on a single line. " +
"If you don't know, don't say anything."
)
processors:
- http:
url: http://ollama:11434/api/generate
verb: POST
result_map: |
root.mayor = this.response
```
---
## Trying it out
- Save the YAML on the previous page into a configuration file
- Run Bento with that configuration file
- What happens?
--
🤔 We're seeing errors due to timeouts
```
ERRO HTTP request to 'http://ollama...' failed: http://ollama...:
Post "http://ollama...": context deadline exceeded
(Client.Timeout exceeded while awaiting headers)
```
---
## 🙋 Choose your adventure
How should we address errors?
- Option 1: increase the timeout in the [http][doc-http] processor
- Option 2: use a [retry][doc-retry] processor in the pipeline
- Option 3: use a [reject_errored][doc-reject] output
[doc-http]: https://warpstreamlabs.github.io/bento/docs/components/processors/http/
[doc-retry]: https://warpstreamlabs.github.io/bento/docs/components/processors/retry
[doc-reject]: https://warpstreamlabs.github.io/bento/docs/components/outputs/reject_errored
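For reference, here is a sketch combining options 1 and 3 (the output destination is an assumption; adjust it to wherever you store your results):
```yaml
pipeline:
  processors:
    - branch:
        # request_map unchanged...
        processors:
          - http:
              url: http://ollama:11434/api/generate
              verb: POST
              timeout: 5m   # option 1: allow slow completions
        # result_map unchanged...
output:
  # option 3: messages that still carry an error are rejected (nacked)
  # instead of being written to the output
  reject_errored:
    redis_list:
      url: redis://redis:6379
      key: mayors
```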
---
## 🏗️ Let's build something!
- We want to process 1000 cities with our LLM
(guessing who the mayor is, or something similar)
- Store the output wherever we want
(Redis, CSV file, JSONL files...)
- Deal correctly with errors
(we'll check that there are, indeed, 1000 cities in the output)
- Scale out to process faster
(scale ollama to e.g. 10 replicas, enable parallelism in Bento)
---
class: title
🍱 Lunch time! 🍱
---
## What happened?
- If your Ollama pods have *resource requests*:
→ your cluster may have auto-scaled
- If your Ollama pods don't have *resource requests*:
→ you probably have a bunch of container restarts, due to out-of-memory errors
🤔 What's that about?

slides/k8s/bento-rmq.md Normal file

@@ -0,0 +1,250 @@
# Bento & RabbitMQ
- In some of the previous runs, messages were dropped
(we start with 1000 messages in `cities` and have e.g. 955 in `mayors`)
- This is caused by various errors during processing
(e.g. too many timeouts; Bento being shutdown halfway through...)
- ...And by the fact that we are using a Redis queue
(which doesn't offer delivery guarantees or acknowledgements)
- Can we get something better?
---
## The problem
- Some inputs (like `redis_list`) don't support *acknowledgements*
- When a message is pulled from the queue, it is deleted immediately
- If the message is lost for any reason, it is lost permanently
---
## The solution
- Some inputs (like `amqp_0_9`) support acknowledgements
- When a message is pulled from the queue:
- it is not visible anymore to other consumers
- it needs to be explicitly acknowledged
- The acknowledgement is done by Bento when the message reaches the output
- The acknowledgement deletes the message
- No acknowledgement after a while? Consumer crashes/disconnects?
Message gets requeued automatically!
---
## `amqp_0_9`
- Protocol used by RabbitMQ
- Very simplified behavior:
- messages are published to an [*exchange*][amqp-exchanges]
- messages have a *routing key*
- the exchange routes the message to one (or zero or more) queues
<br/>(possibly using the routing key or message headers to decide which queue(s))
- [*consumers*][amqp-consumers] subscribe to queues to receive messages
[amqp-exchanges]: https://www.rabbitmq.com/tutorials/amqp-concepts#exchanges
[amqp-consumers]: https://www.rabbitmq.com/tutorials/amqp-concepts#consumers
---
## Using the default exchange
- There is a default exchange (called `""` - empty string)
- The routing key indicates the name of the queue to deliver to
- The queue needs to exist (we need to create it beforehand)
---
class: extra-details
## Defining custom exchanges
- Create an exchange
- exchange types: direct, fanout, topic, headers
- durability: persisted to disk to survive server restart or not?
- Create a binding
- which exchange?
- which routing key? (for direct exchanges)
- which queue?
---
## RabbitMQ on Kubernetes
- RabbitMQ can be deployed on Kubernetes:
- directly (creating e.g. a StatefulSet)
- with the RabbitMQ operator
- We're going to do the latter!
- The operator includes the "topology operator"
(to configure queues, exchanges, and bindings through custom resources)
---
## Installing the RabbitMQ operator
- Let's install it with this Helm chart:
```bash
helm upgrade --install --repo https://charts.bitnami.com/bitnami \
--namespace rabbitmq-system --create-namespace \
rabbitmq-cluster-operator rabbitmq-cluster-operator
```
---
## Deploying a simple RabbitMQ cluster
- Let's use the YAML manifests in that directory:
https://github.com/jpetazzo/beyond-load-balancers/tree/main/rabbitmq
- This creates:
- a `RabbitmqCluster` called `mq`
- a `Secret` called `mq-default-user` containing access credentials
- a durable `Queue` named `q1`
(We can ignore the `Exchange` and the `Binding`, we won't use them.)
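With the topology operator, declaring a durable queue like `q1` looks roughly like this (a sketch):
```yaml
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  name: q1        # name of the queue in RabbitMQ
  durable: true   # survive broker restarts
  rabbitmqClusterReference:
    name: mq      # the RabbitmqCluster created above
```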
---
## 🏗️ Let's build something!
Let's replace the `cities` Redis list with our RabbitMQ queue.
(See next slide for steps and hints!)
---
## Steps
1. Edit the Bento configuration for our "CSV importer".
(replace the `redis_list` output with `amqp_0_9`)
2. Run that pipeline and confirm that messages show up in RabbitMQ.
3. Edit the Bento configuration for the Ollama consumer.
(replace the `redis_list` input with `amqp_0_9`)
4. Trigger a scale up of the Ollama consumer.
5. Update the KEDA Scaler to use RabbitMQ instead of Redis.
---
## 1⃣ Sending messages to RabbitMQ
- Edit our Bento configuration (the one feeding the CSV file to Redis)
- We want the following `output` section:
```yaml
output:
amqp_0_9:
exchange: ""
key: q1
mandatory: true
urls:
- "${AMQP_URL}"
```
- Then set the `AMQP_URL` environment variable, using the `connection_string` key of the Secret `mq-default-user`
💡 Yes, we can directly use environment variables in Bento configuration!
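If Bento runs in a Pod, one way to set that variable is a `valueFrom` reference in the container spec (a sketch):
```yaml
env:
  - name: AMQP_URL
    valueFrom:
      secretKeyRef:
        name: mq-default-user
        key: connection_string
```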
---
## 2⃣ Testing our AMQP output
- Run the Bento pipeline
- To check that our messages made it:
```bash
kubectl exec mq-server-0 -- rabbitmqctl list_queues
```
- We can also use Prometheus metrics, e.g. `rabbitmq_queue_messages`
---
## 3⃣ Receiving messages from RabbitMQ
- Edit our other Bento configuration (the one in the Ollama consumer Pod)
- We want the following `input` section:
```yaml
input:
amqp_0_9:
urls:
- `amqp://...:5672/`
queue: q1
```
---
## 4⃣ Triggering Ollama scale up
- If the autoscaler is configured to scale to zero, disable it
(easiest solution: delete the ScaledObject)
- Then manually scale the Deployment to e.g. 4 Pods
- Check that messages are processed and show up in the output
(it should still be a Redis list at this point)
---
## 5⃣ Autoscaling on RabbitMQ
- We need to update our ScaledObject
- Check the [RabbitMQ Queue Scaler][keda-rabbitmq]
- Multiple ways to pass the AMQP URL:
- hardcode it (easier solution for testing!)
- use `...fromEnv` and set environment variables in target pod
- create and use a TriggerAuthentication
💡 Since we have the AMQP URL in a Secret, TriggerAuthentication works great!
[keda-rabbitmq]: https://keda.sh/docs/latest/scalers/rabbitmq-queue/
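A sketch of the TriggerAuthentication route (resource names are illustrative):
```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: mq-auth
spec:
  secretTargetRef:
    - parameter: host
      name: mq-default-user
      key: connection_string
```
The ScaledObject's trigger would then reference it:
```yaml
    - type: rabbitmq
      metadata:
        queueName: q1
        mode: QueueLength
        value: "10"
      authenticationRef:
        name: mq-auth
```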

slides/k8s/handson-mlops.md Normal file

@@ -0,0 +1,132 @@
class: title
*Tell me and I forget.*
<br/>
*Teach me and I remember.*
<br/>
*Involve me and I learn.*
Misattributed to Benjamin Franklin
[(Probably inspired by Chinese Confucian philosopher Xunzi)](https://www.barrypopik.com/index.php/new_york_city/entry/tell_me_and_i_forget_teach_me_and_i_may_remember_involve_me_and_i_will_lear/)
---
## Hands-on sections
- There will be *a lot* of examples and demos
- If you are attending a live workshop:
- follow along the demos, ask questions at any time
- if you can, try to run some of the examples and demos in your environment
- if things are going too fast, ask the trainer to slow down :)
- If you are watching a recording or only reading the slides:
- it is **strongly** recommended to run **all** the examples and demos
- take advantage of the fact that you can pause at any time
---
class: in-person
## Where are we going to run our containers?
---
class: in-person, pic
![You get a cluster](images/you-get-a-cluster.jpg)
---
## If you're attending a live training or workshop
- Each person gets a private lab environment
- Your lab environments will be available for the duration of the workshop
(check with your instructor to know exactly when they'll be shut down)
- Note that for budget reasons¹, your environment will be fairly modest
- scenario 1: 4 nodes with 2 cores and 4 GB RAM ; no cluster autoscaling
- scenario 2: 1 node with 4 cores and 8 GB RAM ; cluster autoscaling
.footnote[¹That cloud thing is mighty expensive, yo]
---
## Running your own lab environment
- If you are following a self-paced course...
- Or watching a replay of a recorded course...
- ...You will need to set up a local environment for the labs
*or*
- If you want to use a specific cloud provider...
- Or want to see these concepts "at scale"...
- ...You can set up your own clusters with whatever capacity suits you
---
## Deploying your own Kubernetes cluster
- You need cloud provider credentials for this
- Option 1: use the cloud provider CLI, web UI, ...
- Option 2: use [one of these Terraform configurations][one-kubernetes]
(set `cluster_name`, `node_size`, `max_nodes_per_pool`, `location`, and GO!)
[one-kubernetes]: https://github.com/jpetazzo/container.training/tree/main/prepare-labs/terraform/one-kubernetes
---
## Deploying your own Kubernetes cluster.red[**s**]
- If you want to deliver your own training or workshop:
- deployment scripts are available in the [prepare-labs] directory
- you can use them to automatically deploy many lab environments
- they support many different infrastructure providers
- they can deploy dozens (even hundreds) of clusters at a time
[prepare-labs]: https://github.com/jpetazzo/container.training/tree/main/prepare-labs
---
class: in-person
## Why don't we run containers locally?
- Installing this stuff can be hard on some machines
(32-bit CPU or OS... Laptops without administrator access... etc.)
- *"The whole team downloaded all these container images from the WiFi!
<br/>... and it went great!"* (Literally no-one ever)
- All you need is a computer (or even a phone or tablet!), with:
- an Internet connection
- a web browser
- an SSH client
- Some of the demos require multiple nodes to demonstrate scaling

slides/k8s/helmfile.md Normal file

@@ -0,0 +1,165 @@
# Managing our stack with `helmfile`
- We've installed a few things with Helm
- And others with raw YAML manifests
- Perhaps you've used Kustomize sometimes
- How can we automate all this? Make it reproducible?
---
## Requirements
- We want something that is *idempotent*
= running it 1, 2, 3 times, should only install the stack once
- We want something that handles updates
= modifying / reconfiguring without restarting from scratch
- We want something that is configurable
= with e.g. configuration files, environment variables...
- We want something that can handle *partial removals*
= ability to remove one element without affecting the rest
- Inspiration: Terraform, Docker Compose...
---
## Shell scripts?
✅ Idempotent, thanks to `kubectl apply -f`, `helm upgrade --install`
✅ Handles updates (edit script, re-run)
✅ Configurable
❌ Partial removals
If we remove an element from our script, it won't be uninstalled automatically.
---
## Umbrella chart?
Helm chart with dependencies on other charts.
✅ Idempotent
✅ Handles updates
✅ Configurable (with Helm values: YAML files and `--set`)
✅ Partial removals
❌ Complex (requires to learn advanced Helm features)
❌ Requires everything to be a Helm chart (adds (lots of) boilerplate)
---
## Helmfile
https://github.com/helmfile/helmfile
✅ Idempotent
✅ Handles updates
✅ Configurable (with values files, environment variables, and more)
✅ Partial removals
✅ Fairly easy to get started
🐙 Sometimes feels like summoning unspeakable powers / staring down the abyss
---
## What `helmfile` can install
- Helm charts from remote Helm repositories
- Helm charts from remote git repositories
- Helm charts from local directories
- Kustomizations
- Directories with raw YAML manifests
---
## How `helmfile` works
- Everything is defined in a main `helmfile.yaml`
- That file defines:
- `repositories` (remote Helm repositories)
- `releases` (things to install: Charts, YAML...)
- `environments` (optional: to specialize prod vs staging vs ...)
- Helm-style values files can be loaded in `environments`
- These values can then be used in the rest of the Helmfile
- Examples: [install essentials on a cluster][helmfile-ex-1], [run a Bento stack][helmfile-ex-2]
[helmfile-ex-1]: https://github.com/jpetazzo/beyond-load-balancers/blob/main/helmfile.yaml
[helmfile-ex-2]: https://github.com/jpetazzo/beyond-load-balancers/blob/main/bento/helmfile.yaml
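A minimal `helmfile.yaml` could look like this (a sketch; names, namespaces, and paths are illustrative):
```yaml
repositories:
  - name: kedacore
    url: https://kedacore.github.io/charts

releases:
  - name: keda
    namespace: keda-system
    chart: kedacore/keda
  - name: bento-pipeline
    namespace: default
    chart: ./bento          # local directory (chart or raw manifests)
    values:
      - bento-values.yaml
```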
---
## `helmfile` commands
- `helmfile init` (optional; downloads plugins if needed)
- `helmfile apply` (updates all releases that have changed)
- `helmfile sync` (updates all releases even if they haven't changed)
- `helmfile destroy` (guess!)
---
## Helmfile tips
As seen in [this example](https://github.com/jpetazzo/beyond-load-balancers/blob/main/bento/helmfile.yaml#L21):
- variables can be used to simplify the file
- configuration values and secrets can be loaded from external sources
(Kubernetes Secrets, Vault... See [vals] for details)
- current namespace isn't exposed by default
- there's often more than one way to do it!
(this particular section could be improved by using Bento `${...}`)
[vals]: https://github.com/helmfile/vals
---
## 🏗️ Let's build something!
- Write a helmfile (or two) to set up today's entire stack on a brand new cluster!
- Suggestion:
- one helmfile for singleton, cluster-wide components
<br/>
(All our operators: Prometheus, Grafana, KEDA, CNPG, RabbitMQ Operator)
- one helmfile for the application stack
<br/>
(Bento, PostgreSQL cluster, RabbitMQ)


@@ -0,0 +1,53 @@
## What we will / won't cover
- Kubernetes provides low-level building blocks (pods, deployments, services...)
- There are many high-level frameworks out there for serverless, AI...:
[Knative](https://knative.dev/docs/),
[KubeAI](https://www.kubeai.org/),
[Kueue](https://kueue.sigs.k8s.io/)...
- We're going to sit somewhere in the middle:
reimplement some of the features of these high-level frameworks, in a flexible way
- This workshop will (hopefully!) give you a better eye to evaluate these frameworks, too
- We won't showcase GPUs today for budget reasons
(giving everyone a few GPU nodes would be prohibitive, sorry!)
---
## A word about our demo app
- We'll use Ollama with a relatively small LLM
(qwen2:1.5b)
- We'll use it to generate very short completions
(a few seconds of CPU)
- All the challenges that we will address are also visible on longer requests
(in fact, they are even more visible on longer requests!)
- We're sticking to short requests to save time and cover a lot of ground today
(but feel free to use more expensive prompts if you'd like!)
---
## Tiny bit of backstory...
The original prompt that we used when building the first version of this content was:
```
If you go to {city}, I suggest that you
```
This would typically take 10-30 seconds - and with much bigger Kubernetes nodes.
Today, we suggest using a prompt that generates shorter answers!

slides/k8s/ollama-intro.md Normal file

@@ -0,0 +1,321 @@
# Ollama in a nutshell
https://ollama.dev
"Get up and running with large language models"
"Docker, but for LLMs"
- Server to host (run) LLMs
- Controlled with CLI or API
- Download a model with `ollama pull`
- Run inference with `ollama run`
---
## Quick demo
⚠️ **Important note 1:** the commands in this section aren't meant
to be executed on your Kubernetes clusters. They are meant to
be executed on a local machine, and they assume that Ollama is
installed and running. If you don't have Ollama on your local
machine, it's OK to skip these demos!
⚠️ **Important note 2:** the models used by Ollama are fairly big
(1.5 GB for the one used here; up to 10s or 100s of GB for bigger
models). We do not recommend downloading them on conference WiFi.
Assuming Ollama is installed and running:
```
ollama run qwen2:1.5b "What's the solution to global warming?"
```
We're going to use that model because it's relatively small.
Many others are available (see https://ollama.dev/search).
---
## Other useful commands
- Start an interactive chat session:
```bash
ollama run qwen2:1.5b
```
- Pull a model (or check for updates):
```bash
ollama pull qwen2:1.5b
```
- See information on a model:
```bash
ollama show qwen2:1.5b
```
---
## Models on disk, in memory
- See models available on disk:
```bash
ollama list
```
- See models loaded in memory:
```bash
ollama ps
```
- Unload a model:
```bash
ollama stop qwen2:1.5b
```
Models are automatically unloaded after 5 minutes (by default).
Ollama loads models in RAM, and in VRAM if it detects a supported GPU.
---
# Ollama on Kubernetes
Let's run Ollama on our Kubernetes cluster!
- Option 1: `kubectl run`
- Option 2: create a Deployment and a Service
- Option 3: use a Helm chart
---
## 1⃣ `kubectl run`
Note: the `ollama/ollama` image is quite big (~2 GB transfer, ~4 GB on disk).
```bash
kubectl run ollama --image ollama/ollama
```
Wait for the pod to be up and running:
```bash
kubectl wait pod ollama --for=condition=Ready
```
(If that command times out, try again and/or specify a higher timeout.)
```bash
kubectl exec ollama -- ollama run qwen2:1.5b "What's Bach's best piece?"
```
Shutdown the pod:
```bash
kubectl delete pod ollama
```
---
## 2⃣ Deployment + Service
Create the Deployment:
```bash
kubectl create deployment ollama --image ollama/ollama
```
Create the Service:
```bash
kubectl create service clusterip ollama --tcp 11434
```
Wait for the Service Endpoints to be available:
```bash
kubectl wait endpoints ollama --for=jsonpath={..ip}
```
---
## By the way... Why port 11434?
| 1 | 1 | 4 | 3 | 4 |
|---|---|---|---|---|
| L | L | A | M | A |
---
## Connecting to the Service
Let's use the `/api/generate` endpoint:
```bash
kubectl run httpclient --rm -it --image alpine/httpie -- --ignore-stdin \
http://ollama:11434/api/generate \
model=qwen2:1.5b prompt="Write a limerick about Kubernetes"
```
(See [Ollama API docs](https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion) for details.)
--
🤔 We get an error: the model needs to be downloaded first.
💡 When we used the `ollama run` CLI command earlier, it did it automatically for us.
---
## Pulling the model
Method 1:
```bash
kubectl exec deployment/ollama -- ollama pull qwen2:1.5b
```
Method 2:
```bash
kubectl run httpclient --rm -it --image alpine/httpie -- --ignore-stdin \
http://ollama:11434/api/pull \
name=qwen2:1.5b
```
---
## Houston, we (are going to) have a problem...
- This works when there is only one pod
- What happens if we scale up the Deployment?
- We need to pull the model on every pod
- How should we do that?
---
## Potential solutions
- Bake the model in the image
🙅 Personal opinion: this is a bad idea (image size, maintenance...)
- Directly send a "pull" command to each pod, individually
🙁 Hackish, not great
- Use a Kubernetes lifecycle hook
💡 That works!
- Use a sidecar container to pull the model
🤔 Doable, but more work than the lifecycle hook
---
## 🙋 Choose your adventure
Should we add that lifecycle hook?
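If we do, a minimal sketch could look like this, added to the Ollama container in the Deployment (the wait loop is an assumption, since the hook runs while the server is still starting up):
```yaml
      containers:
        - name: ollama
          image: ollama/ollama
          lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    # Wait for the Ollama API to come up, then pull the model.
                    until ollama list; do sleep 1; done
                    ollama pull qwen2:1.5b
```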
---
## 3⃣ Helm chart
- Let's check [ArtifactHub] for an Ollama Helm chart
- The most popular (as of November 2024) is [this one, by OTWLD][ollama-chart]
- ~~It has pockets~~
- It can pre-pull models! 🎉
[ArtifactHub]: https://artifacthub.io
[ollama-chart]: https://artifacthub.io/packages/helm/ollama-helm/ollama
---
## Installing the Helm chart
Traditional method:
```bash
helm repo add ollama https://otwld.github.io/ollama-helm/
helm install ollama ollama/ollama --set ollama.models={qwen2:1.5b}
```
Idempotent¹, single-command method:
```bash
helm upgrade --install --repo https://otwld.github.io/ollama-helm/ \
ollama ollama --set ollama.models={qwen2:1.5b}
```
.footnote[¹Idempotent: which can be executed multiple times without adverse effect.]
---
## Testing the Helm installation
Just like before:
```bash
kubectl run httpclient --rm -it --image alpine/httpie -- --ignore-stdin \
http://ollama:11434/api/generate \
model=qwen2:1.5b prompt="Write a limerick about YAML" stream:=false
```
And while we're here, check resource usage:
```bash
kubectl exec deployment/ollama -ti -- top
```
There should be two processes:
- `ollama` itself, relatively small (~100 MB)
- the LLM subprocess, relatively big (~1.4 GB for qwen2:1.5b)
---
## Sending some load
We're going to use `hey`:
```bash
kubectl run hey --rm -it --image nixery.dev/hey -- \
hey -c 10 -n 10 -t 60 -m POST \
-d '{"model": "qwen2:1.5b", "prompt": "vi or emacs?"}' \
http://ollama:11434/api/generate
```
Some explanations:
- `nixery.dev` = automatically generates images with [Nixery]
- `-c` = concurrent requests
- `-n` = total number of requests
- `-t` = timeout in seconds
This is probably going to take (literally) a minute.
[Nixery]: https://nixery.dev/
---
## Performance analysis
- Let's start an interactive container with `hey`
(e.g., use the `alpine` image, then `apk add hey`)
- Try 10 requests, with a concurrency of 1/2/4
- Meanwhile, check the logs of the `ollama` pod
- Some results (your results may vary depending on CPU, random seed...):
- 1 = 0.08 reqs/s, average latency: 12s
- 2 = 0.10 reqs/s, average latency: 18s
- 4 = 0.12 reqs/s, average latency: 28s
- Higher concurrency = slightly higher throughput, much higher latency
🤔 We need metrics!


@@ -0,0 +1,273 @@
# Adding metrics
We want multiple kinds of metrics:
- instantaneous pod and node resource usage
- historical resource usage (=graphs)
- request duration
---
## 1⃣ Instantaneous resource usage
- We're going to use metrics-server
- Check if it's already installed:
```bash
kubectl top nodes
```
- If we see a list of nodes, with CPU and RAM usage:
*great, metrics-server is installed!*
- If we see `error: Metrics API not available`:
*metrics-server isn't installed, so we'll install it!*
---
## Installing metrics-server
- In a lot of places, this is done with a little bit of custom YAML
(derived from the [official installation instructions](https://github.com/kubernetes-sigs/metrics-server#installation))
- We can also use a Helm chart:
```bash
helm upgrade --install metrics-server metrics-server \
--create-namespace --namespace metrics-server \
--repo https://kubernetes-sigs.github.io/metrics-server/ \
--set args={--kubelet-insecure-tls=true}
```
- The `args` flag specified above should be sufficient on most clusters
- After a minute, `kubectl top nodes` should show resource usage
---
## 2⃣ Historical resource usage
- We're going to use Prometheus (specifically: kube-prometheus-stack)
- This is a Helm chart bundling:
- Prometheus
- multiple exporters (node, kube-state-metrics...)
- Grafana
- a handful of Grafana dashboards
- Open Source
- Commercial alternatives: Datadog, New Relic...
---
## Installing kube-prometheus-stack
We're going to expose both Prometheus and Grafana with a NodePort:
```bash
helm upgrade --install --repo https://prometheus-community.github.io/helm-charts \
promstack kube-prometheus-stack \
--namespace prom-system --create-namespace \
--set prometheus.service.type=NodePort \
--set grafana.service.type=NodePort \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
#
```
This chart installation can take a while (up to a couple of minutes).
---
class: extra-details
## `...NilUsesHelmValues=false` ???
- kube-prometheus-stack uses the "Prometheus Operator"
- To configure "scrape targets", we create PodMonitor or ServiceMonitor resources
- By default, the Prometheus Operator will only look at \*Monitors with the right labels
- Our extra options mean "use all the Monitors that you will find!"
---
## Connecting to Grafana
Check the NodePort allocated to Grafana:
```bash
kubectl get service promstack-grafana --namespace prom-system
```
Get the public address of one of our nodes:
```bash
kubectl get nodes -o wide
```
Connect to the public address of a node, on the node port.
The default login and password are `admin` / `prom-operator`.
Check the dashboard "Kubernetes / Compute Resources / Namespace (Pods)".
Select a namespace and see the CPU and RAM usage for the pods in that namespace.
---
## 3⃣ Request duration
- Unfortunately, as of November 2024, ollama doesn't expose metrics
(there is ongoing discussion about it: [issue 3144][3144], [PR 6537][6537])
- There are some [garbage AI-generated blog posts claiming otherwise][garbage]
(but it's AI-generated, so it bears no connection to truth whatsoever)
- So, what can we do?
[3144]: https://github.com/ollama/ollama/issues/3144#issuecomment-2153184254
[6537]: https://github.com/ollama/ollama/pull/6537
[garbage]: https://www.arsturn.com/blog/setting-up-ollama-prometheus-metrics
---
## HAProxy to the rescue
- HAProxy is a proxy that can handle TCP, HTTP, and more
- It can expose detailed Prometheus metrics about HTTP requests
- The plan: add a sidecar HAProxy to each Ollama container
- For that, we need to give up on the Ollama Helm chart
(and go back to basic manifests)
---
## 🙋 Choose your adventure
Do we want to...
- write all the corresponding manifests?
- look at pre-written manifests and explain how they work?
- apply the manifests and carry on?
---
## 🏗️ Let's build something!
- If you have created Deployments / Services: clean them up first!
- Deploy Ollama with a sidecar HAProxy (sample configuration on next slide)
- Run a short benchmark campaign
(e.g. scale to 4 pods, try 4/8/16 parallel requests, 2 minutes each)
- Check live resource usage with `kubectl top nodes` / `kubectl top pods`
- Check historical usage with the Grafana dashboards
(for HAProxy metrics, you can use [Grafana dashboard 12693, HAProxy 2 Full][grafana-12693])
- If you don't want to write the manifests, you can use [these ones][ollama-yaml]
[grafana-12693]: https://grafana.com/grafana/dashboards/12693-haproxy-2-full/
[ollama-yaml]: https://github.com/jpetazzo/beyond-load-balancers/tree/main/ollama
---
```
global
#log stdout format raw local0
#daemon
maxconn 32
defaults
#log global
timeout client 1h
timeout connect 1h
timeout server 1h
mode http
`option abortonclose`
frontend metrics
bind :9000
http-request use-service prometheus-exporter
frontend ollama_frontend
bind :8000
default_backend ollama_backend
`maxconn 16`
backend ollama_backend
server ollama_server localhost:11434 check
```
---
class: extra-details
## ⚠️ Connection queues
- HAProxy will happily queue *many* connections
- If a client sends a request, then disconnects:
- the request stays in the queue
- the request gets processed by the backend
- eventually, when the backend starts sending the reply, the connection is closed
- This can result in a backlog of queries that takes a long time to clear
- To avoid that: `option abortonclose` (see [HAProxy docs for details][abortonclose])
- Note that the issue is less severe when replies are streamed
[abortonclose]: https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4-option%20abortonclose
---
class: extra-details
## Ad-hoc HAProxy dashboard
- To consolidate all frontend and backend queues on a single graph:
- query: `haproxy_frontend_current_sessions`
- legend: `{{namespace}}/{{pod}}/{{proxy}}`
- options, "Color scheme", select "Classic palette (by series name)"
---
## What do we see?
- Imperfect load balancing
- Some backends receive more requests than others
- Sometimes, some backends are idle while others are busy
- However, CPU utilization on the node is maxed out
- This is because our node is oversubscribed
- This is because we didn't specify resource requests/limits (yet)
(we'll do that later!)

slides/k8s/ollama-reqlim.md Normal file

@@ -0,0 +1,155 @@
## Setting resource requests and limits
- Thanks to *requests*:
- our pods will have resources *reserved* for them
- we won't pack too many pods on a single node
- cluster autoscaling will trigger when needed (if possible!)
- Thanks to *limits*:
- our pods won't use more than a given amount of resources
- they won't use up all the available resources on the node
- behavior will be more consistent between loaded and unloaded state
---
## Memory
- Personal advice: set request and limit to the same value
- Check current or historical usage and add a bit of padding
(the more historical data we have, the less padding we need)
- Consider 10% padding for "dataless" pods, more for pods with data
(so that the pod has "reserves" for page cache usage)
⚠️ Pods hitting their memory limit will be **killed!**
---
## CPU
- It's not necessary to set requests and limits to the same value
(this would cause a lot of waste for idle workloads)
- Let's see a few possible strategies!
---
## CPU for mostly idle pods
E.g.: web services, workers handling very few requests...
- Set the limit to at least one whole core
(to avoid throttling, especially on bursty workloads)
- Requests can be very low (e.g. 0.1 core)
⚠️ If requests are too low and the node is very loaded,
the pod will slow down significantly!
(Because CPU cycles are allocated proportionally to CPU requests.)
---
## Inelastic CPU-hungry pods
- Pods with a fixed number of threads:
*set requests and limits to that number of threads*
- Pods where a specific level of performance needs to be guaranteed:
*set requests and limits to the number of cores providing that performance*
⚠️ If you set limits to higher levels, performance will be unpredictable!
(You'll get good performance when the node has extra cycles.)
---
## Elastic CPU-hungry pods
- Pods that could potentially use all the cores
(e.g. machine learning training and inference, depending on the models)
- Decide how many pods per node you want to pack
- Set CPU requests as a fraction of the number of cores of the nodes
(minus some padding)
- Example:
- nodes with 32 cores
- we want 4 pods per node
- CPU request: 7.5 cores
- Set limits to a higher level (up to node size)
---
## In practice
- Check memory usage of our Ollama pods:
```bash
kubectl top pods
```
(Or even better, look at historical usage in Prometheus or Grafana!)
- Check how many cores we have on our nodes:
```bash
kubectl get nodes -o json | jq .items[].status.capacity.cpu
kubectl get nodes -o custom-columns=NAME:metadata.name,CPU:status.capacity.cpu
```
- Let's decide that we want two Ollama pods per node
- What requests/limits should we set?
---
## Setting resources for Ollama
- Assumptions:
- we want two pods per node
- each pod uses ~1500MiB RAM
- nodes have 4 cores
- We'll set memory requests and limits to 2G
- We'll set CPU requests to 1.5 (4 cores / 2 pods, minus padding)
- We'll set CPU limits to twice the requests
```bash
kubectl set resources deployment ollama \
--requests=cpu=1.5,memory=2G \
--limits=cpu=3,memory=2G
```
⚠️ If you have an HAProxy side car, this will set its resources too!
---
## Results
- After setting these resource requests, we should see cluster autoscaling
- If not: scale up the Ollama Deployment to at least 3 replicas
- Check cluster autoscaler status with:
```bash
kubectl describe configmap --namespace kube-system cluster-autoscaler-status
```


@@ -0,0 +1,210 @@
# Message Queue Architecture
There are (at least) three ways to distribute load:
- load balancers
- batch jobs
- message queues
Let's do a quick review of their pros/cons!
---
## 1⃣ Load balancers
<pre class="mermaid">
flowchart TD
Client["Client"] ---> LB["Load balancer"]
LB ---> B1["Backend"] & B2["Backend"] & B3["Backend"]
</pre>
---
## Load balancers
- Latency: ~milliseconds (network latency)
- Overhead: very low (one extra network hop, one log message?)
- Great for short requests (a few milliseconds to a minute)
- Supported out of the box by the Kubernetes Service Proxy
(by default, this is `kube-proxy`)
- Suboptimal resource utilization due to imperfect balancing
(especially when there are multiple load balancers)
---
## 2⃣ Batch jobs
<pre class="mermaid">
flowchart TD
subgraph K["Kubernetes Control Plane"]
J1["Job"]@{ shape: card}
J2["Job"]@{ shape: card}
J3["..."]@{ shape: text}
J4["Job"]@{ shape: card}
end
C["Client"] ---> K
K <---> N1["Node"] & N2["Node"] & N3["Node"]
</pre>
---
## Batch jobs
- Latency: a few seconds (many Kubernetes controllers involved)
- Overhead: significant due to all the moving pieces involved
(job controller, scheduler, kubelet; many writes to etcd and logs)
- Great for long requests (a few minutes to a few days)
- Supported out of the box by Kubernetes
(`kubectl create job hello --image alpine -- sleep 60`)
- Asynchronous processing requires some refactoring
(we don't get the response immediately)
---
## 3⃣ Message queues
<pre class="mermaid">
flowchart TD
subgraph Q["Message queue"]
M1["Message"]@{ shape: card}
M2["Message"]@{ shape: card}
M3["..."]@{ shape: text}
M4["Message"]@{ shape: card}
end
C["Client"] ---> Q
Q <---> W1["Worker"] & W2["Worker"] & W3["Worker"]
</pre>
---
## Message queues
- Latency: a few milliseconds to a few seconds
- Overhead: intermediate
(very low with e.g. Redis, higher with e.g. Kafka)
- Great for all except very short requests
- Requires additional setup
- Asynchronous processing requires some refactoring
---
## Dealing with errors
- Load balancers
- errors reported immediately (client must retry)
- some load balancers can retry automatically
- Batch jobs
- Kubernetes retries automatically
- after `backoffLimit` retries, Job is marked as failed
- Message queues
- some queues have a concept of "acknowledgement"
- some queues have a concept of "dead letter queue"
- some extra work is required
---
## Some queue brokers
- Redis (with e.g. RPUSH, BLPOP)
*light, fast, easy to set up... no durability guarantee, no acknowledgement, no dead letter queue*
- Kafka
*heavy, complex to set up... strong delivery guarantees, full featured*
- RabbitMQ
*somewhat in-between Redis and Kafka*
- SQL databases
*often requires polling, which adds extra latency; not as scalable as a "true" broker*
---
## More queue brokers
Many cloud providers offer hosted message queues (e.g.: Amazon SQS).
These are usually great options, with some drawbacks:
- vendor lock-in
- setting up extra environments (testing, staging...) can be more complex
(Setting up a singleton environment is usually very easy, thanks to web UI, CLI, etc.; setting up extra environments and assigning the right permissions with e.g. IaC is usually significantly more complex.)
---
## Implementing a message queue
1. Pick a broker
2. Deploy the broker
3. Set up the queue
4. Refactor our code
---
## Code refactoring (client)
Before:
```python
response = http.POST("http://api", payload=Request(...))
```
After:
```python
client = queue.connect(...)
client.publish(message=Request(...))
```
Note: we don't get the response right away (if at all)!
---
## Code refactoring (server)
Before:
```python
server = http.server(request_handler=handler)
server.listen("80")
server.run()
```
After:
```python
client = queue.connect(...)
while True:
message = client.consume()
response = handler(message)
# Write the response somewhere
```

slides/mlops.yml Normal file

@@ -0,0 +1,44 @@
title: |
Asynchronous Architecture Patterns To Scale ML and Other High Latency Workloads on Kubernetes
#chat: "[Slack](https://dockercommunity.slack.com/messages/C7GKACWDV)"
#chat: "[Gitter](https://gitter.im/jpetazzo/workshop-yyyymmdd-city)"
chat: "In person!"
gitrepo: github.com/jpetazzo/container.training
slides: https://FIXME.container.training/
#slidenumberprefix: "#SomeHashTag &mdash; "
exclude:
- self-paced
content:
- shared/title.md
- logistics.md
- shared/about-slides.md
#- shared/chat-room-im.md
#- shared/chat-room-slack.md
#- shared/chat-room-zoom-meeting.md
#- shared/chat-room-zoom-webinar.md
- k8s/prereqs-advanced.md
- k8s/handson-mlops.md
- shared/connecting.md
- k8s/mlops-headsup.md
- shared/toc.md
-
- k8s/ollama-intro.md
- k8s/ollama-metrics.md
- k8s/queue-architecture.md
- k8s/bento-intro.md
-
- k8s/resource-limits.md
- k8s/cluster-autoscaler.md
- k8s/ollama-reqlim.md
- k8s/bento-hpa.md
- k8s/bento-rmq.md
- k8s/bento-cnpg.md
- k8s/helmfile.md
- shared/thankyou.md
- shared/contact.md

slides/shared/contact.md Normal file

@@ -0,0 +1,54 @@
<table>
<tr>
<td style="vertical-align: sub; background: initial;">
<pre style="padding: 40px; font-size: 16px; line-height: 18px;">
█▀▀▀▀▀█ ▀▀▀█▄▀ ▀▄ ▀▄ ▀▄ ▄█▀ ▄ █▀▀▀▀▀█
█ ███ █ ▀▄█ ▀▀▄█ ▄▀▀ ██▄▄ █ ███ █
█ ▀▀▀ █ ▄▀█▀ █▀▀▀█ ▄█▀▄███ ▄ █ ▀▀▀ █
▀▀▀▀▀▀▀ █▄▀ █▄█ ▀ █ █ ▀▄█▄▀ █ ▀▀▀▀▀▀▀
▀▀ █▀▄▀ ▀▄ ▀▀█▄▄█▄▄ ▄▄▄ █▀ ▀▄▄ ▄▀
▄█▄▀▄▀▀██▀ ▀▀██▄█ ▀▀▄█ ██▀ █▄█▀█▀▀
▄ ▄▀▀ ▀ ▀█▀ ▄█▄▀▄▀ ▀ █ █ █▄▄▀▀▀▀▄█▄█▀
█ ▀▀█▄▀▀█▀█ ▄▀ ▀▀ █▀▄ ▀▄ ██▄▀ ▄█ ▄▀█
█▄▀▀▀ ▀▀ ███▀█▀▄ ▄▄█ ██ █▀▄▀▄ █▀▀▀
▄ █▀▄▀ ▄▀ ▄▀▄ ██ ▀▀█ ▄█ █▀▀▄█▀ ▄ █
█▀▀▄▄ ▀ ▀ ▀▀█ ▀▀▀ ▀▀ █▀██▄▀▀▀███▄█▀
█▀█▀▄█▀██ ██ ▀ █▄█▀ ▀ ██▀ ██▄ █▄█▄▄█
█▀█▀▄▄▀▀▀▄▀▄▀ ▄█ ▄▀█ ▄▀▄ █▄ ▀▀▄█▄▄▀
█▀█▄█ ▀ ▀▀▄█▀ █▄▀ █ ▄ ▄▀▄█ █▄▄█▄▄▀█
▀ ▀▀ ▀▀█▄ ▀ ▀ ▄▄███▄ ▄ █▀▀▀█▀██
█▀▀▀▀▀█ ▀██ █ █▀▀ ▀█▀██▄█▀▄█ ▀ █▄ ▄▀
█ ███ █ █▄██▀ ▀▄▀▀▄█▀ ▄▄▀██▀▀▀█▀▀ ▄ ▀
█ ▀▀▀ █ ▄█▀▀▀▀▄▀▄▄█ ▄▀█▀▄ ▀ ▀█ █▄█
▀▀▀▀▀▀▀ ▀▀ ▀▀ ▀ ▀ ▀ ▀ ▀ ▀ ▀ ▀
</pre>
.center[
👆
Please fill this [feedback form](https://docs.google.com/forms/d/e/1FAIpQLScYloWur4uVhKgVNIdUrfHZ8pk_mBmPcQwmbhjK2FlR9KWDCA/viewform).
Thank you! 🫶
]
</td>
<td style="vertical-align: sub; background: initial;">
Contact information:
📛 Jérôme Petazzoni
<br/>
📩 jerome.petazzoni@gmail.com
<br/>
🔗 https://linkedin.com/in/jpetazzo
<br/>
🦣 https://hachyderm.io/@jpetazzo
I can teach custom courses!<br/>
→ Docker, Kubernetes, MLOps<br/>
→ from intro level to "black belt"<br/>
→ on site or remotely<br/>
Reach out if you're interested!
</td>
</tr>
</table>