Add Terraform workshop with GKE and node pools

Jérôme Petazzoni
2022-01-17 00:00:49 +01:00
parent de0ad83686
commit 69c7ac2371
9 changed files with 696 additions and 0 deletions


@@ -0,0 +1,9 @@
## Exercise — Terraform Node Pools
- Write a Terraform configuration to deploy a cluster
- The cluster should have two node pools with autoscaling
- Deploy two apps, each using exclusively one node pool
- Bonus: deploy an app balanced across both node pools


@@ -0,0 +1,69 @@
# Exercise — Terraform Node Pools
- Write a Terraform configuration to deploy a cluster
- The cluster should have two node pools with autoscaling
- Deploy two apps, each using exclusively one node pool
- Bonus: deploy an app balanced across both node pools
---
## Cluster deployment
- Write a Terraform configuration to deploy a cluster
- We want to have two node pools with autoscaling
- Example for sizing:
- 4 GB / 1 CPU per node
- pools of 1 to 4 nodes
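One possible shape for such a pool, sketched with the `google` provider (the cluster reference and the `e2-medium` machine type — 2 vCPUs / 4 GB, close to the suggested sizing — are assumptions, not a prescribed solution):
```tf
# Hypothetical sketch of one of the two autoscaling pools (1 to 4 nodes).
resource "google_container_node_pool" "pool1" {
  name               = "pool1"
  cluster            = google_container_cluster.mycluster.id  # assumed cluster resource
  initial_node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 4
  }
  node_config {
    machine_type = "e2-medium"  # 2 vCPUs / 4 GB; adjust to taste
  }
}
```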
---
## Cluster autoscaling
- Deploy an app on the cluster
(you can use `nginx`, `jpetazzo/color`...)
- Set a resource request (e.g. 1 GB RAM)
- Scale up and verify that the autoscaler kicks in
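If you want to drive the test app from Terraform too, here is a rough sketch using the `kubernetes` provider (assuming it is configured to point at the cluster; the app name is made up, and `kubectl create deployment` + `kubectl scale` works just as well):
```tf
# Hypothetical test app with a 1 GB memory request per pod.
resource "kubernetes_deployment" "testapp" {
  metadata {
    name = "testapp"
  }
  spec {
    replicas = 1  # bump this and re-apply to trigger a scale-up
    selector {
      match_labels = { app = "testapp" }
    }
    template {
      metadata {
        labels = { app = "testapp" }
      }
      spec {
        container {
          name  = "color"
          image = "jpetazzo/color"
          resources {
            requests = { memory = "1Gi" }
          }
        }
      }
    }
  }
}
```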
---
## Pool isolation
- We want to deploy two apps
- The first app should be deployed exclusively on the first pool
- The second app should be deployed exclusively on the second pool
- Check the next slide for hints!
---
## Hints
- One solution involves adding a `nodeSelector` to the pod templates
- Another solution involves adding:
- `taints` to the node pools
- matching `tolerations` to the pod templates
---
## Balancing
- Step 1: make sure that the pools are not balanced
- Step 2: deploy a new app, check that it goes to the emptiest pool
- Step 3: update the app so that it balances (as much as possible) between pools
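One way to express step 3 is a topology spread constraint keyed on the node pool label; the fragment below is a sketch (it goes inside the pod template `spec` of a `kubernetes_deployment`, and assumes GKE's `cloud.google.com/gke-nodepool` node label and an app labeled `app=balanced`):
```tf
# Hypothetical fragment: spread replicas across node pools as evenly as
# possible, but still schedule if one pool cannot take more pods.
topology_spread_constraint {
  max_skew           = 1
  topology_key       = "cloud.google.com/gke-nodepool"
  when_unsatisfiable = "ScheduleAnyway"
  label_selector {
    match_labels = { app = "balanced" }
  }
}
```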

slides/terraform/intro.md Normal file

@@ -0,0 +1,70 @@
# Terraform
“An open-source **infrastructure as code** software tool created by HashiCorp¹.”
- Other products in that space: Pulumi, CloudFormation...
- Very rich ecosystem
- Supports many cloud providers
.footnote[¹Also creators of Consul, Nomad, Packer, Vagrant, Vault...]
---
## Infrastructure as code?
1. Write configuration files that describe resources, e.g.:
- some GKE and Kapsule Kubernetes clusters
- some S3 buckets
- a bunch of Linode/Digital Ocean instances
- ...and more
2. Run `terraform apply` to create all these things
3. Make changes to the configuration files
4. Run `terraform apply` again to create/update/delete resources
5. Run `terraform destroy` to delete all these things
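For instance, a configuration file could contain a single resource like this (the bucket name is made up for the example):
```tf
# Hypothetical main.tf: "terraform apply" creates this bucket,
# editing the block and re-applying updates it,
# "terraform destroy" deletes it.
resource "google_storage_bucket" "assets" {
  name     = "my-unique-bucket-name-42"
  location = "EU"
}
```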
---
## What Terraform *is not*
- It's not a tool to abstract the differences between cloud providers
(“I want to move my AWS workloads to Scaleway!”)
- It's not a configuration management tool
(“I want to install and configure packages on my servers!”)
- It's not an application deployment tool
(“I want to deploy a new build of my app!”)
- It can be used for these things anyway (more or less successfully)
---
## Vocabulary
- Configuration = a set of Terraform files
- typically in HCL (HashiCorp Configuration Language), `.tf` extension
- can also be JSON
- Resource = a thing that will be managed by Terraform
- e.g. VM, cluster, load balancer...
- Provider = plugin to manage a family of resources
- example: `google` provider to talk with GCP APIs
- example: `tls` provider to generate keys
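Providers are declared (and can be version-pinned) in a `terraform` block; a minimal sketch — optional for official providers like `google`, but good practice (the version constraint here is just an example):
```tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"  # example constraint; pin whatever is current for you
    }
  }
}
```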


@@ -0,0 +1,148 @@
## Node pools on GKE
⚠️ Disclaimer
I do not pretend to fully know and understand GKE's concepts and APIs.
I do not know their rationales and underlying implementations.
The techniques that I'm going to explain here work for me, but there
might be better ones.
---
## The default node pool
- Defined within the `google_container_cluster` resource
- Uses `node_config` block and `initial_node_count`
- If it's defined, it should be the only node pool!
- Disable it with either:
`initial_node_count=1` and `remove_default_node_pool=true`
*or*
a dummy `node_pool` block and a `lifecycle` block to ignore changes to the `node_pool`
---
class: extra-details
## What's going on with the node pools?
When we run `terraform apply` (or, more accurately, `terraform plan`)...
- Terraform invokes the `google` provider to enumerate resources
- the provider lists the clusters and node pools
- it includes the node pools in the cluster resources
- ...even if they are declared separately
- Terraform notices these "new" node pools and wants to remove them
- we can tell Terraform to ignore these node pools with a `lifecycle` block
- I *think* that `remove_default_node_pool` achieves the same result 🤔
---
## Our new cluster resource
```tf
resource "google_container_cluster" "mycluster" {
  name     = "klstr"
  location = "europe-north1-a"
  # We won't use that node pool but we have to declare it anyway.
  # It will remain empty so we don't have to worry about it.
  node_pool {
    name = "builtin"
  }
  lifecycle {
    ignore_changes = [ node_pool ]
  }
}
```
---
## Our normal node pool
```tf
resource "google_container_node_pool" "ondemand" {
  name    = "ondemand"
  cluster = google_container_cluster.mycluster.id
  autoscaling {
    # This pool is allowed to scale all the way down to zero nodes.
    min_node_count = 0
    max_node_count = 5
  }
  node_config {
    preemptible = false
  }
}
```
---
## Our preemptible node pool
```tf
resource "google_container_node_pool" "preemptible" {
  name               = "preemptible"
  cluster            = google_container_cluster.mycluster.id
  initial_node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  node_config {
    preemptible = true
  }
}
```
---
## Scale to zero
- It is possible to scale a single node pool to zero
- The cluster autoscaler will be able to scale up an empty node pool
(and scale it back down to zero when it's not needed anymore)
- However, our cluster must have at least one node
(the cluster autoscaler can't/won't work if we have zero nodes)
- Make sure that at least one pool has at least one node!
---
## Taints and labels
- We will typically use node selectors and tolerations to schedule pods
- The corresponding labels and taints must be set on the node pools
```tf
resource "google_container_node_pool" "bignodes" {
  ...
  node_config {
    machine_type = "n2-standard-4"
    labels = {
      expensive = ""
    }
    taint {
      key    = "expensive"
      value  = ""
      effect = "NO_SCHEDULE"
    }
  }
}
```
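For reference, the matching pod-side settings, sketched with the `kubernetes` provider (the same thing can be done in a YAML manifest); this fragment would go inside the pod template `spec` of a `kubernetes_deployment`:
```tf
# Hypothetical fragment: only run on (and tolerate) the "expensive" nodes above.
node_selector = {
  expensive = ""
}
toleration {
  key      = "expensive"
  operator = "Exists"
  effect   = "NoSchedule"
}
```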


@@ -0,0 +1,125 @@
## Saving (lots of) money
- Our load (number and size of pods) is probably variable
- We need *cluster autoscaling*
(add/remove nodes as we need them, pay only for what we use)
- We might need nodes of different sizes
(or with specialized hardware: local fast disks, GPUs...)
- If possible, we should leverage "spot" or "preemptible" capacity
(VMs that are significantly cheaper but can be terminated on short notice)
---
## Node pools
- We will have multiple *node pools*
- A node pool is a set of nodes running in a single zone
- The nodes usually¹ have the same size
- They have the same "preemptability"
(i.e. a node pool is either "on-demand" or "preemptible")
- The Kubernetes cluster autoscaler is aware of the node pools
- When it scales up the cluster, it decides which pool(s) to scale up
.footnote[¹On AWS EKS, node pools map to ASGs, which can have mixed instance types.]
---
## Example: big batch
- Every few days, we want to process a batch made of thousands of jobs
- Each job requires lots of RAM (10+ GB) and takes hours to complete
- We want to process the batch as fast as possible
- We don't want to pay for nodes when we don't use them
- Solution:
- one node group with tiny nodes for basic cluster services
- one node group with huge nodes for batch processing
- that second node group "scales to zero"
---
## Gotchas
- Make sure that long-running pods *never* run on big nodes
(use *taints* and *tolerations*)
- Keep an eye on preemptions
(especially on very long jobs taking 10+ hours or even days)
---
## Example: mixed load
- Running a majority of stateless apps
- We want to reduce overall cost (target: 25-50%)
- We can accept occasional small disruptions (performance degradations)
- Solution:
- one node group with "on demand" nodes
- one node group with "spot" / "preemptible" nodes
- pin stateful apps to "on demand" nodes
- *try* to balance stateless apps between the two pools
---
## Gotchas
- We can tell the Kubernetes scheduler to *prefer* balancing across pools
- We don't have a way to *require* it
- What should we do if balancing isn't possible?
(e.g. if spot capacity is unavailable)
- In practice, preemption can be very rare
- This means big savings, but we should have a "plan B" just in case
(perhaps think about which services can tolerate a rare outage)
---
## In practice
- Most managed Kubernetes providers give us ways to create multiple node pools
- Sometimes the pools are declared as *blocks* within the cluster resources
- pros: simpler, sometimes faster to provision
- cons: changing the pool configuration generally forces re-creation of the cluster
- Sometimes the pools are declared as independent resources
- pros: can add/remove/change pools without destroying the cluster
- cons: more complex
- Most providers recommend declaring the pools independently


@@ -0,0 +1,120 @@
## GKE quick start
- Install Terraform and GCP SDK (`gcloud`)
- Authenticate with `gcloud auth login`
- Create a project or use one of your existing ones
- Set the `GOOGLE_PROJECT` env var to the project name
(this will be used by Terraform)
Note 1: there must be a billing account linked to the project.
Note 2: if the required APIs are not enabled on the project,
we will get "please enable that API" error messages the first
time we use them. These messages should include instructions
for this one-time setup.
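Alternatively, the project (and a default region/zone) can be set in a provider block instead of the environment; a sketch with a made-up project ID:
```tf
provider "google" {
  project = "my-workshop-project"  # hypothetical; use your own project ID
  region  = "europe-north1"
  zone    = "europe-north1-a"
}
```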
---
## Create configuration
- Create an empty directory
- Create a bunch of `.tf` files as shown on the next slides
(feel free to adjust the values!)
---
## Configuring providers
- We'll use the [google provider](https://registry.terraform.io/providers/hashicorp/google)
- It's an official provider (maintained by `hashicorp`)
- Which means that we don't have to add it explicitly to our configuration
(`terraform init` will take care of it automatically)
- That'll simplify our "getting started" experience a tiny bit!
---
## `cluster.tf`
```tf
resource "google_container_cluster" "mycluster" {
  name               = "klstr"
  location           = "europe-north1-a"
  initial_node_count = 2
}
```
- [`google_container_cluster`](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster) is the *type* of the resource
- `mycluster` is the internal Terraform name for that resource
(useful if we have multiple resources of that type)
- `location` can be a *zone* or a *region* (see next slides for details)
- don't forget `initial_node_count` otherwise we get a zero-node cluster 🙃
---
## Regional vs Zonal vs Multi-Zonal
- If `location` is a zone, we get a "zonal" cluster
(control plane and nodes are in a single zone)
- If `location` is a region, we get a "regional" cluster
(control plane and nodes span all zones in this region)
- In a region with Z zones, if we say we want N nodes...
...we get Z×N nodes
- We can also set `location` to be a zone, and set additional `node_locations`
- We get a "multi-zonal" cluster with control plane in a single zone
---
## Create the cluster
- Initialize providers
```bash
terraform init
```
- Create the cluster:
```bash
terraform apply
```
- We'll explain later what that "plan" thing is; just approve it for now!
- Check what happens if we run `terraform apply` again
---
## Now what?
- Let's connect to the cluster
- Get the credentials for the cluster:
```bash
gcloud container clusters get-credentials klstr --zone=europe-north1-a
```
(Adjust the `zone` if you changed it earlier!)
- This will add the cluster to our `kubeconfig` file
- Deploy a simple app to the cluster
🎉
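If you'd rather have Terraform itself deploy that app (like the `kubernetes` provider sketches elsewhere in these slides), the provider can be wired to the cluster we just created; a rough sketch:
```tf
data "google_client_config" "default" {}

provider "kubernetes" {
  host  = "https://${google_container_cluster.mycluster.endpoint}"
  token = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(
    google_container_cluster.mycluster.master_auth[0].cluster_ca_certificate
  )
}
```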


@@ -0,0 +1,25 @@
## Quick start
- We're going to use Terraform to deploy a basic Kubernetes cluster
- We will need cloud credentials (make sure you have a valid cloud account!)
---
## Steps
1. Install Terraform (download single Go binary)
2. Configure credentials (e.g. `gcloud auth login`)
3. Create Terraform *configuration*
4. Add *providers* to the configuration
5. Initialize providers with `terraform init`
6. Add *resources* to the configuration
7. Realize the resources with `terraform apply`
8. Repeat 6-7 or 4-5-6-7

slides/terraform/state.md Normal file

@@ -0,0 +1,59 @@
## State
- Terraform keeps track of the *state*
- Resources created by Terraform are added to the state
- When we run Terraform, it will:
- *refresh* the state (check if resources have changed since last time it ran)
- generate a *plan* (decide which actions need to be taken)
- ask confirmation (this can be skipped)
- *apply* that plan
---
## Remote state
- By default, the state is stored in `terraform.tfstate`
- This is a JSON file (feel free to inspect it!)
- The state can also be stored in a central place
(e.g. cloud object store, Consul, etcd...)
- This is more convenient when working as a team
- It also requires *locking*
(to prevent concurrent modifications)
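For instance, with a GCS bucket (the bucket name is made up; the GCS backend also handles locking for us):
```tf
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"  # hypothetical; must exist before "terraform init"
    prefix = "workshop/cluster"
  }
}
```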
---
## Working with remote state
- This is beyond the scope of this workshop
- Note that if a Terraform configuration defines e.g. an S3 bucket to store its state...
...that configuration cannot create that S3 bucket!
- The bucket must be created beforehand
(Terraform won't be able to run until the bucket is available)
---
## Manipulating state
- `terraform state list` (list the resources tracked in the state)
- `terraform state show google_container_cluster.mycluster` (show the attributes of one resource)
- `terraform state rm` (remove a resource from the state without destroying it)
- `terraform import` (bring an existing resource under Terraform management)


@@ -0,0 +1,71 @@
## Variables
- At this point, we are probably:
- duplicating a lot of information (e.g. zone, number of nodes...)
- hard-coding a lot of things as well (ditto!)
- Let's see how we can do better!
---
## [Input variables](https://www.terraform.io/language/values/variables)
Declaring an input variable:
```tf
variable "location" {
  type    = string
  default = "europe-north1-a"
}
```
Using an input variable:
```tf
resource "google_container_cluster" "mycluster" {
  location = var.location
  ...
}
```
---
## Setting variables
Input variables can be set with:
- environment variables (`export TF_VAR_location=us-west1`)
- a file named `terraform.tfvars` (`location = "us-west1"`)
- a file named `terraform.tfvars.json`
- files named `*.auto.tfvars` and `*.auto.tfvars.json`
- command-line literal values (`-var location=us-west1`)
- command-line file names (`-var-file carbon-neutral.tfvars`)
Sources further down this list take precedence over the ones above.
---
## [Local values](https://www.terraform.io/language/values/locals)
Declaring and setting a local value:
```tf
locals {
  location = var.location != null ? var.location : "europe-north1-a"
  region   = replace(local.location, "/-[a-z]$/", "")
}
```
We can have multiple `locals` blocks.
Using a local value:
```tf
resource "google_container_cluster" "mycluster" {
  location = local.location
  ...
}
```