diff --git a/slides/exercises/tf-nodepools-brief.md b/slides/exercises/tf-nodepools-brief.md
new file mode 100644
index 00000000..8bf86428
--- /dev/null
+++ b/slides/exercises/tf-nodepools-brief.md
@@ -0,0 +1,9 @@
+## Exercise — Terraform Node Pools
+
+- Write a Terraform configuration to deploy a cluster
+
+- The cluster should have two node pools with autoscaling
+
+- Deploy two apps, each running exclusively on one node pool
+
+- Bonus: deploy an app balanced across both node pools
diff --git a/slides/exercises/tf-nodepools-details.md b/slides/exercises/tf-nodepools-details.md
new file mode 100644
index 00000000..0d51a04d
--- /dev/null
+++ b/slides/exercises/tf-nodepools-details.md
@@ -0,0 +1,69 @@
+# Exercise — Terraform Node Pools
+
+- Write a Terraform configuration to deploy a cluster
+
+- The cluster should have two node pools with autoscaling
+
+- Deploy two apps, each running exclusively on one node pool
+
+- Bonus: deploy an app balanced across both node pools
+
+---
+
+## Cluster deployment
+
+- Write a Terraform configuration to deploy a cluster
+
+- We want to have two node pools with autoscaling
+
+- Suggested sizing:
+
+  - 4 GB / 1 CPU per node
+
+  - pools of 1 to 4 nodes
+
+---
+
+## Cluster autoscaling
+
+- Deploy an app on the cluster
+
+  (you can use `nginx`, `jpetazzo/color`...)
+
+- Set a resource request (e.g. 1 GB RAM)
+
+- Scale up the app and verify that the autoscaler kicks in
+
+---
+
+## Pool isolation
+
+- We want to deploy two apps
+
+- The first app should be deployed exclusively on the first pool
+
+- The second app should be deployed exclusively on the second pool
+
+- Check the next slide for hints!
+
+---
+
+## Hints
+
+- One solution involves adding a `nodeSelector` to the pod templates
+
+- Another solution involves adding:
+
+  - `taints` to the node pools
+
+  - matching `tolerations` to the pod templates
+
+---
+
+## Balancing
+
+- Step 1: make sure that the pools are not balanced
+
+- Step 2: deploy a new app, check that it goes to the emptiest pool
+
+- Step 3: update the app so that it balances (as much as possible) between pools
diff --git a/slides/terraform/intro.md b/slides/terraform/intro.md
new file mode 100644
index 00000000..95c56318
--- /dev/null
+++ b/slides/terraform/intro.md
@@ -0,0 +1,70 @@
+# Terraform
+
+“An open-source **infrastructure as code** software tool created by HashiCorp¹.”
+
+- Other products in that space: Pulumi, CloudFormation...
+
+- Very rich ecosystem
+
+- Supports many cloud providers
+
+.footnote[¹Also creators of Consul, Nomad, Packer, Vagrant, Vault...]
+
+---
+
+## Infrastructure as code?
+
+1. Write configuration files that describe resources, e.g.:
+
+   - some GKE and Kapsule Kubernetes clusters
+   - some S3 buckets
+   - a bunch of Linode/DigitalOcean instances
+   - ...and more
+
+2. Run `terraform apply` to create all these things
+
+3. Make changes to the configuration files
+
+4. Run `terraform apply` again to create/update/delete resources
+
+5. Run `terraform destroy` to delete all these things
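+
+---
+
+class: extra-details
+
+## What does that look like?
+
+For illustration, here is a tiny configuration (a sketch; the resource below is made up, and bucket names must be globally unique):
+
+```tf
+# main.tf — a single resource, managed by the google provider
+resource "google_storage_bucket" "assets" {
+  name     = "assets-for-klstr-demo"  # made-up name; must be globally unique
+  location = "EU"
+}
+```
+
+`terraform apply` creates the bucket; edit the file and re-run `terraform apply` to update it; `terraform destroy` deletes it.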
+
+---
+
+## What Terraform *is not*
+
+- It's not a tool to abstract the differences between cloud providers
+
+  (“I want to move my AWS workloads to Scaleway!”)
+
+- It's not a configuration management tool
+
+  (“I want to install and configure packages on my servers!”)
+
+- It's not an application deployment tool
+
+  (“I want to deploy a new build of my app!”)
+
+- It can be used for these things anyway (more or less successfully)
+
+---
+
+## Vocabulary
+
+- Configuration = a set of Terraform files
+
+  - typically in HCL (HashiCorp Configuration Language), `.tf` extension
+
+  - can also be JSON
+
+- Resource = a thing that will be managed by Terraform
+
+  - e.g. VM, cluster, load balancer...
+
+- Provider = plugin to manage a family of resources
+
+  - example: `google` provider to talk to the GCP APIs
+
+  - example: `tls` provider to generate keys
diff --git a/slides/terraform/nodepools-gke.md b/slides/terraform/nodepools-gke.md
new file mode 100644
index 00000000..33f21cc0
--- /dev/null
+++ b/slides/terraform/nodepools-gke.md
@@ -0,0 +1,148 @@
+## Node pools on GKE
+
+⚠️ Disclaimer
+
+I don't claim to fully know and understand GKE's concepts and APIs.
+
+I don't know their rationale or their underlying implementation.
+
+The techniques that I'm going to explain here work for me, but there
+might be better ones.
+
+---
+
+## The default node pool
+
+- Defined within the `google_container_cluster` resource
+
+- Uses a `node_config` block and `initial_node_count`
+
+- If it's defined, it should be the only node pool!
+
+- Disable it with either:
+
+  `initial_node_count = 1` and `remove_default_node_pool = true`
+
+  *or*
+
+  a dummy `node_pool` block and a `lifecycle` block to ignore changes to `node_pool`
+
+---
+
+class: extra-details
+
+## What's going on with the node pools?
+
+When we run `terraform apply` (or, more accurately, `terraform plan`)...
+
+- Terraform invokes the `google` provider to enumerate resources
+
+- the provider lists the clusters and node pools
+
+- it includes *all* the node pools in the cluster resources
+
+- ...even the ones that are declared as separate resources
+
+- Terraform notices these "new" node pools and wants to remove them
+
+- we can tell Terraform to ignore these node pools with a `lifecycle` block
+
+- I *think* that `remove_default_node_pool` achieves the same result 🤔
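+
+---
+
+class: extra-details
+
+## Variant: `remove_default_node_pool`
+
+If we go with the first option instead, the cluster resource could look like the sketch below (I haven't exercised this variant as much as the one on the next slide):
+
+```tf
+resource "google_container_cluster" "mycluster" {
+  name     = "klstr"
+  location = "europe-north1-a"
+
+  # Create the smallest possible default node pool, then remove it right away.
+  initial_node_count       = 1
+  remove_default_node_pool = true
+}
+```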
+
+---
+
+## Our new cluster resource
+
+```tf
+resource "google_container_cluster" "mycluster" {
+  name     = "klstr"
+  location = "europe-north1-a"
+
+  # We won't use that node pool but we have to declare it anyway.
+  # It will remain empty so we don't have to worry about it.
+  node_pool {
+    name = "builtin"
+  }
+  lifecycle {
+    ignore_changes = [ node_pool ]
+  }
+}
+```
+
+---
+
+## Our normal node pool
+
+```tf
+resource "google_container_node_pool" "ondemand" {
+  name    = "ondemand"
+  cluster = google_container_cluster.mycluster.id
+  autoscaling {
+    min_node_count = 0
+    max_node_count = 5
+  }
+  node_config {
+    preemptible = false
+  }
+}
+```
+
+---
+
+## Our preemptible node pool
+
+```tf
+resource "google_container_node_pool" "preemptible" {
+  name               = "preemptible"
+  cluster            = google_container_cluster.mycluster.id
+  initial_node_count = 1
+  autoscaling {
+    min_node_count = 1
+    max_node_count = 5
+  }
+  node_config {
+    preemptible = true
+  }
+}
+```
+
+---
+
+## Scale to zero
+
+- It is possible to scale an individual node pool to zero
+
+- The cluster autoscaler will be able to scale up an empty node pool
+
+  (and scale it back down to zero when it's not needed anymore)
+
+- However, our cluster must have at least one node
+
+  (the cluster autoscaler can't/won't work if we have zero nodes)
+
+- Make sure that at least one pool has at least one node!
+
+---
+
+## Taints and labels
+
+- We will typically use node selectors and tolerations to control where pods are scheduled
+
+- The corresponding labels and taints must be set on the node pools
+
+```tf
+resource "google_container_node_pool" "bignodes" {
+  ...
+  node_config {
+    machine_type = "n2-standard-4"
+    labels = {
+      expensive = ""
+    }
+    taint {
+      key    = "expensive"
+      value  = ""
+      effect = "NO_SCHEDULE"
+    }
+  }
+}
+```
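+
+---
+
+class: extra-details
+
+## On the pod side
+
+For illustration, here is a sketch of a matching Deployment, written with the `kubernetes` provider (which is *not* set up in our configuration; the app name and image are placeholders). It is the equivalent of a `nodeSelector` and a `toleration` in plain YAML:
+
+```tf
+# Sketch only: assumes the kubernetes provider is configured elsewhere.
+resource "kubernetes_deployment" "bigjob" {
+  metadata { name = "bigjob" }
+  spec {
+    selector { match_labels = { app = "bigjob" } }
+    template {
+      metadata { labels = { app = "bigjob" } }
+      spec {
+        # Only run on nodes carrying the "expensive" label...
+        node_selector = { expensive = "" }
+        # ...and tolerate the matching taint set on the node pool.
+        toleration {
+          key      = "expensive"
+          operator = "Equal"
+          value    = ""
+          effect   = "NoSchedule"
+        }
+        container {
+          name  = "main"
+          image = "jpetazzo/color"
+        }
+      }
+    }
+  }
+}
+```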
diff --git a/slides/terraform/nodepools.md b/slides/terraform/nodepools.md
new file mode 100644
index 00000000..cba89f9f
--- /dev/null
+++ b/slides/terraform/nodepools.md
@@ -0,0 +1,125 @@
+## Saving (lots of) money
+
+- Our load (number and size of pods) is probably variable
+
+- We need *cluster autoscaling*
+
+  (add/remove nodes as we need them, pay only for what we use)
+
+- We might need nodes of different sizes
+
+  (or with specialized hardware: local fast disks, GPUs...)
+
+- If possible, we should leverage "spot" or "preemptible" capacity
+
+  (VMs that are significantly cheaper but can be terminated on short notice)
+
+---
+
+## Node pools
+
+- We will have multiple *node pools*
+
+- A node pool is a set of nodes running in a single zone
+
+- The nodes in a pool usually¹ have the same size
+
+- They have the same "preemptibility"
+
+  (i.e. a node pool is either "on-demand" or "preemptible")
+
+- The Kubernetes cluster autoscaler is aware of the node pools
+
+- When it scales up the cluster, it decides which pool(s) to scale up
+
+.footnote[¹On AWS EKS, node pools map to ASGs, which can have mixed instance types.]
+
+---
+
+## Example: big batch
+
+- Every few days, we want to process a batch made of thousands of jobs
+
+- Each job requires lots of RAM (10+ GB) and takes hours to complete
+
+- We want to process the batch as fast as possible
+
+- We don't want to pay for nodes when we don't use them
+
+- Solution:
+
+  - one node pool with tiny nodes for basic cluster services
+
+  - one node pool with huge nodes for batch processing
+
+  - that second node pool "scales to zero"
+
+---
+
+## Gotchas
+
+- Make sure that long-running pods *never* run on the big nodes
+
+  (use *taints* and *tolerations*; otherwise they will prevent the big nodes from scaling back down)
+
+- Keep an eye on preemptions
+
+  (especially on very long jobs taking 10+ hours or even days)
+
+---
+
+## Example: mixed load
+
+- Running a majority of stateless apps
+
+- We want to reduce overall cost (target: 25-50%)
+
+- We can accept occasional small disruptions (performance degradations)
+
+- Solution:
+
+  - one node pool with "on demand" nodes
+
+  - one node pool with "spot" / "preemptible" nodes
+
+  - pin stateful apps to "on demand" nodes
+
+  - *try* to balance stateless apps between the two pools
+
+---
+
+## Gotchas
+
+- We can tell the Kubernetes scheduler to *prefer* balancing across pools
+
+- We don't have a way to *require* it
+
+- Besides, what should happen when balancing isn't possible?
+
+  (e.g. if spot capacity is unavailable)
+
+- In practice, preemption can be very rare
+
+- This means big savings, but we should have a "plan B" just in case
+
+  (perhaps think about which services can tolerate a rare outage)
+
+---
+
+## In practice
+
+- Most managed Kubernetes providers give us ways to create multiple node pools
+
+- Sometimes the pools are declared as *blocks* within the cluster resource
+
+  - pros: simpler, sometimes faster to provision
+
+  - cons: changing the pool configuration generally forces re-creation of the cluster
+
+- Sometimes the pools are declared as independent resources
+
+  - pros: we can add/remove/change pools without destroying the cluster
+
+  - cons: more complex
+
+- Most providers recommend declaring the pools independently
diff --git a/slides/terraform/quickstart-gke.md b/slides/terraform/quickstart-gke.md
new file mode 100644
index 00000000..06e53f3d
--- /dev/null
+++ b/slides/terraform/quickstart-gke.md
@@ -0,0 +1,120 @@
+## GKE quick start
+
+- Install Terraform and the Google Cloud SDK (`gcloud`)
+
+- Authenticate with `gcloud auth login`
+
+  (if Terraform complains about missing credentials later, also run `gcloud auth application-default login`)
+
+- Create a project or use one of your existing ones
+
+- Set the `GOOGLE_PROJECT` env var to the project ID
+
+  (this will be used by Terraform)
+
+Note 1: there must be a billing account linked to the project.
+
+Note 2: if the required APIs are not enabled on the project,
+we will get error messages telling us "please enable that API"
+when using the APIs for the first time. The error messages
+should include instructions for this one-time setup.
+
+---
+
+## Create configuration
+
+- Create an empty directory
+
+- Create a bunch of `.tf` files as shown on the next slides
+
+  (feel free to adjust the values!)
+
+---
+
+## Configuring providers
+
+- We'll use the [google provider](https://registry.terraform.io/providers/hashicorp/google)
+
+- It's an official provider (maintained by `hashicorp`)
+
+- This means that we don't have to add it explicitly to our configuration
+
+  (`terraform init` will take care of it automatically)
+
+- That simplifies our "getting started" experience a tiny bit!
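+
+If we wanted to declare it explicitly anyway (e.g. to pin a version), it would look like this (the version constraint below is only an example):
+
+```tf
+terraform {
+  required_providers {
+    google = {
+      source  = "hashicorp/google"
+      version = "~> 4.0" # adjust to whatever is current
+    }
+  }
+}
+```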
+
+---
+
+## `cluster.tf`
+
+```tf
+resource "google_container_cluster" "mycluster" {
+  name               = "klstr"
+  location           = "europe-north1-a"
+  initial_node_count = 2
+}
+```
+
+- [`google_container_cluster`](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster) is the *type* of the resource
+
+- `mycluster` is the internal Terraform name for that resource
+
+  (useful if we have multiple resources of that type)
+
+- `location` can be a *zone* or a *region* (see next slides for details)
+
+- don't forget `initial_node_count`, otherwise we get a zero-node cluster 🙃
+
+---
+
+## Regional vs Zonal vs Multi-Zonal
+
+- If `location` is a zone, we get a "zonal" cluster
+
+  (control plane and nodes are in a single zone)
+
+- If `location` is a region, we get a "regional" cluster
+
+  (control plane and nodes span all zones in this region)
+
+- In a region with Z zones, if we say we want N nodes...
+
+  ...we get Z×N nodes
+
+- We can also set `location` to be a zone, and set additional `node_locations`
+
+- This gives us a "multi-zonal" cluster, with the control plane in a single zone
+
+---
+
+## Create the cluster
+
+- Initialize providers:
+  ```bash
+  terraform init
+  ```
+
+- Create the cluster:
+  ```bash
+  terraform apply
+  ```
+
+- We'll explain later what that "plan" thing is; just approve it for now!
+
+- Check what happens if we run `terraform apply` again
+
+---
+
+## Now what?
+
+- Let's connect to the cluster
+
+- Get the credentials for the cluster:
+  ```bash
+  gcloud container clusters get-credentials klstr --zone=europe-north1-a
+  ```
+
+  (Adjust the `zone` if you changed it earlier!)
+
+- This will add the cluster to our `kubeconfig` file
+
+- Deploy a simple app to the cluster
+
+🎉
diff --git a/slides/terraform/quickstart.md b/slides/terraform/quickstart.md
new file mode 100644
index 00000000..ce5aaaaa
--- /dev/null
+++ b/slides/terraform/quickstart.md
@@ -0,0 +1,25 @@
+## Quick start
+
+- We're going to use Terraform to deploy a basic Kubernetes cluster
+
+- We will need cloud credentials (make sure you have a valid cloud account!)
+
+---
+
+## Steps
+
+1. Install Terraform (it's a single Go binary to download)
+
+2. Configure credentials (e.g. `gcloud auth login`)
+
+3. Create a Terraform *configuration*
+
+4. Add *providers* to the configuration
+
+5. Initialize providers with `terraform init`
+
+6. Add *resources* to the configuration
+
+7. Create (or update) the resources with `terraform apply`
+
+8. Repeat steps 6-7 (or 4-5-6-7) as needed
diff --git a/slides/terraform/state.md b/slides/terraform/state.md
new file mode 100644
index 00000000..74b82ed1
--- /dev/null
+++ b/slides/terraform/state.md
@@ -0,0 +1,59 @@
+## State
+
+- Terraform keeps track of the *state* of our resources
+
+- Resources created by Terraform are added to the state
+
+- When we run Terraform, it will:
+
+  - *refresh* the state (check if resources have changed since last time it ran)
+
+  - generate a *plan* (decide which actions need to be taken)
+
+  - ask for confirmation (this can be skipped)
+
+  - *apply* that plan
+
+---
+
+## Remote state
+
+- By default, the state is stored in `terraform.tfstate`
+
+- This is a JSON file (feel free to inspect it!)
+
+- The state can also be stored in a central place
+
+  (e.g. cloud object store, Consul, etcd...)
+
+- This is more convenient when working as a team
+
+- It also requires *locking*
+
+  (to prevent concurrent modifications)
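+
+---
+
+class: extra-details
+
+## Declaring a remote backend
+
+The remote backend is declared in the configuration itself. For instance, with an S3 bucket, it could look like the sketch below (bucket and table names are made up):
+
+```tf
+terraform {
+  backend "s3" {
+    bucket         = "our-terraform-state"    # must be created beforehand
+    key            = "clusters/klstr.tfstate"
+    region         = "eu-north-1"
+    dynamodb_table = "terraform-locks"        # used for state locking
+  }
+}
+```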
+
+---
+
+## Working with remote state
+
+- This is beyond the scope of this workshop
+
+- Note that if a Terraform configuration stores its state in e.g. an S3 bucket...
+
+  ...that configuration cannot create that S3 bucket itself!
+
+- The bucket must be created beforehand
+
+  (Terraform won't be able to run until the bucket is available)
+
+---
+
+## Manipulating state
+
+`terraform state list`
+
+(list the resources tracked in the state)
+
+`terraform state show google_container_cluster.mycluster`
+
+(show the attributes of a resource in the state)
+
+`terraform state rm`
+
+(remove a resource from the state, without destroying it)
+
+`terraform import`
+
+(bring an existing resource under Terraform management)
diff --git a/slides/terraform/variables.md b/slides/terraform/variables.md
new file mode 100644
index 00000000..60301e50
--- /dev/null
+++ b/slides/terraform/variables.md
@@ -0,0 +1,71 @@
+## Variables
+
+- At this point, we are probably:
+
+  - duplicating a lot of information (e.g. zone, number of nodes...)
+
+  - hard-coding a lot of things as well (ditto!)
+
+- Let's see how we can do better!
+
+---
+
+## [Input variables](https://www.terraform.io/language/values/variables)
+
+Declaring an input variable:
+```tf
+variable "location" {
+  type    = string
+  default = "europe-north1-a"
+}
+```
+
+Using an input variable:
+```tf
+resource "google_container_cluster" "mycluster" {
+  location = var.location
+  ...
+}
+```
+
+---
+
+## Setting variables
+
+Input variables can be set with:
+
+- environment variables (`export TF_VAR_location=us-west1`)
+
+- a file named `terraform.tfvars` (`location = "us-west1"`)
+
+- a file named `terraform.tfvars.json`
+
+- files named `*.auto.tfvars` and `*.auto.tfvars.json`
+
+- command-line literal values (`-var location=us-west1`)
+
+- command-line file names (`-var-file carbon-neutral.tfvars`)
+
+The latter take precedence over the former.
+
+---
+
+## [Local values](https://www.terraform.io/language/values/locals)
+
+Declaring and setting a local value:
+```tf
+locals {
+  location = var.location != null ? var.location : "europe-north1-a"
+  region   = replace(local.location, "/-[a-z]$/", "")
+}
+```
+
+We can have multiple `locals` blocks.
+
+Using a local value:
+```tf
+resource "google_container_cluster" "mycluster" {
+  location = local.location
+  ...
+}
+```
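+
+---
+
+class: extra-details
+
+## Putting it together
+
+For instance, we could parametrize the autoscaling bounds of a node pool, as in the sketch below (the variable names are arbitrary):
+
+```tf
+variable "min_nodes" {
+  type    = number
+  default = 1
+}
+
+variable "max_nodes" {
+  type    = number
+  default = 5
+}
+
+resource "google_container_node_pool" "ondemand" {
+  name     = "ondemand"
+  cluster  = google_container_cluster.mycluster.id
+  location = local.location
+
+  autoscaling {
+    min_node_count = var.min_nodes
+    max_node_count = var.max_nodes
+  }
+}
+```
+
+Then `terraform apply -var max_nodes=10` (or a `.tfvars` file) overrides the default.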