Add Terraform workshop with GKE and node pools

Jérôme Petazzoni
2022-01-17 00:00:49 +01:00
parent de0ad83686
commit 69c7ac2371
9 changed files with 696 additions and 0 deletions


@@ -0,0 +1,9 @@
## Exercise — Terraform Node Pools
- Write a Terraform configuration to deploy a cluster
- The cluster should have two node pools with autoscaling
- Deploy two apps, each using exclusively one node pool
- Bonus: deploy an app balanced across both node pools


@@ -0,0 +1,69 @@
# Exercise — Terraform Node Pools
- Write a Terraform configuration to deploy a cluster
- The cluster should have two node pools with autoscaling
- Deploy two apps, each using exclusively one node pool
- Bonus: deploy an app balanced across both node pools
---
## Cluster deployment
- Write a Terraform configuration to deploy a cluster
- We want to have two node pools with autoscaling
- Example for sizing:
- 4 GB / 1 CPU per node
- pools of 1 to 4 nodes
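One possible shape for such a pool, sketched with the `google` provider (the cluster reference and the `e2-medium` machine type — 2 vCPUs / 4 GB, close to the suggested sizing — are assumptions, not a prescribed solution):
```tf
# Hypothetical sketch of one of the two autoscaling pools (1 to 4 nodes).
resource "google_container_node_pool" "pool1" {
  name               = "pool1"
  cluster            = google_container_cluster.mycluster.id  # assumed cluster resource
  initial_node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 4
  }
  node_config {
    machine_type = "e2-medium"  # 2 vCPUs / 4 GB; adjust to taste
  }
}
```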
---
## Cluster autoscaling
- Deploy an app on the cluster
(you can use `nginx`, `jpetazzo/color`...)
- Set a resource request (e.g. 1 GB RAM)
- Scale up and verify that the autoscaler kicks in
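If you want to drive the test app from Terraform too, here is a rough sketch using the `kubernetes` provider (assuming it is configured to point at the cluster; the app name is made up, and `kubectl create deployment` + `kubectl scale` works just as well):
```tf
# Hypothetical test app with a 1 GB memory request per pod.
resource "kubernetes_deployment" "testapp" {
  metadata {
    name = "testapp"
  }
  spec {
    replicas = 1  # bump this and re-apply to trigger a scale-up
    selector {
      match_labels = { app = "testapp" }
    }
    template {
      metadata {
        labels = { app = "testapp" }
      }
      spec {
        container {
          name  = "color"
          image = "jpetazzo/color"
          resources {
            requests = { memory = "1Gi" }
          }
        }
      }
    }
  }
}
```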
---
## Pool isolation
- We want to deploy two apps
- The first app should be deployed exclusively on the first pool
- The second app should be deployed exclusively on the second pool
- Check the next slide for hints!
---
## Hints
- One solution involves adding a `nodeSelector` to the pod templates
- Another solution involves adding:
- `taints` to the node pools
- matching `tolerations` to the pod templates
---
## Balancing
- Step 1: make sure that the pools are not balanced
- Step 2: deploy a new app, check that it goes to the emptiest pool
- Step 3: update the app so that it balances (as much as possible) between pools
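One way to express step 3 is a topology spread constraint keyed on the node pool label; the fragment below is a sketch (it goes inside the pod template `spec` of a `kubernetes_deployment`, and assumes GKE's `cloud.google.com/gke-nodepool` node label and an app labeled `app=balanced`):
```tf
# Hypothetical fragment: spread replicas across node pools as evenly as
# possible, but still schedule if one pool cannot take more pods.
topology_spread_constraint {
  max_skew           = 1
  topology_key       = "cloud.google.com/gke-nodepool"
  when_unsatisfiable = "ScheduleAnyway"
  label_selector {
    match_labels = { app = "balanced" }
  }
}
```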

slides/terraform/intro.md Normal file

@@ -0,0 +1,70 @@
# Terraform
“An open-source **infrastructure as code** software tool created by HashiCorp¹.”
- Other products in that space: Pulumi, CloudFormation...
- Very rich ecosystem
- Supports many cloud providers
.footnote[¹Also creators of Consul, Nomad, Packer, Vagrant, Vault...]
---
## Infrastructure as code?
1. Write configuration files that describe resources, e.g.:
- some GKE and Kapsule Kubernetes clusters
- some S3 buckets
- a bunch of Linode/Digital Ocean instances
- ...and more
2. Run `terraform apply` to create all these things
3. Make changes to the configuration files
4. Run `terraform apply` again to create/update/delete resources
5. Run `terraform destroy` to delete all these things
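For instance, a configuration file could contain a single resource like this (the bucket name is made up for the example):
```tf
# Hypothetical main.tf: "terraform apply" creates this bucket,
# editing the block and re-applying updates it,
# "terraform destroy" deletes it.
resource "google_storage_bucket" "assets" {
  name     = "my-unique-bucket-name-42"
  location = "EU"
}
```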
---
## What Terraform *is not*
- It's not a tool to abstract the differences between cloud providers
(“I want to move my AWS workloads to Scaleway!”)
- It's not a configuration management tool
(“I want to install and configure packages on my servers!”)
- It's not an application deployment tool
(“I want to deploy a new build of my app!”)
- It can be used for these things anyway (more or less successfully)
---
## Vocabulary
- Configuration = a set of Terraform files
- typically in HCL (HashiCorp Configuration Language), `.tf` extension
- can also be JSON
- Resource = a thing that will be managed by Terraform
- e.g. VM, cluster, load balancer...
- Provider = plugin to manage a family of resources
- example: `google` provider to talk with GCP APIs
- example: `tls` provider to generate keys
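Providers are declared (and can be version-pinned) in a `terraform` block; a minimal sketch — optional for official providers like `google`, but good practice (the version constraint here is just an example):
```tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"  # example constraint; pin whatever is current for you
    }
  }
}
```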


@@ -0,0 +1,148 @@
## Node pools on GKE
⚠️ Disclaimer
I do not pretend to fully know and understand GKE's concepts and APIs.
I do not know their rationales and underlying implementations.
The techniques that I'm going to explain here work for me, but there
might be better ones.
---
## The default node pool
- Defined within the `google_container_cluster` resource
- Uses `node_config` block and `initial_node_count`
- If it's defined, it should be the only node pool!
- Disable it with either:
`initial_node_count=1` and `remove_default_node_pool=true`
*or*
a dummy `node_pool` block and a `lifecycle` block to ignore changes to the `node_pool`
---
class: extra-details
## What's going on with the node pools?
When we run `terraform apply` (or, more accurately, `terraform plan`)...
- Terraform invokes the `google` provider to enumerate resources
- the provider lists the clusters and node pools
- it includes the node pools in the cluster resources
- ...even if they are declared separately
- Terraform notices these "new" node pools and wants to remove them
- we can tell Terraform to ignore these node pools with a `lifecycle` block
- I *think* that `remove_default_node_pool` achieves the same result 🤔
---
## Our new cluster resource
```tf
resource "google_container_cluster" "mycluster" {
  name     = "klstr"
  location = "europe-north1-a"
  # We won't use that node pool but we have to declare it anyway.
  # It will remain empty so we don't have to worry about it.
  node_pool {
    name = "builtin"
  }
  lifecycle {
    ignore_changes = [ node_pool ]
  }
}
```
---
## Our normal node pool
```tf
resource "google_container_node_pool" "ondemand" {
  name    = "ondemand"
  cluster = google_container_cluster.mycluster.id
  autoscaling {
    # This pool is allowed to scale all the way down to zero nodes.
    min_node_count = 0
    max_node_count = 5
  }
  node_config {
    preemptible = false
  }
}
```
---
## Our preemptible node pool
```tf
resource "google_container_node_pool" "preemptible" {
  name               = "preemptible"
  cluster            = google_container_cluster.mycluster.id
  initial_node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  node_config {
    preemptible = true
  }
}
```
---
## Scale to zero
- It is possible to scale a single node pool to zero
- The cluster autoscaler will be able to scale up an empty node pool
(and scale it back down to zero when it's not needed anymore)
- However, our cluster must have at least one node
(the cluster autoscaler can't/won't work if we have zero nodes)
- Make sure that at least one pool has at least one node!
---
## Taints and labels
- We will typically use node selectors and tolerations to schedule pods
- The corresponding labels and taints must be set on the node pools
```tf
resource "google_container_node_pool" "bignodes" {
  ...
  node_config {
    machine_type = "n2-standard-4"
    labels = {
      expensive = ""
    }
    taint {
      key    = "expensive"
      value  = ""
      effect = "NO_SCHEDULE"
    }
  }
}
```
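For reference, the matching pod-side settings, sketched with the `kubernetes` provider (the same thing can be done in a YAML manifest); this fragment would go inside the pod template `spec` of a `kubernetes_deployment`:
```tf
# Hypothetical fragment: only run on (and tolerate) the "expensive" nodes above.
node_selector = {
  expensive = ""
}
toleration {
  key      = "expensive"
  operator = "Exists"
  effect   = "NoSchedule"
}
```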


@@ -0,0 +1,125 @@
## Saving (lots of) money
- Our load (number and size of pods) is probably variable
- We need *cluster autoscaling*
(add/remove nodes as we need them, pay only for what we use)
- We might need nodes of different sizes
(or with specialized hardware: local fast disks, GPUs...)
- If possible, we should leverage "spot" or "preemptible" capacity
(VMs that are significantly cheaper but can be terminated on short notice)
---
## Node pools
- We will have multiple *node pools*
- A node pool is a set of nodes running in a single zone
- The nodes usually¹ have the same size
- They have the same "preemptability"
(i.e. a node pool is either "on-demand" or "preemptible")
- The Kubernetes cluster autoscaler is aware of the node pools
- When it scales up the cluster, it decides which pool(s) to scale up
.footnote[¹On AWS EKS, node pools map to ASGs, which can have mixed instance types.]
---
## Example: big batch
- Every few days, we want to process a batch made of thousands of jobs
- Each job requires lots of RAM (10+ GB) and takes hours to complete
- We want to process the batch as fast as possible
- We don't want to pay for nodes when we don't use them
- Solution:
- one node group with tiny nodes for basic cluster services
- one node group with huge nodes for batch processing
- that second node group "scales to zero"
---
## Gotchas
- Make sure that long-running pods *never* run on big nodes
(use *taints* and *tolerations*)
- Keep an eye on preemptions
(especially on very long jobs taking 10+ hours or even days)
---
## Example: mixed load
- Running a majority of stateless apps
- We want to reduce overall cost (target: 25-50%)
- We can accept occasional small disruptions (performance degradations)
- Solution:
- one node group with "on demand" nodes
- one node group with "spot" / "preemptible" nodes
- pin stateful apps to "on demand" nodes
- *try* to balance stateless apps between the two pools
---
## Gotchas
- We can tell the Kubernetes scheduler to *prefer* balancing across pools
- We don't have a way to *require* it
- What should we do if balancing isn't possible?
(e.g. if spot capacity is unavailable)
- In practice, preemption can be very rare
- This means big savings, but we should have a "plan B" just in case
(perhaps think about which services can tolerate a rare outage)
---
## In practice
- Most managed Kubernetes providers give us ways to create multiple node pools
- Sometimes the pools are declared as *blocks* within the cluster resources
- pros: simpler, sometimes faster to provision
- cons: changing the pool configuration generally forces re-creation of the cluster
- Sometimes the pools are declared as independent resources
- pros: can add/remove/change pools without destroying the cluster
- cons: more complex
- Most providers recommend declaring the pools independently


@@ -0,0 +1,120 @@
## GKE quick start
- Install Terraform and GCP SDK (`gcloud`)
- Authenticate with `gcloud auth login`
- Create a project or use one of your existing ones
- Set the `GOOGLE_PROJECT` env var to the project name
(this will be used by Terraform)
Note 1: there must be a billing account linked to the project.
Note 2: if the required APIs are not enabled on the project,
we will get "please enable that API" error messages the first
time we use them. These messages should include instructions
for this one-time setup.
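Alternatively, the project (and a default region/zone) can be set in a provider block instead of the environment; a sketch with a made-up project ID:
```tf
provider "google" {
  project = "my-workshop-project"  # hypothetical; use your own project ID
  region  = "europe-north1"
  zone    = "europe-north1-a"
}
```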
---
## Create configuration
- Create an empty directory
- Create a bunch of `.tf` files as shown on the next slides
(feel free to adjust the values!)
---
## Configuring providers
- We'll use the [google provider](https://registry.terraform.io/providers/hashicorp/google)
- It's an official provider (maintained by `hashicorp`)
- Which means that we don't have to add it explicitly to our configuration
(`terraform init` will take care of it automatically)
- That'll simplify our "getting started" experience a tiny bit!
---
## `cluster.tf`
```tf
resource "google_container_cluster" "mycluster" {
  name               = "klstr"
  location           = "europe-north1-a"
  initial_node_count = 2
}
```
- [`google_container_cluster`](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster) is the *type* of the resource
- `mycluster` is the internal Terraform name for that resource
(useful if we have multiple resources of that type)
- `location` can be a *zone* or a *region* (see next slides for details)
- don't forget `initial_node_count` otherwise we get a zero-node cluster 🙃
---
## Regional vs Zonal vs Multi-Zonal
- If `location` is a zone, we get a "zonal" cluster
(control plane and nodes are in a single zone)
- If `location` is a region, we get a "regional" cluster
(control plane and nodes span all zones in this region)
- In a region with Z zones, if we say we want N nodes...
...we get Z×N nodes
- We can also set `location` to be a zone, and set additional `node_locations`
- We get a "multi-zonal" cluster with control plane in a single zone
---
## Create the cluster
- Initialize providers
```bash
terraform init
```
- Create the cluster:
```bash
terraform apply
```
- We'll explain later what that "plan" thing is; just approve it for now!
- Check what happens if we run `terraform apply` again
---
## Now what?
- Let's connect to the cluster
- Get the credentials for the cluster:
```bash
gcloud container clusters get-credentials klstr --zone=europe-north1-a
```
(Adjust the `zone` if you changed it earlier!)
- This will add the cluster to our `kubeconfig` file
- Deploy a simple app to the cluster
🎉
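If you'd rather have Terraform itself deploy that app (like the `kubernetes` provider sketches elsewhere in these slides), the provider can be wired to the cluster we just created; a rough sketch:
```tf
data "google_client_config" "default" {}

provider "kubernetes" {
  host  = "https://${google_container_cluster.mycluster.endpoint}"
  token = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(
    google_container_cluster.mycluster.master_auth[0].cluster_ca_certificate
  )
}
```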


@@ -0,0 +1,25 @@
## Quick start
- We're going to use Terraform to deploy a basic Kubernetes cluster
- We will need cloud credentials (make sure you have a valid cloud account!)
---
## Steps
1. Install Terraform (download single Go binary)
2. Configure credentials (e.g. `gcloud auth login`)
3. Create Terraform *configuration*
4. Add *providers* to the configuration
5. Initialize providers with `terraform init`
6. Add *resources* to the configuration
7. Realize the resources with `terraform apply`
8. Repeat 6-7 or 4-5-6-7

slides/terraform/state.md Normal file

@@ -0,0 +1,59 @@
## State
- Terraform keeps track of the *state*
- Resources created by Terraform are added to the state
- When we run Terraform, it will:
- *refresh* the state (check if resources have changed since last time it ran)
- generate a *plan* (decide which actions need to be taken)
- ask confirmation (this can be skipped)
- *apply* that plan
---
## Remote state
- By default, the state is stored in `terraform.tfstate`
- This is a JSON file (feel free to inspect it!)
- The state can also be stored in a central place
(e.g. cloud object store, Consul, etcd...)
- This is more convenient when working as a team
- It also requires *locking*
(to prevent concurrent modifications)
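For instance, with a GCS bucket (the bucket name is made up; the GCS backend also handles locking for us):
```tf
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"  # hypothetical; must exist before "terraform init"
    prefix = "workshop/cluster"
  }
}
```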
---
## Working with remote state
- This is beyond the scope of this workshop
- Note that if a Terraform configuration defines e.g. an S3 bucket to store its state...
...that configuration cannot create that S3 bucket!
- The bucket must be created beforehand
(Terraform won't be able to run until the bucket is available)
---
## Manipulating state
- `terraform state list` (list the resources tracked in the state)
- `terraform state show google_container_cluster.mycluster` (show the attributes of one resource)
- `terraform state rm` (remove a resource from the state without destroying it)
- `terraform import` (bring an existing resource under Terraform management)


@@ -0,0 +1,71 @@
## Variables
- At this point, we are probably:
- duplicating a lot of information (e.g. zone, number of nodes...)
- hard-coding a lot of things as well (ditto!)
- Let's see how we can do better!
---
## [Input variables](https://www.terraform.io/language/values/variables)
Declaring an input variable:
```tf
variable "location" {
  type    = string
  default = "europe-north1-a"
}
```
Using an input variable:
```tf
resource "google_container_cluster" "mycluster" {
  location = var.location
  ...
}
```
---
## Setting variables
Input variables can be set with:
- environment variables (`export TF_VAR_location=us-west1`)
- a file named `terraform.tfvars` (`location = "us-west1"`)
- a file named `terraform.tfvars.json`
- files named `*.auto.tfvars` and `*.auto.tfvars.json`
- command-line literal values (`-var location=us-west1`)
- command-line file names (`-var-file carbon-neutral.tfvars`)
Sources further down this list take precedence over the ones above.
---
## [Local values](https://www.terraform.io/language/values/locals)
Declaring and setting a local value:
```tf
locals {
  location = var.location != null ? var.location : "europe-north1-a"
  region   = replace(local.location, "/-[a-z]$/", "")
}
```
We can have multiple `locals` blocks.
Using a local value:
```tf
resource "google_container_cluster" "mycluster" {
  location = local.location
  ...
}
```