➕ Add Terraform workshop with GKE and node pools

new file: slides/exercises/tf-nodepools-brief.md

## Exercise — Terraform Node Pools

- Write a Terraform configuration to deploy a cluster

- The cluster should have two node pools with autoscaling

- Deploy two apps, each using exclusively one node pool

- Bonus: deploy an app balanced across both node pools

new file: slides/exercises/tf-nodepools-details.md

# Exercise — Terraform Node Pools

- Write a Terraform configuration to deploy a cluster

- The cluster should have two node pools with autoscaling

- Deploy two apps, each using exclusively one node pool

- Bonus: deploy an app balanced across both node pools

---

## Cluster deployment

- Write a Terraform configuration to deploy a cluster

- We want to have two node pools with autoscaling

- Example for sizing:

  - 4 GB / 1 CPU per node

  - pools of 1 to 4 nodes

---

## Cluster autoscaling

- Deploy an app on the cluster

  (you can use `nginx`, `jpetazzo/color`...)

- Set a resource request (e.g. 1 GB RAM; see the sketch below)

- Scale up and verify that the autoscaler kicks in
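
One way to do this is with the Terraform `kubernetes` provider (plain YAML manifests or `kubectl` work just as well). A minimal sketch, assuming provider 2.x and hypothetical names:

```tf
resource "kubernetes_deployment" "testapp" {
  metadata {
    name = "testapp"
  }
  spec {
    replicas = 1
    selector {
      match_labels = { app = "testapp" }
    }
    template {
      metadata {
        labels = { app = "testapp" }
      }
      spec {
        container {
          name  = "testapp"
          image = "jpetazzo/color"
          resources {
            # 1 GB request: a handful of replicas fills a 4 GB node
            requests = {
              memory = "1Gi"
            }
          }
        }
      }
    }
  }
}
```

Scaling up is then just a matter of increasing `replicas` and watching the autoscaler add nodes.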
---

## Pool isolation

- We want to deploy two apps

- The first app should be deployed exclusively on the first pool

- The second app should be deployed exclusively on the second pool

- Check the next slide for hints!

---

## Hints

- One solution involves adding a `nodeSelector` to the pod templates (sketched below)

- Another solution involves adding:

  - `taints` to the node pools

  - matching `tolerations` to the pod templates
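
A minimal sketch of the pod-template side, again with the Terraform `kubernetes` provider (YAML works just as well). The `cloud.google.com/gke-nodepool` label is set automatically on GKE nodes; the pool name, taint key, and app name are hypothetical:

```tf
resource "kubernetes_deployment" "app1" {
  metadata {
    name = "app1"
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "app1" }
    }
    template {
      metadata {
        labels = { app = "app1" }
      }
      spec {
        container {
          name  = "app1"
          image = "jpetazzo/color"
        }
        # Pin the pods to the first pool...
        node_selector = {
          "cloud.google.com/gke-nodepool" = "pool1"
        }
        # ...and tolerate the taint set on that pool (if any)
        toleration {
          key      = "dedicated"
          operator = "Equal"
          value    = "pool1"
          effect   = "NoSchedule"
        }
      }
    }
  }
}
```

The second app would use the same pattern, pointing at the other pool's label and taint.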
---

## Balancing

- Step 1: make sure that the pools are not balanced

- Step 2: deploy a new app, check that it goes to the emptiest pool

- Step 3: update the app so that it balances (as much as possible) between pools

new file: slides/terraform/intro.md

# Terraform

“An open-source **infrastructure as code** software tool created by HashiCorp¹.”

- Other products in that space: Pulumi, CloudFormation...

- Very rich ecosystem

- Supports many cloud providers

.footnote[¹Also creators of Consul, Nomad, Packer, Vagrant, Vault...]

---

## Infrastructure as code?

1. Write configuration files that describe resources, e.g.:

   - some GKE and Kapsule Kubernetes clusters
   - some S3 buckets
   - a bunch of Linode/DigitalOcean instances
   - ...and more

   (a tiny example of such a file follows this list)

2. Run `terraform apply` to create all these things

3. Make changes to the configuration files

4. Run `terraform apply` again to create/update/delete resources

5. Run `terraform destroy` to delete all these things
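
Such a configuration file can be as small as this. A sketch only, with a hypothetical (globally unique) bucket name:

```tf
# main.tf: describes a single resource, one object storage bucket
resource "google_storage_bucket" "assets" {
  name     = "my-unique-assets-bucket"
  location = "EU"
}
```

Running `terraform apply` creates the bucket; editing the file and running `terraform apply` again updates or replaces it.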
---

## What Terraform *is not*

- It's not a tool to abstract the differences between cloud providers

  (“I want to move my AWS workloads to Scaleway!”)

- It's not a configuration management tool

  (“I want to install and configure packages on my servers!”)

- It's not an application deployment tool

  (“I want to deploy a new build of my app!”)

- It can be used for these things anyway (more or less successfully)

---

## Vocabulary

- Configuration = a set of Terraform files

  - typically in HCL (HashiCorp Configuration Language), `.tf` extension

  - can also be JSON

- Resource = a thing that will be managed by Terraform

  - e.g. VM, cluster, load balancer...

- Provider = plugin to manage a family of resources (see the sketch after this list)

  - example: `google` provider to talk with GCP APIs

  - example: `tls` provider to generate keys
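
For illustration, declaring and configuring a provider can look like the sketch below (the project ID is hypothetical; for official providers the `required_providers` block can even be omitted, as the quick start shows):

```tf
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = "my-project-id"   # hypothetical project ID
  region  = "europe-north1"
}
```

With that in place, `terraform init` downloads the provider plugin.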

new file: slides/terraform/nodepools-gke.md

## Node pools on GKE

⚠️ Disclaimer

I do not pretend to fully know and understand GKE's concepts and APIs.

I do not know their rationales and underlying implementations.

The techniques that I'm going to explain here work for me, but there might be better ones.

---

## The default node pool

- Defined within the `google_container_cluster` resource

- Uses `node_config` block and `initial_node_count`

- If it's defined, it should be the only node pool!

- Disable it with either:

  `initial_node_count=1` and `remove_default_node_pool=true` (sketched below)

  *or*

  a dummy `node_pool` block and a `lifecycle` block to ignore changes to the `node_pool`
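
A minimal sketch of that first option (the next slides show the second one in detail):

```tf
resource "google_container_cluster" "mycluster" {
  name     = "klstr"
  location = "europe-north1-a"

  # Create the smallest possible default pool, then remove it right away;
  # the real node pools are declared as separate resources.
  initial_node_count       = 1
  remove_default_node_pool = true
}
```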

---

class: extra-details

## What's going on with the node pools?

When we run `terraform apply` (or, more accurately, `terraform plan`)...

- Terraform invokes the `google` provider to enumerate resources

- the provider lists the clusters and node pools

- it includes the node pools in the cluster resources

- ...even if they are declared separately

- Terraform notices these "new" node pools and wants to remove them

- we can tell Terraform to ignore these node pools with a `lifecycle` block

- I *think* that `remove_default_node_pool` achieves the same result 🤔

---

## Our new cluster resource

```tf
resource "google_container_cluster" "mycluster" {
  name     = "klstr"
  location = "europe-north1-a"

  # We won't use that node pool but we have to declare it anyway.
  # It will remain empty so we don't have to worry about it.
  node_pool {
    name = "builtin"
  }
  lifecycle {
    ignore_changes = [ node_pool ]
  }
}
```

---

## Our normal node pool

```tf
resource "google_container_node_pool" "ondemand" {
  name    = "ondemand"
  cluster = google_container_cluster.mycluster.id
  autoscaling {
    min_node_count = 0
    max_node_count = 5
  }
  node_config {
    preemptible = false
  }
}
```

---

## Our preemptible node pool

```tf
resource "google_container_node_pool" "preemptible" {
  name               = "preemptible"
  cluster            = google_container_cluster.mycluster.id
  initial_node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  node_config {
    preemptible = true
  }
}
```

---

## Scale to zero

- It is possible to scale a single node pool to zero

- The cluster autoscaler will be able to scale up an empty node pool

  (and scale it back down to zero when it's not needed anymore)

- However, our cluster must have at least one node

  (the cluster autoscaler can't/won't work if we have zero nodes)

- Make sure that at least one pool has at least one node!

---

## Taints and labels

- We will typically use node selectors and tolerations to schedule pods

- The corresponding labels and taints must be set on the node pools

```tf
resource "google_container_node__pool" "bignodes" {
  ...
  node_config {
    machine_type = "n2-standard-4"
    labels = {
      expensive = ""
    }
    taint {
      key    = "expensive"
      value  = ""
      effect = "NO_SCHEDULE"
    }
  }
}
```

new file: slides/terraform/nodepools.md

## Saving (lots of) money

- Our load (number and size of pods) is probably variable

- We need *cluster autoscaling*

  (add/remove nodes as we need them, pay only for what we use)

- We might need nodes of different sizes

  (or with specialized hardware: local fast disks, GPUs...)

- If possible, we should leverage "spot" or "preemptible" capacity

  (VMs that are significantly cheaper but can be terminated on short notice)

---

## Node pools

- We will have multiple *node pools*

- A node pool is a set of nodes running in a single zone

- The nodes usually¹ have the same size

- They have the same "preemptability"

  (i.e. a node pool is either "on-demand" or "preemptible")

- The Kubernetes cluster autoscaler is aware of the node pools

- When it scales up the cluster, it decides which pool(s) to scale up

.footnote[¹On AWS EKS, node pools map to ASGs, which can have mixed instance types.]

---

## Example: big batch

- Every few days, we want to process a batch made of thousands of jobs

- Each job requires lots of RAM (10+ GB) and takes hours to complete

- We want to process the batch as fast as possible

- We don't want to pay for nodes when we don't use them

- Solution (sketched below):

  - one node group with tiny nodes for basic cluster services

  - one node group with huge nodes for batch processing

  - that second node group "scales to zero"
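
A hedged sketch of what that second node group could look like (machine type, taint key, and sizing are hypothetical):

```tf
resource "google_container_node_pool" "batch" {
  name    = "batch"
  cluster = google_container_cluster.mycluster.id
  autoscaling {
    min_node_count = 0              # scales to zero between batches
    max_node_count = 20
  }
  node_config {
    machine_type = "n2-highmem-8"   # 64 GB RAM for the big jobs
    # keep ordinary pods away unless they explicitly tolerate this taint
    taint {
      key    = "batch"
      value  = ""
      effect = "NO_SCHEDULE"
    }
  }
}
```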

---

## Gotchas

- Make sure that long-running pods *never* run on big nodes

  (use *taints* and *tolerations*)

- Keep an eye on preemptions

  (especially on very long jobs taking 10+ hours or even days)

---

## Example: mixed load

- Running a majority of stateless apps

- We want to reduce overall cost (target: 25-50%)

- We can accept occasional small disruptions (performance degradations)

- Solution:

  - one node group with "on demand" nodes

  - one node group with "spot" / "preemptible" nodes

  - pin stateful apps to "on demand" nodes

  - *try* to balance stateless apps between the two pools

---

## Gotchas

- We can tell the Kubernetes scheduler to *prefer* balancing across pools (see the sketch below)

- We don't have a way to *require* it

- What should be done anyway if it's not possible to balance?

  (e.g. if spot capacity is unavailable)

- In practice, preemption can be very rare

- This means big savings, but we should have a "plan B" just in case

  (perhaps think about which services can tolerate a rare outage)
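
That "prefer" is typically expressed with a topology spread constraint using `whenUnsatisfiable: ScheduleAnyway`. A sketch with the Terraform `kubernetes` provider (recent provider version assumed; the deployment name and labels are hypothetical, and `cloud.google.com/gke-nodepool` is the node label set automatically by GKE):

```tf
resource "kubernetes_deployment" "web" {
  metadata {
    name = "web"
  }
  spec {
    replicas = 4
    selector {
      match_labels = { app = "web" }
    }
    template {
      metadata {
        labels = { app = "web" }
      }
      spec {
        container {
          name  = "web"
          image = "nginx"
        }
        # Spread replicas across node pools when possible;
        # if one pool has no capacity, schedule anyway.
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "cloud.google.com/gke-nodepool"
          when_unsatisfiable = "ScheduleAnyway"
          label_selector {
            match_labels = { app = "web" }
          }
        }
      }
    }
  }
}
```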

---

## In practice

- Most managed Kubernetes providers give us ways to create multiple node pools

- Sometimes the pools are declared as *blocks* within the cluster resources

  - pros: simpler, sometimes faster to provision

  - cons: changing the pool configuration generally forces re-creation of the cluster

- Sometimes the pools are declared as independent resources

  - pros: can add/remove/change pools without destroying the cluster

  - cons: more complex

- Most providers recommend declaring the pools independently

new file: slides/terraform/quickstart-gke.md

## GKE quick start

- Install Terraform and the GCP SDK (`gcloud`)

- Authenticate with `gcloud auth login`

- Create a project or use one of your existing ones

- Set the `GOOGLE_PROJECT` env var to the project ID

  (this will be used by Terraform)

Note 1: there must be a billing account linked to the project.

Note 2: if the required APIs are not enabled on the project, we will get error messages telling us "please enable that API" when using the APIs for the first time. The error messages should include instructions for this one-time setup.

---

## Create configuration

- Create an empty directory

- Create a bunch of `.tf` files as shown in the next slides

  (feel free to adjust the values!)

---

## Configuring providers

- We'll use the [google provider](https://registry.terraform.io/providers/hashicorp/google)

- It's an official provider (maintained by `hashicorp`)

- Which means that we don't have to add it explicitly to our configuration

  (`terraform init` will take care of it automatically)

- That simplifies our "getting started" experience a tiny bit!

---

## `cluster.tf`

```tf
resource "google_container_cluster" "mycluster" {
  name               = "klstr"
  location           = "europe-north1-a"
  initial_node_count = 2
}
```

- [`google_container_cluster`](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster) is the *type* of the resource

- `mycluster` is the internal Terraform name for that resource

  (useful if we have multiple resources of that type)

- `location` can be a *zone* or a *region* (see next slides for details)

- don't forget `initial_node_count`, otherwise we get a zero-node cluster 🙃

---

## Regional vs Zonal vs Multi-Zonal

- If `location` is a zone, we get a "zonal" cluster

  (control plane and nodes are in a single zone)

- If `location` is a region, we get a "regional" cluster

  (control plane and nodes span all zones in this region)

- In a region with Z zones, if we say we want N nodes...

  ...we get Z×N nodes

- We can also set `location` to be a zone, and set additional `node_locations`

- We get a "multi-zonal" cluster with control plane in a single zone (sketched below)
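
A hedged sketch of such a multi-zonal cluster (zone names are examples; per the provider documentation, `node_locations` lists the *additional* zones, without repeating the cluster's own zone):

```tf
resource "google_container_cluster" "mycluster" {
  name               = "klstr"
  location           = "europe-north1-a"                      # zone of the control plane
  node_locations     = ["europe-north1-b", "europe-north1-c"] # extra zones for nodes
  initial_node_count = 2                                      # per zone!
}
```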

---

## Create the cluster

- Initialize providers
  ```bash
  terraform init
  ```

- Create the cluster:
  ```bash
  terraform apply
  ```

- We'll explain later what that "plan" thing is; just approve it for now!

- Check what happens if we run `terraform apply` again

---

## Now what?

- Let's connect to the cluster

- Get the credentials for the cluster:
  ```bash
  gcloud container clusters get-credentials klstr --zone=europe-north1-a
  ```
  (Adjust the `zone` if you changed it earlier!)

- This will add the cluster to our `kubeconfig` file

- Deploy a simple app to the cluster

🎉

new file: slides/terraform/quickstart.md

## Quick start

- We're going to use Terraform to deploy a basic Kubernetes cluster

- We will need cloud credentials (make sure you have a valid cloud account!)

---

## Steps

1. Install Terraform (download a single Go binary)

2. Configure credentials (e.g. `gcloud auth login`)

3. Create a Terraform *configuration*

4. Add *providers* to the configuration

5. Initialize providers with `terraform init`

6. Add *resources* to the configuration

7. Realize the resources with `terraform apply`

8. Repeat 6-7 or 4-5-6-7

new file: slides/terraform/state.md

## State

- Terraform keeps track of the *state*

- Resources created by Terraform are added to the state

- When we run Terraform, it will:

  - *refresh* the state (check if resources have changed since last time it ran)

  - generate a *plan* (decide which actions need to be taken)

  - ask for confirmation (this can be skipped)

  - *apply* that plan

---

## Remote state

- By default, the state is stored in `terraform.tfstate`

- This is a JSON file (feel free to inspect it!)

- The state can also be stored in a central place (sketched below)

  (e.g. cloud object store, Consul, etcd...)

- This is more convenient when working as a team

- It also requires *locking*

  (to prevent concurrent modifications)
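
For instance, storing the state in a GCS bucket can look like this (a sketch; the bucket name and prefix are hypothetical, and as the next slide explains, the bucket must already exist):

```tf
terraform {
  backend "gcs" {
    bucket = "my-terraform-state"   # must be created beforehand
    prefix = "clusters/klstr"       # path of the state within the bucket
  }
}
```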

---

## Working with remote state

- This is beyond the scope of this workshop

- Note that if a Terraform configuration defines e.g. an S3 bucket to store its state...

  ...that configuration cannot create that S3 bucket!

- The bucket must be created beforehand

  (Terraform won't be able to run until the bucket is available)

---

## Manipulating state

`terraform state list`

(list the resources tracked in the state)

`terraform state show google_container_cluster.mycluster`

(show the attributes of one resource in the state)

`terraform state rm`

(remove a resource from the state, without destroying the actual resource)

`terraform import`

(bring an existing resource under Terraform management by adding it to the state)

new file: slides/terraform/variables.md

## Variables

- At this point, we are probably:

  - duplicating a lot of information (e.g. zone, number of nodes...)

  - hard-coding a lot of things as well (ditto!)

- Let's see how we can do better!

---

## [Input variables](https://www.terraform.io/language/values/variables)

Declaring an input variable:
```tf
variable "location" {
  type    = string
  default = "europe-north1-a"
}
```

Using an input variable:
```tf
resource "google_container_cluster" "mycluster" {
  location = var.location
  ...
}
```

---

## Setting variables

Input variables can be set with:

- environment variables (`export TF_VAR_location=us-west1`)

- a file named `terraform.tfvars` (`location = "us-west1"`)

- a file named `terraform.tfvars.json`

- files named `*.auto.tfvars` and `*.auto.tfvars.json` (example below)

- command-line literal values (`-var location=us-west1`)

- command-line file names (`-var-file carbon-neutral.tfvars`)

Entries lower in this list take precedence over the ones above them.
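
For instance, a hypothetical `staging.auto.tfvars` file that `terraform plan` and `terraform apply` pick up automatically:

```tf
# staging.auto.tfvars: loaded automatically, no -var-file flag needed
location = "us-west1-a"
```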

---

## [Local values](https://www.terraform.io/language/values/locals)

Declaring and setting a local value:
```tf
locals {
  location = var.location != null ? var.location : "europe-north1-a"
  region   = replace(local.location, "/-[a-z]$/", "")
}
```

We can have multiple `locals` blocks.

Using a local value:
```tf
resource "google_container_cluster" "mycluster" {
  location = local.location
  ...
}
```