🏭️ Refactor stateful apps content

This commit is contained in:
Jérôme Petazzoni
2021-11-20 22:00:50 +01:00
parent 93d8a23c81
commit 52015b81fe
15 changed files with 1426 additions and 1363 deletions

k8s/mounter.yaml Normal file

@@ -0,0 +1,20 @@
kind: Pod
apiVersion: v1
metadata:
  generateName: mounter-
  labels:
    container.training/mounter: ""
spec:
  volumes:
  - name: pvc
    persistentVolumeClaim:
      claimName: my-pvc-XYZ45   # update this to match the name of the PVC to inspect
  containers:
  - name: mounter
    image: alpine
    stdin: true                 # keep stdin open and allocate a TTY,
    tty: true                   # so we can interact with `kubectl attach -ti`
    volumeMounts:
    - name: pvc
      mountPath: /pvc
    workingDir: /pvc

k8s/pv.yaml Normal file

@@ -0,0 +1,20 @@
kind: PersistentVolume
apiVersion: v1
metadata:
  generateName: my-pv-
  labels:
    container.training/pv: ""
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  capacity:
    storage: 1G
  hostPath:
    path: /tmp/my-pv
  #storageClassName: my-sc
  #claimRef:
  #  kind: PersistentVolumeClaim
  #  apiVersion: v1
  #  namespace: default
  #  name: my-pvc-XYZ45

k8s/pvc.yaml Normal file

@@ -0,0 +1,13 @@
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  generateName: my-pvc-
  labels:
    container.training/pvc: ""
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
  #storageClassName: my-sc

slides/k8s/consul.md Normal file

@@ -0,0 +1,228 @@
# Running a Consul cluster
- Here is a good use-case for Stateful sets!
- We are going to deploy a Consul cluster with 3 nodes
- Consul is a highly-available key/value store
(like etcd or Zookeeper)
- One easy way to bootstrap a cluster is to tell each node:
- the addresses of other nodes
- how many nodes are expected (to know when quorum is reached)
---
## Bootstrapping a Consul cluster
*After reading the Consul documentation carefully (and/or asking around),
we figure out the minimal command-line to run our Consul cluster.*
```
consul agent -data-dir=/consul/data -client=0.0.0.0 -server -ui \
-bootstrap-expect=3 \
-retry-join=`X.X.X.X` \
-retry-join=`Y.Y.Y.Y`
```
- Replace X.X.X.X and Y.Y.Y.Y with the addresses of other nodes
- A node can add its own address (it will work fine)
- ... Which means that we can use the same command-line on all nodes (convenient!)
---
## Cloud Auto-join
- Since version 1.4.0, Consul can use the Kubernetes API to find its peers
- This is called [Cloud Auto-join]
- Instead of passing an IP address, we need to pass a parameter like this:
```
consul agent -retry-join "provider=k8s label_selector=\"app=consul\""
```
- Consul needs to be able to talk to the Kubernetes API
- We can provide a `kubeconfig` file
- If Consul runs in a pod, it will use the *service account* of the pod
[Cloud Auto-join]: https://www.consul.io/docs/agent/cloud-auto-join.html#kubernetes-k8s-
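In the pod template, using that service account is a one-liner (a sketch; the name is illustrative):
```yaml
spec:
  serviceAccountName: consul   # illustrative; must match the ServiceAccount we create
```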
---
## Setting up Cloud auto-join
- We need to create a service account for Consul
- We need to create a role that can `list` and `get` pods
- We need to bind that role to the service account
- And of course, we need to make sure that Consul pods use that service account
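As a sketch, the Role could look like this (names are illustrative; the actual definitions are in `k8s/consul-1.yaml`):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: consul            # illustrative name
rules:
- apiGroups: [ "" ]       # "" is the core API group (where Pods live)
  resources: [ "pods" ]
  verbs: [ "get", "list" ]
```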
---
## Putting it all together
- The file `k8s/consul-1.yaml` defines the required resources
(service account, role, role binding, service, stateful set)
- Inspired by this [excellent tutorial](https://github.com/kelseyhightower/consul-on-kubernetes) by Kelsey Hightower
(many features from the original tutorial were removed for simplicity)
---
## Running our Consul cluster
- We'll use the provided YAML file
.exercise[
- Create the stateful set and associated service:
```bash
kubectl apply -f ~/container.training/k8s/consul-1.yaml
```
- Check the logs as the pods come up one after another:
```bash
stern consul
```
<!--
```wait Synced node info```
```key ^C```
-->
- Check the health of the cluster:
```bash
kubectl exec consul-0 -- consul members
```
]
---
## Caveats
- The scheduler may place two Consul pods on the same node
- if that node fails, we lose two Consul pods at the same time
- this will cause the cluster to fail
- Scaling down the cluster will cause it to fail
- when a Consul member leaves the cluster, it needs to inform the others
- otherwise, the last remaining node doesn't have quorum and stops functioning
- This Consul cluster doesn't use real persistence yet
- data is stored in the containers' ephemeral filesystem
- if a pod fails, its replacement starts from a blank slate
---
## Improving pod placement
- We need to tell the scheduler:
*do not put two of these pods on the same node!*
- This is done with an `affinity` section like the following one:
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: consul
      topologyKey: kubernetes.io/hostname
```
---
## Using a lifecycle hook
- When a Consul member leaves the cluster, it needs to execute:
```bash
consul leave
```
- This is done with a `lifecycle` section like the following one:
```yaml
lifecycle:
  preStop:
    exec:
      command: [ "sh", "-c", "consul leave" ]
```
---
## Running a better Consul cluster
- Let's try to add the scheduling constraint and lifecycle hook
- We can do that in the same namespace or another one (as we like)
- If we do that in the same namespace, we will see a rolling update
(pods will be replaced one by one)
.exercise[
- Deploy a better Consul cluster:
```bash
kubectl apply -f ~/container.training/k8s/consul-2.yaml
```
]
---
## Still no persistence, though
- We aren't using actual persistence yet
(no `volumeClaimTemplate`, Persistent Volume, etc.)
- What happens if we lose a pod?
- a new pod gets rescheduled (with an empty state)
- the new pod tries to connect to the two others
- it will be accepted (after 1-2 minutes of instability)
- and it will retrieve the data from the other pods
---
## Failure modes
- What happens if we lose two pods?
- manual repair will be required
- we will need to instruct the remaining one to act solo
- then rejoin new pods
- What happens if we lose three pods? (aka all of them)
- we lose all the data (ouch)
???
:EN:- Scheduling pods together or separately
:EN:- Example: deploying a Consul cluster
:FR:- Lancer des pods ensemble ou séparément
:FR:- Exemple : lancer un cluster Consul

slides/k8s/local-persistent-volumes.md Deleted file

@@ -1,251 +0,0 @@
# Local Persistent Volumes
- We want to run that Consul cluster *and* actually persist data
- But we don't have a distributed storage system
- We are going to use local volumes instead
(similar conceptually to `hostPath` volumes)
- We can use local volumes without installing extra plugins
- However, they are tied to a node
- If that node goes down, the volume becomes unavailable
---
## With or without dynamic provisioning
- We will deploy a Consul cluster *with* persistence
- That cluster's StatefulSet will create PVCs
- These PVCs will remain unbound¹ until we create local volumes manually
(we will basically do the job of the dynamic provisioner)
- Then, we will see how to automate that with a dynamic provisioner
.footnote[¹Unbound = without an associated Persistent Volume.]
---
## If we have a dynamic provisioner ...
- The labs in this section assume that we *do not* have a dynamic provisioner
- If we do have one, we need to disable it
.exercise[
- Check if we have a dynamic provisioner:
```bash
kubectl get storageclass
```
- If the output contains a line with `(default)`, run this command:
```bash
kubectl annotate sc storageclass.kubernetes.io/is-default-class- --all
```
- Check again that it is no longer marked as `(default)`
]
---
## Deploying Consul
- Let's use a new manifest for our Consul cluster
- The only differences between that file and the previous one are:
- `volumeClaimTemplate` defined in the Stateful Set spec
- the corresponding `volumeMounts` in the Pod spec
.exercise[
- Apply the persistent Consul YAML file:
```bash
kubectl apply -f ~/container.training/k8s/consul-3.yaml
```
]
---
## Observing the situation
- Let's look at Persistent Volume Claims and Pods
.exercise[
- Check that we now have an unbound Persistent Volume Claim:
```bash
kubectl get pvc
```
- We don't have any Persistent Volume:
```bash
kubectl get pv
```
- The Pod `consul-0` is not scheduled yet:
```bash
kubectl get pods -o wide
```
]
*Hint: leave these commands running with `-w` in different windows.*
---
## Explanations
- In a Stateful Set, the Pods are started one by one
- `consul-1` won't be created until `consul-0` is running
- `consul-0` has a dependency on an unbound Persistent Volume Claim
- The scheduler won't schedule the Pod until the PVC is bound
(because the PVC might be bound to a volume that is only available on a subset of nodes; for instance EBS volumes are tied to an availability zone)
---
## Creating Persistent Volumes
- Let's create 3 local directories (`/mnt/consul`) on node2, node3, node4
- Then create 3 Persistent Volumes corresponding to these directories
.exercise[
- Create the local directories:
```bash
for NODE in node2 node3 node4; do
ssh $NODE sudo mkdir -p /mnt/consul
done
```
- Create the PV objects:
```bash
kubectl apply -f ~/container.training/k8s/volumes-for-consul.yaml
```
]
---
## Check our Consul cluster
- The PVs that we created will be automatically matched with the PVCs
- Once a PVC is bound, its pod can start normally
- Once the pod `consul-0` has started, `consul-1` can be created, etc.
- Eventually, our Consul cluster is up, and backed by "persistent" volumes
.exercise[
- Check that our Consul cluster indeed has 3 members:
```bash
kubectl exec consul-0 -- consul members
```
]
---
## Devil is in the details (1/2)
- The size of the Persistent Volumes is bogus
(it is used when matching PVs and PVCs together, but there is no actual quota or limit)
---
## Devil is in the details (2/2)
- This specific example worked because we had exactly 1 free PV per node:
- if we had created multiple PVs per node ...
- we could have ended with two PVCs bound to PVs on the same node ...
- which would have required two pods to be on the same node ...
- which is forbidden by the anti-affinity constraints in the StatefulSet
- To avoid that, we need to associate the PVs with a Storage Class that has:
```yaml
volumeBindingMode: WaitForFirstConsumer
```
(this means that a PVC will be bound to a PV only after being used by a Pod)
- See [this blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) for more details
---
## Bulk provisioning
- It's not practical to manually create directories and PVs for each app
- We *could* pre-provision a number of PVs across our fleet
- We could even automate that with a Daemon Set:
- creating a number of directories on each node
- creating the corresponding PV objects
- We also need to recycle volumes
- ... This can quickly get out of hand
---
## Dynamic provisioning
- We could also write our own provisioner, which would:
- watch the PVCs across all namespaces
- when a PVC is created, create a corresponding PV on a node
- Or we could use one of the dynamic provisioners for local persistent volumes
(for instance the [Rancher local path provisioner](https://github.com/rancher/local-path-provisioner))
---
## Strategies for local persistent volumes
- Remember, when a node goes down, the volumes on that node become unavailable
- High availability will require another layer of replication
(like what we've just seen with Consul; or primary/secondary; etc)
- Pre-provisioning PVs makes sense for machines with local storage
(e.g. cloud instance storage; or storage directly attached to a physical machine)
- Dynamic provisioning makes sense for large number of applications
(when we can't or won't dedicate a whole disk to a volume)
- It's possible to mix both (using distinct Storage Classes)
???
:EN:- Static vs dynamic volume provisioning
:EN:- Example: local persistent volume provisioner
:FR:- Création statique ou dynamique de volumes
:FR:- Exemple : création de volumes locaux

slides/k8s/openebs.md

@@ -321,207 +321,13 @@ EOF
---
## We're ready now!
- We have a StorageClass that can provision PersistentVolumes
- These PersistentVolumes will be replicated across nodes
- They should be able to withstand single-node failures
???

slides/k8s/portworx.md

@@ -1,42 +1,4 @@
# Portworx
- Portworx is a *commercial* persistent storage solution for containers
@@ -60,7 +22,7 @@
- We're installing Portworx because we need a storage system
- If you are using AKS, EKS, GKE, Kapsule ... you already have a storage system
(but you might want another one, e.g. to leverage local storage)
@@ -301,364 +263,6 @@ parameters:
---
## Our Postgres Stateful set
- The next slide shows `k8s/postgres.yaml`
- It defines a Stateful set
- With a `volumeClaimTemplate` requesting a 1 GB volume
- That volume will be mounted to `/var/lib/postgresql/data`
- There is another little detail: we enable the `stork` scheduler
- The `stork` scheduler is optional (it's specific to Portworx)
- It helps the Kubernetes scheduler to colocate the pod with its volume
(see [this blog post](https://portworx.com/stork-storage-orchestration-kubernetes/) for more details about that)
---
.small[
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  serviceName: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      schedulerName: stork
      containers:
      - name: postgres
        image: postgres:12
        env:
        - name: POSTGRES_HOST_AUTH_METHOD
          value: trust
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: postgres
  volumeClaimTemplates:
  - metadata:
      name: postgres
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```
]
---
## Creating the Stateful set
- Before applying the YAML, watch what's going on with `kubectl get events -w`
.exercise[
- Apply that YAML:
```bash
kubectl apply -f ~/container.training/k8s/postgres.yaml
```
<!-- ```hide kubectl wait pod postgres-0 --for condition=ready``` -->
]
---
## Testing our PostgreSQL pod
- We will use `kubectl exec` to get a shell in the pod
- Good to know: we need to use the `postgres` user in the pod
.exercise[
- Get a shell in the pod, as the `postgres` user:
```bash
kubectl exec -ti postgres-0 -- su postgres
```
<!--
autopilot prompt detection expects $ or # at the beginning of the line.
```wait postgres@postgres```
```keys PS1="\u@\h:\w\n\$ "```
```key ^J```
-->
- Check that default databases have been created correctly:
```bash
psql -l
```
]
(This should show us 3 lines: postgres, template0, and template1.)
---
## Inserting data in PostgreSQL
- We will create a database and populate it with `pgbench`
.exercise[
- Create a database named `demo`:
```bash
createdb demo
```
- Populate it with `pgbench`:
```bash
pgbench -i demo
```
]
- The `-i` flag means "create tables"
- If you want more data in the test tables, add e.g. `-s 10` (to get 10x more rows)
---
## Checking how much data we have now
- The `pgbench` tool inserts rows in table `pgbench_accounts`
.exercise[
- Check that the `demo` database exists:
```bash
psql -l
```
- Check how many rows we have in `pgbench_accounts`:
```bash
psql demo -c "select count(*) from pgbench_accounts"
```
- Check that `pgbench_history` is currently empty:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
---
## Testing the load generator
- Let's use `pgbench` to generate a few transactions
.exercise[
- Run `pgbench` for 10 seconds, reporting progress every second:
```bash
pgbench -P 1 -T 10 demo
```
- Check the size of the history table now:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
Note: on small cloud instances, a typical speed is about 100 transactions/second.
---
## Generating transactions
- Now let's use `pgbench` to generate more transactions
- While it's running, we will disrupt the database server
.exercise[
- Run `pgbench` for 10 minutes, reporting progress every second:
```bash
pgbench -P 1 -T 600 demo
```
- You can use a longer time period if you need more time to run the next steps
<!-- ```tmux split-pane -h``` -->
]
---
## Find out which node is hosting the database
- We can find that information with `kubectl get pods -o wide`
.exercise[
- Check the node running the database:
```bash
kubectl get pod postgres-0 -o wide
```
]
We are going to disrupt that node.
--
By "disrupt" we mean: "disconnect it from the network".
---
## Disconnect the node
- We will use `iptables` to block all traffic exiting the node
(except SSH traffic, so we can repair the node later if needed)
.exercise[
- SSH to the node to disrupt:
```bash
ssh `nodeX`
```
- Allow SSH traffic leaving the node, but block all other traffic:
```bash
sudo iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -I OUTPUT 2 -j DROP
```
]
---
## Check that the node is disconnected
.exercise[
- Check that the node can't communicate with other nodes:
```
ping node1
```
- Logout to go back on `node1`
<!-- ```key ^D``` -->
- Watch the events unfolding with `kubectl get events -w` and `kubectl get pods -w`
]
- It will take some time for Kubernetes to mark the node as unhealthy
- Then it will attempt to reschedule the pod to another node
- In about a minute, our pod should be up and running again
---
## Check that our data is still available
- We are going to reconnect to the (new) pod and check
.exercise[
- Get a shell on the pod:
```bash
kubectl exec -ti postgres-0 -- su postgres
```
<!--
```wait postgres@postgres```
```keys PS1="\u@\h:\w\n\$ "```
```key ^J```
-->
- Check how many transactions are now in the `pgbench_history` table:
```bash
psql demo -c "select count(*) from pgbench_history"
```
<!-- ```key ^D``` -->
]
If the 10-second test that we ran earlier gave e.g. 80 transactions per second,
and we failed the node after 30 seconds, we should have about 2400 rows in that table.
---
## Double-check that the pod has really moved
- Just to make sure the system is not bluffing!
.exercise[
- Look at which node the pod is now running on
```bash
kubectl get pod postgres-0 -o wide
```
]
---
## Re-enable the node
- Let's fix the node that we disconnected from the network
.exercise[
- SSH to the node:
```bash
ssh `nodeX`
```
- Remove the iptables rule blocking traffic:
```bash
sudo iptables -D OUTPUT 2
```
]
---
class: extra-details
## A few words about this PostgreSQL setup
- In a real deployment, you would want to set a password
- This can be done by creating a `secret`:
```
kubectl create secret generic postgres \
--from-literal=password=$(base64 /dev/urandom | head -c16)
```
- And then passing that secret to the container:
```yaml
env:
- name: POSTGRES_PASSWORD
  valueFrom:
    secretKeyRef:
      name: postgres
      key: password
```
---
class: extra-details
## Troubleshooting Portworx
@@ -666,7 +270,7 @@ class: extra-details
- If we need to see what's going on with Portworx:
```
PXPOD=$(kubectl -n kube-system get pod -l name=portworx -o json |
  jq -r .items[0].metadata.name)
kubectl -n kube-system exec $PXPOD -- /opt/pwx/bin/pxctl status
```
@@ -709,26 +313,6 @@ class: extra-details
---
class: extra-details
## Dynamic provisioning without a provider
- What if we want to use Stateful sets without a storage provider?
- We will have to create volumes manually
(by creating Persistent Volume objects)
- These volumes will be automatically bound with matching Persistent Volume Claims
- We can use local volumes (essentially bind mounts of host directories)
- Of course, these volumes won't be available in case of node failure
- Check [this blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) for more information and gotchas
---
## Acknowledgements
The Portworx installation tutorial, and the PostgreSQL example,
@@ -748,8 +332,5 @@ were inspired by [Portworx examples on Katacoda](https://katacoda.com/portworx/s
???
:EN:- Using highly available persistent volumes
:EN:- Example: deploying a database that can withstand node outages
:FR:- Utilisation de volumes à haute disponibilité
:FR:- Exemple : déployer une base de données survivant à la défaillance d'un nœud
:EN:- Hyperconverged storage with Portworx
:FR:- Stockage hyperconvergé avec Portworx

slides/k8s/pv-pvc-sc.md Normal file

@@ -0,0 +1,323 @@
# PV, PVC, and Storage Classes
- When an application needs storage, it creates a PersistentVolumeClaim
(either directly, or through a volume claim template in a Stateful Set)
- The PersistentVolumeClaim is initially `Pending`
- Kubernetes then looks for a suitable PersistentVolume
(maybe one is immediately available; maybe we need to wait for provisioning)
- Once a suitable PersistentVolume is found, the PVC becomes `Bound`
- The PVC can then be used by a Pod
(as long as the PVC is `Pending`, the Pod cannot run)
---
## Access modes
- PV and PVC have *access modes*:
- ReadWriteOnce (only one node can access the volume at a time)
- ReadWriteMany (multiple nodes can access the volume simultaneously)
- ReadOnlyMany (multiple nodes can access, but they can't write)
- ReadWriteOncePod (only one pod can access the volume; new in Kubernetes 1.22)
- A PV lists the access modes that it supports
- A PVC lists the access modes that it requires
⚠️ A PV with only ReadWriteMany won't satisfy a PVC with ReadWriteOnce!
---
## Capacity
- A PVC must express a storage size request
(field `spec.resources.requests.storage`, in bytes)
- A PV must express its size
(field `spec.capacity.storage`, in bytes)
- Kubernetes will only match a PV and PVC if the PV is big enough
- These fields are only used for "matchmaking" purposes:
- nothing prevents the Pod mounting the PVC from using more space
- nothing requires the PV to actually be that big
---
## Storage Class
- What if we have multiple storage systems available?
(e.g. NFS and iSCSI; or AzureFile and AzureDisk; or Cinder and Ceph...)
- What if we have a storage system with multiple tiers?
(e.g. SAN with RAID1 and RAID5; general purpose vs. io optimized EBS...)
- Kubernetes lets us define *storage classes* to represent these
(see if you have any available at the moment with `kubectl get storageclasses`)
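For example, a class for the in-tree AWS EBS provisioner might look like this (a sketch; the provisioner and its parameters depend on the storage system):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                          # hypothetical name
provisioner: kubernetes.io/aws-ebs    # which controller creates the volumes
parameters:
  type: gp2                           # provisioner-specific parameter
```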
---
## Using storage classes
- Optionally, each PV and each PVC can reference a StorageClass
(field `spec.storageClassName`)
- When creating a PVC, specifying a StorageClass means
“use that particular storage system to provision the volume!”
- Storage classes are necessary for [dynamic provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/)
(but we can also ignore them and perform manual provisioning)
---
## Default storage class
- We can define a *default storage class*
(by annotating it with `storageclass.kubernetes.io/is-default-class=true`)
- When a PVC is created,
**IF** it doesn't indicate which storage class to use
**AND** there is a default storage class
**THEN** the PVC `storageClassName` is set to the default storage class
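Concretely, the default class is simply the one carrying this annotation (set at creation time, or later with `kubectl annotate`):
```yaml
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
```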
---
## Additional constraints
- A PersistentVolumeClaim can also specify a volume selector
(referring to labels on the PV)
- A PersistentVolume can also be created with a `claimRef`
(indicating to which PVC it should be bound)
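For instance, a PVC restricted to PVs carrying a specific label could look like this sketch (the `disk` label is hypothetical; a `claimRef` example is shown commented out in `k8s/pv.yaml` above):
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-claim
spec:
  accessModes: [ ReadWriteOnce ]
  resources:
    requests:
      storage: 1G
  selector:
    matchLabels:
      disk: ssd        # hypothetical label that the PV must carry
```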
---
class: extra-details
## Which PV gets associated to a PVC?
- The PV must be `Available`
- The PV must satisfy the PVC constraints
(access mode, size, optional selector, optional storage class)
- The PVs with the closest access mode are picked
- Then the PVs with the closest size
- It is possible to specify a `claimRef` when creating a PV
(this will associate it to the specified PVC, but only if the PV satisfies all the requirements of the PVC; otherwise another PV might end up being picked)
- For all the details about the PersistentVolumeClaimBinder, check [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/persistent-storage.md#matching-and-binding)
---
## Creating a PVC
- Let's create a standalone PVC and see what happens!
.exercise[
- Check if we have a StorageClass:
```bash
kubectl get storageclasses
```
- Create the PVC:
```bash
kubectl create -f ~/container.training/k8s/pvc.yaml
```
- Check the PVC:
```bash
kubectl get pvc
```
]
---
## Four possibilities
1. If we have a default StorageClass with *immediate* binding:
*a PV was created and associated to the PVC*
2. If we have a default StorageClass that *waits for first consumer*:
*the PVC is still `Pending` but has a `STORAGECLASS`* ⚠️
3. If we don't have a default StorageClass:
*the PVC is still `Pending`, without a `STORAGECLASS`*
4. If we have a StorageClass, but it doesn't work:
*the PVC is still `Pending` but has a `STORAGECLASS`* ⚠️
---
## Immediate vs WaitForFirstConsumer
- Immediate = as soon as there is a `Pending` PVC, create a PV
- What if:
- the PV is only available on a node (e.g. local volume)
- ...or on a subset of nodes (e.g. SAN HBA, EBS AZ...)
- the Pod that will use the PVC has scheduling constraints
- these constraints turn out to be incompatible with the PV
- WaitForFirstConsumer = don't provision the PV until a Pod mounts the PVC
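A Storage Class for manually provisioned local volumes would typically use that mode; here is a minimal sketch:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-sc                             # hypothetical name
provisioner: kubernetes.io/no-provisioner    # this class won't provision dynamically
volumeBindingMode: WaitForFirstConsumer      # bind only when a Pod needs the PVC
```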
---
## Using the PVC
- Let's mount the PVC in a Pod
- We will use a stray Pod (no Deployment, StatefulSet, etc.)
- We will use @@LINK[k8s/mounter.yaml], shown on the next slide
- We'll need to update the `claimName`! ⚠️
---
```yaml
@@INCLUDE[k8s/mounter.yaml]
```
---
## Running the Pod
.exercise[
- Edit the `mounter.yaml` manifest
- Update the `claimName` to match the name of our PVC
- Create the Pod
- Check the status of the PV and PVC
]
Note: this "mounter" Pod can be useful to inspect the content of a PVC.
---
## Scenario 1 & 2
If we have a default Storage Class that can provision PVC dynamically...
- We should now have a new PV
- The PV and the PVC should be `Bound` together
---
## Scenario 3
If we don't have a default Storage Class, we must create the PV manually.
```bash
kubectl create -f ~/container.training/k8s/pv.yaml
```
After a few seconds, check that the PV and PVC are bound:
```bash
kubectl get pv,pvc
```
---
## Scenario 4
If our default Storage Class can't provision a PV, let's do it manually.
The PV must specify the correct `storageClassName`.
```bash
STORAGECLASS=$(kubectl get pvc --selector=container.training/pvc \
-o jsonpath={..storageClassName})
kubectl patch -f ~/container.training/k8s/pv.yaml --dry-run=client -o yaml \
--patch '{"spec": {"storageClassName": "'$STORAGECLASS'"}}' \
| kubectl create -f-
```
Check that the PV and PVC are bound:
```bash
kubectl get pv,pvc
```
---
## Checking the Pod
- If the PVC was `Pending`, then the Pod was `Pending` too
- Once the PVC is `Bound`, the Pod can be scheduled and can run
- Once the Pod is `Running`, check it out with `kubectl attach -ti`
---
## PV and PVC lifecycle
- We can't delete a PV if it's `Bound`
- If we `kubectl delete` it, it goes to `Terminating` state
- We can't delete a PVC if it's in use by a Pod
- Likewise, if we `kubectl delete` it, it goes to `Terminating` state
- Deletion is prevented by *finalizers*
(=like a post-it note saying “don't delete me!”)
- When the mounting Pods are deleted, their PVCs are freed up
- When PVCs are deleted, their PVs are freed up
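For example, a PVC that is in use typically carries this finalizer (visible with `kubectl get pvc -o yaml`):
```yaml
metadata:
  finalizers:
  - kubernetes.io/pvc-protection   # removed by the control plane once no Pod uses the PVC
```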
???
:EN:- Storage provisioning
:EN:- PV, PVC, StorageClass
:FR:- Création de volumes
:FR:- PV, PVC, et StorageClass

slides/k8s/stateful-failover.md Normal file

@@ -0,0 +1,468 @@
# Stateful failover
- How can we achieve true durability?
- How can we store data that would survive the loss of a node?
--
- We need to use Persistent Volumes backed by highly available storage systems
- There are many ways to achieve that:
- leveraging our cloud's storage APIs
- using NAS/SAN systems or file servers
- distributed storage systems
---
## Our test scenario
- We will deploy a SQL database (PostgreSQL)
- We will insert some test data in the database
- We will disrupt the node running the database
- We will see how it recovers
---
## Our Postgres Stateful set
- The next slide shows `k8s/postgres.yaml`
- It defines a Stateful set
- With a `volumeClaimTemplate` requesting a 1 GB volume
- That volume will be mounted to `/var/lib/postgresql/data`
---
.small[.small[
```yaml
@@INCLUDE[k8s/postgres.yaml]
```
]]
---
## Creating the Stateful set
- Before applying the YAML, watch what's going on with `kubectl get events -w`
.exercise[
- Apply that YAML:
```bash
kubectl apply -f ~/container.training/k8s/postgres.yaml
```
<!-- ```hide kubectl wait pod postgres-0 --for condition=ready``` -->
]
---
## Testing our PostgreSQL pod
- We will use `kubectl exec` to get a shell in the pod
- Good to know: we need to use the `postgres` user in the pod
.exercise[
- Get a shell in the pod, as the `postgres` user:
```bash
kubectl exec -ti postgres-0 -- su postgres
```
<!--
autopilot prompt detection expects $ or # at the beginning of the line.
```wait postgres@postgres```
```keys PS1="\u@\h:\w\n\$ "```
```key ^J```
-->
- Check that default databases have been created correctly:
```bash
psql -l
```
]
(This should show us 3 lines: postgres, template0, and template1.)
---
## Inserting data in PostgreSQL
- We will create a database and populate it with `pgbench`
.exercise[
- Create a database named `demo`:
```bash
createdb demo
```
- Populate it with `pgbench`:
```bash
pgbench -i demo
```
]
- The `-i` flag means "create tables"
- If you want more data in the test tables, add e.g. `-s 10` (to get 10x more rows)
---
## Checking how much data we have now
- The `pgbench` tool inserts rows in table `pgbench_accounts`
.exercise[
- Check that the `demo` database exists:
```bash
psql -l
```
- Check how many rows we have in `pgbench_accounts`:
```bash
psql demo -c "select count(*) from pgbench_accounts"
```
- Check that `pgbench_history` is currently empty:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
---
## Testing the load generator
- Let's use `pgbench` to generate a few transactions
.exercise[
- Run `pgbench` for 10 seconds, reporting progress every second:
```bash
pgbench -P 1 -T 10 demo
```
- Check the size of the history table now:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
Note: on small cloud instances, a typical speed is about 100 transactions/second.
---
## Generating transactions
- Now let's use `pgbench` to generate more transactions
- While it's running, we will disrupt the database server
.exercise[
- Run `pgbench` for 10 minutes, reporting progress every second:
```bash
pgbench -P 1 -T 600 demo
```
- You can use a longer time period if you need more time to run the next steps
<!-- ```tmux split-pane -h``` -->
]
---
## Find out which node is hosting the database
- We can find that information with `kubectl get pods -o wide`
.exercise[
- Check the node running the database:
```bash
kubectl get pod postgres-0 -o wide
```
]
We are going to disrupt that node.
--
By "disrupt" we mean: "disconnect it from the network".
---
## Node failover
⚠️ This will partially break your cluster!
- We are going to disconnect the node running PostgreSQL from the cluster
- We will see what happens, and how to recover
- We will not reconnect the node to the cluster
- This whole lab will take at least 10-15 minutes (due to various timeouts)
⚠️ Only do this lab at the very end, when you don't want to run anything else after!
---
## Disconnecting the node from the cluster
.exercise[
- Find out where the Pod is running, and SSH into that node:
```bash
kubectl get pod postgres-0 -o jsonpath={.spec.nodeName}
ssh nodeX
```
- Check the name of the network interface:
```bash
sudo ip route ls default
```
- The output should look like this:
```
default via 10.10.0.1 `dev ensX` proto dhcp src 10.10.0.13 metric 100
```
- Shutdown the network interface:
```bash
sudo ip link set ensX down
```
]
---
class: extra-details
## Another way to disconnect the node
- We can also use `iptables` to block all traffic exiting the node
(except SSH traffic, so we can repair the node later if needed)
.exercise[
- SSH to the node to disrupt:
```bash
ssh `nodeX`
```
- Allow SSH traffic leaving the node, but block all other traffic:
```bash
sudo iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -I OUTPUT 2 -j DROP
```
]
---
## Watch what's going on
- Let's look at the status of Nodes, Pods, and Events
.exercise[
- In a first pane/tab/window, check Nodes and Pods:
```bash
watch kubectl get nodes,pods -o wide
```
- In another pane/tab/window, check Events:
```bash
kubectl get events --watch
```
]
---
## Node Ready → NotReady
- After \~30 seconds, the control plane stops receiving heartbeats from the Node
- The Node is marked NotReady
- It is not *schedulable* anymore
(the scheduler won't place new pods there, except some special cases)
- All Pods on that Node are also *not ready*
(they get removed from service Endpoints)
- ... But nothing else happens for now
(the control plane is waiting: maybe the Node will come back shortly?)
---
## Pod eviction
- After \~5 minutes, the control plane will evict most Pods from the Node
- These Pods are now `Terminating`
- The Pods controlled by e.g. ReplicaSets are automatically moved
(or rather: new Pods are created to replace them)
- But nothing happens to the Pods controlled by StatefulSets at this point
(they remain `Terminating` forever)
- Why? 🤔
--
- This is to avoid *split brain scenarios*
---
class: extra-details
## Split brain 🧠⚡️🧠
- Imagine that we create a replacement pod `postgres-0` on another Node
- And 15 minutes later, the Node is reconnected and the original `postgres-0` comes back
- Which one is the "right" one?
- What if they have conflicting data?
😱
- We *cannot* let that happen!
- Kubernetes won't do it
- ... Unless we tell it to
---
## The Node is gone
- One thing we can do is tell Kubernetes "the Node won't come back"
(there are other methods; but this one is the simplest one here)
- This is done with a simple `kubectl delete node`
.exercise[
- `kubectl delete` the Node that we disconnected
]
---
## Pod rescheduling
- Kubernetes removes the Node
- After a brief period of time (\~1 minute) the "Terminating" Pods are removed
- A replacement Pod is created on another Node
- ... But it doesn't start yet!
- Why? 🤔
---
## Multiple attachment
- By default, a disk can only be attached to one Node at a time
(sometimes it's a hardware or API limitation; sometimes enforced in software)
- In our Events, we should see `FailedAttachVolume` and `FailedMount` messages
- After \~5 more minutes, the disk will be force-detached from the old Node
- ... Which will allow attaching it to the new Node!
🎉
- The Pod will then be able to start
- Failover is complete!
---
## Check that our data is still available
- We are going to reconnect to the (new) pod and check
.exercise[
- Get a shell on the pod:
```bash
kubectl exec -ti postgres-0 -- su postgres
```
<!--
```wait postgres@postgres```
```keys PS1="\u@\h:\w\n\$ "```
```key ^J```
-->
- Check how many transactions are now in the `pgbench_history` table:
```bash
psql demo -c "select count(*) from pgbench_history"
```
<!-- ```key ^D``` -->
]
If the 10-second test that we ran earlier gave e.g. 80 transactions per second,
and we failed the node after 30 seconds, we should have about 2400 rows in that table.
---
## Double-check that the pod has really moved
- Just to make sure the system is not bluffing!
.exercise[
- Look at which node the pod is now running on
```bash
kubectl get pod postgres-0 -o wide
```
]
???
:EN:- Using highly available persistent volumes
:EN:- Example: deploying a database that can withstand node outages
:FR:- Utilisation de volumes à haute disponibilité
:FR:- Exemple : déployer une base de données survivant à la défaillance d'un nœud

slides/k8s/statefulsets.md

@@ -6,7 +6,7 @@
- They offer mechanisms to deploy scaled stateful applications
- At a first glance, they look like Deployments:
- a stateful set defines a pod spec and a number of replicas *R*
@@ -182,503 +182,30 @@ spec:
- These pods can each have their own persistent storage
(Deployments cannot do that)
---
## Obtaining per-pod storage
- Stateful Sets can have *persistent volume claim templates*
(declared in `spec.volumeClaimTemplates` in the Stateful set manifest)
- A claim template will create one Persistent Volume Claim per pod
(the PVC will be named `<claim-name>-<stateful-set-name>-<pod-index>`)
- Persistent Volume Claims are matched 1-to-1 with Persistent Volumes
- Persistent Volume provisioning can be done:
- automatically (by leveraging *dynamic provisioning* with a Storage Class)
- manually (human operator creates the volumes ahead of time, or when needed)
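As a sketch, such a claim template could look like this (size and name are illustrative):
```yaml
volumeClaimTemplates:
- metadata:
    name: data                     # yields PVCs named data-<set-name>-0, data-<set-name>-1, ...
  spec:
    accessModes: [ ReadWriteOnce ]
    resources:
      requests:
        storage: 1G
```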
---
# Persistent Volume Claims
- Our Pods can use a special volume type: a *Persistent Volume Claim*
- A Persistent Volume Claim (PVC) is also a Kubernetes resource
(visible with `kubectl get persistentvolumeclaims` or `kubectl get pvc`)
- A PVC is not a volume; it is a *request for a volume*
- It should indicate at least:
- the size of the volume (e.g. "5 GiB")
- the access mode (e.g. "read-write by a single pod")
---
## What's in a PVC?
- A PVC contains at least:
- a list of *access modes* (ReadWriteOnce, ReadOnlyMany, ReadWriteMany)
- a size (interpreted as the minimal storage space needed)
- It can also contain optional elements:
- a selector (to restrict which actual volumes it can use)
- a *storage class* (used by dynamic provisioning, more on that later)
---
## What does a PVC look like?
Here is a manifest for a basic PVC:
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```
---
## Using a Persistent Volume Claim
Here is a Pod definition like the ones shown earlier, but using a PVC:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-a-claim
spec:
  containers:
  - image: ...
    name: container-using-a-claim
    volumeMounts:
    - mountPath: /my-vol
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-claim
```
---
## Creating and using Persistent Volume Claims
- PVCs can be created manually and used explicitly
(as shown on the previous slides)
- They can also be created and used through Stateful Sets
(this will be shown later)
---
## Lifecycle of Persistent Volume Claims
- When a PVC is created, it starts existing in "Unbound" state
(without an associated volume)
- A Pod referencing an unbound PVC will not start
(the scheduler will wait until the PVC is bound to place it)
- A special controller continuously monitors PVCs to associate them with PVs
- If no PV is available, one must be created:
- manually (by operator intervention)
- using a *dynamic provisioner* (more on that later)
---
class: extra-details
## Which PV gets associated to a PVC?
- The PV must satisfy the PVC constraints
(access mode, size, optional selector, optional storage class)
- The PVs with the closest access mode are picked
- Then the PVs with the closest size
- It is possible to specify a `claimRef` when creating a PV
(this will associate it to the specified PVC, but only if the PV satisfies all the requirements of the PVC; otherwise another PV might end up being picked)
- For all the details about the PersistentVolumeClaimBinder, check [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/persistent-storage.md#matching-and-binding)
---
## Persistent Volume Claims and Stateful sets
- A Stateful set can define one (or more) `volumeClaimTemplate`
- Each `volumeClaimTemplate` will create one Persistent Volume Claim per pod
- Each pod will therefore have its own individual volume
- These volumes are numbered (like the pods)
- Example:
- a Stateful set is named `db`
- it is scaled to 3 replicas
- it has a `volumeClaimTemplate` named `data`
- then it will create pods `db-0`, `db-1`, `db-2`
- these pods will have volumes named `data-db-0`, `data-db-1`, `data-db-2`
---
## Persistent Volume Claims are sticky
- When updating the stateful set (e.g. image upgrade), each pod keeps its volume
- When pods get rescheduled (e.g. node failure), they keep their volume
(this requires a storage system that is not node-local)
- These volumes are not automatically deleted
(when the stateful set is scaled down or deleted)
- If a stateful set is scaled back up later, the pods get their data back
---
## Dynamic provisioners
- A *dynamic provisioner* monitors unbound PVCs
- It can create volumes (and the corresponding PV) on the fly
- This requires the PVCs to have a *storage class*
(annotation `volume.beta.kubernetes.io/storage-provisioner`)
- A dynamic provisioner only acts on PVCs with the right storage class
(it ignores the other ones)
- Just like `LoadBalancer` services, dynamic provisioners are optional
(i.e. our cluster may or may not have one pre-installed)
---
## What's a Storage Class?
- A Storage Class is yet another Kubernetes API resource
(visible with e.g. `kubectl get storageclass` or `kubectl get sc`)
- It indicates which *provisioner* to use
(which controller will create the actual volume)
- And arbitrary parameters for that provisioner
(replication levels, type of disk ... anything relevant!)
- Storage Classes are required if we want to use [dynamic provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/)
(but we can also create volumes manually, and ignore Storage Classes)
---
## The default storage class
- At most one storage class can be marked as the default class
(by annotating it with `storageclass.kubernetes.io/is-default-class=true`)
- When a PVC is created, it will be annotated with the default storage class
(unless it specifies an explicit storage class)
- This only happens at PVC creation
(existing PVCs are not updated when we mark a class as the default one)
---
## Dynamic provisioning setup
This is how we can achieve fully automated provisioning of persistent storage.
1. Configure a storage system.
(It needs to have an API, or be capable of automated provisioning of volumes.)
2. Install a dynamic provisioner for this storage system.
(This is some specific controller code.)
3. Create a Storage Class for this system.
(It has to match what the dynamic provisioner is expecting.)
4. Annotate the Storage Class to be the default one.
---
## Dynamic provisioning usage
After setting up the system (previous slide), all we need to do is:
*Create a Stateful Set that makes use of a `volumeClaimTemplate`.*
This will trigger the following actions.
1. The Stateful Set creates PVCs according to the `volumeClaimTemplate`.
2. The Stateful Set creates Pods using these PVCs.
3. The PVCs are automatically annotated with our Storage Class.
4. The dynamic provisioner provisions volumes and creates the corresponding PVs.
5. The PersistentVolumeClaimBinder associates the PVs and the PVCs together.
6. PVCs are now bound, the Pods can start.
???
:EN:- Deploying apps with Stateful Sets
:EN:- Example: deploying a Consul cluster
:EN:- Understanding Persistent Volume Claims and Storage Classes
:FR:- Déployer une application avec un *Stateful Set*
:FR:- Exemple : lancer un cluster Consul
:FR:- Comprendre les *Persistent Volume Claims* et *Storage Classes*

slides/k8s/volume-claim-templates.md Normal file

@@ -0,0 +1,314 @@
## Putting it all together
- We want to run that Consul cluster *and* actually persist data
- We'll use a StatefulSet that will leverage PV and PVC
- If we have a dynamic provisioner:
*the cluster will come up right away*
- If we don't have a dynamic provisioner:
*we will need to create Persistent Volumes manually*
---
## Persistent Volume Claims and Stateful sets
- A Stateful set can define one (or more) `volumeClaimTemplate`
- Each `volumeClaimTemplate` will create one Persistent Volume Claim per Pod
- Each Pod will therefore have its own individual volume
- These volumes are numbered (like the Pods)
- Example:
- a Stateful set is named `consul`
- it is scaled to 3 replicas
- it has a `volumeClaimTemplate` named `data`
- then it will create pods `consul-0`, `consul-1`, `consul-2`
- these pods will have volumes named `data`, referencing PersistentVolumeClaims
named `data-consul-0`, `data-consul-1`, `data-consul-2`
---
## Persistent Volume Claims are sticky
- When updating the stateful set (e.g. image upgrade), each pod keeps its volume
- When pods get rescheduled (e.g. node failure), they keep their volume
(this requires a storage system that is not node-local)
- These volumes are not automatically deleted
(when the stateful set is scaled down or deleted)
- If a stateful set is scaled back up later, the pods get their data back
---
## Deploying Consul
- Let's use a new manifest for our Consul cluster
- The only differences between that file and the previous one are:
- `volumeClaimTemplate` defined in the Stateful Set spec
- the corresponding `volumeMounts` in the Pod spec
.exercise[
- Apply the persistent Consul YAML file:
```bash
kubectl apply -f ~/container.training/k8s/consul-3.yaml
```
]
---
## No dynamic provisioner
- If we don't have a dynamic provisioner, we need to create the PVs
- We are going to use local volumes
(similar conceptually to `hostPath` volumes)
- We can use local volumes without installing extra plugins
- However, they are tied to a node
- If that node goes down, the volume becomes unavailable
---
## Observing the situation
- Let's look at Persistent Volume Claims and Pods
.exercise[
- Check that we now have an unbound Persistent Volume Claim:
```bash
kubectl get pvc
```
- We don't have any Persistent Volume:
```bash
kubectl get pv
```
- The Pod `consul-0` is not scheduled yet:
```bash
kubectl get pods -o wide
```
]
*Hint: leave these commands running with `-w` in different windows.*
---
## Explanations
- In a Stateful Set, the Pods are started one by one
- `consul-1` won't be created until `consul-0` is running
- `consul-0` has a dependency on an unbound Persistent Volume Claim
- The scheduler won't schedule the Pod until the PVC is bound
(because the PVC might be bound to a volume that is only available on a subset of nodes; for instance EBS volumes are tied to an availability zone)
---
## Creating Persistent Volumes
- Let's create 3 local directories (`/mnt/consul`) on node2, node3, node4
- Then create 3 Persistent Volumes corresponding to these directories
.exercise[
- Create the local directories:
```bash
for NODE in node2 node3 node4; do
ssh $NODE sudo mkdir -p /mnt/consul
done
```
- Create the PV objects:
```bash
kubectl apply -f ~/container.training/k8s/volumes-for-consul.yaml
```
]
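Each PV in that file presumably looks something like this sketch (a `local` volume pinned to its node with `nodeAffinity`; the exact definitions are in `k8s/volumes-for-consul.yaml`):
```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: consul-pv-node2            # hypothetical name
spec:
  capacity:
    storage: 1G
  accessModes: [ ReadWriteOnce ]
  local:
    path: /mnt/consul
  nodeAffinity:                    # a local PV must declare which node it lives on
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: [ node2 ]
```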
---
## Check our Consul cluster
- The PVs that we created will be automatically matched with the PVCs
- Once a PVC is bound, its pod can start normally
- Once the pod `consul-0` has started, `consul-1` can be created, etc.
- Eventually, our Consul cluster is up, and backed by "persistent" volumes
.exercise[
- Check that our Consul cluster indeed has 3 members:
```bash
kubectl exec consul-0 -- consul members
```
]
---
## Devil is in the details (1/2)
- The size of the Persistent Volumes is bogus
(it is used when matching PVs and PVCs together, but there is no actual quota or limit)
- The Pod might end up using more than the requested size
- The PV may or may not have the capacity that it's advertising
- It works well with dynamically provisioned block volumes
- ...Less so in other scenarios!
---
## Devil is in the details (2/2)
- This specific example worked because we had exactly 1 free PV per node:
- if we had created multiple PVs per node ...
- we could have ended with two PVCs bound to PVs on the same node ...
- which would have required two pods to be on the same node ...
- which is forbidden by the anti-affinity constraints in the StatefulSet
- To avoid that, we need to associate the PVs with a Storage Class that has:
```yaml
volumeBindingMode: WaitForFirstConsumer
```
(this means that a PVC will be bound to a PV only after being used by a Pod)
- See [this blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) for more details
---
## If we have a dynamic provisioner
These are the steps when dynamic provisioning happens:
1. The Stateful Set creates PVCs according to the `volumeClaimTemplate`.
2. The Stateful Set creates Pods using these PVCs.
3. The PVCs are automatically annotated with our Storage Class.
4. The dynamic provisioner provisions volumes and creates the corresponding PVs.
5. The PersistentVolumeClaimBinder associates the PVs and the PVCs together.
6. PVCs are now bound, the Pods can start.
---
## Validating persistence (1)
- When the StatefulSet is deleted, the PVC and PV still exist
- And if we recreate an identical StatefulSet, the PVC and PV are reused
- Let's see that!
.exercise[
- Put some data in Consul:
```bash
kubectl exec consul-0 -- consul kv put answer 42
```
- Delete the Consul cluster:
```bash
kubectl delete -f ~/container.training/k8s/consul-3.yaml
```
]
---
## Validating persistence (2)
.exercise[
- Wait until the last Pod is deleted:
```bash
kubectl wait pod consul-0 --for=delete
```
- Check that PV and PVC are still here:
```bash
kubectl get pv,pvc
```
]
---
## Validating persistence (3)
.exercise[
- Re-create the cluster:
```bash
kubectl apply -f ~/container.training/k8s/consul-3.yaml
```
- Wait until it's up
- Then access the key that we set earlier:
```bash
kubectl exec consul-0 -- consul kv get answer
```
]
---
## Cleaning up
- PV and PVC don't get deleted automatically
- This is great (less risk of accidental data loss)
- This is not great (storage usage increases)
- Managing PVC lifecycle:
- remove them manually
- add their StatefulSet to their `ownerReferences`
- delete the Namespace that they belong to
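The `ownerReferences` approach could look like this sketch, patched onto each PVC (the `uid` must be the one of the live StatefulSet):
```yaml
metadata:
  name: data-consul-0
  ownerReferences:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: consul
    uid: 00000000-0000-0000-0000-000000000000   # placeholder; use the real UID
```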
???
:EN:- Defining volumeClaimTemplates
:FR:- Définir des volumeClaimTemplates


@@ -84,5 +84,9 @@ content:
- k8s/configuration.md
- k8s/secrets.md
- k8s/statefulsets.md
- k8s/local-persistent-volumes.md
- k8s/portworx.md
- k8s/consul.md
- k8s/pv-pvc-sc.md
- k8s/volume-claim-templates.md
#- k8s/portworx.md
- k8s/openebs.md
- k8s/stateful-failover.md


@@ -110,8 +110,12 @@ content:
#- k8s/prometheus.md
#- k8s/prometheus-stack.md
#- k8s/statefulsets.md
#- k8s/local-persistent-volumes.md
#- k8s/consul.md
#- k8s/pv-pvc-sc.md
#- k8s/volume-claim-templates.md
#- k8s/portworx.md
#- k8s/openebs.md
#- k8s/stateful-failover.md
#- k8s/extending-api.md
#- k8s/crd.md
#- k8s/admission.md


@@ -112,9 +112,12 @@ content:
- k8s/configuration.md
- k8s/secrets.md
- k8s/statefulsets.md
- k8s/local-persistent-volumes.md
- k8s/consul.md
- k8s/pv-pvc-sc.md
- k8s/volume-claim-templates.md
- k8s/portworx.md
- k8s/openebs.md
- k8s/stateful-failover.md
-
- k8s/logs-centralized.md
- k8s/prometheus.md


@@ -110,9 +110,12 @@ content:
#- k8s/prometheus-stack.md
-
- k8s/statefulsets.md
- k8s/local-persistent-volumes.md
- k8s/portworx.md
#- k8s/openebs.md
- k8s/consul.md
- k8s/pv-pvc-sc.md
- k8s/volume-claim-templates.md
#- k8s/portworx.md
- k8s/openebs.md
- k8s/stateful-failover.md
#- k8s/extending-api.md
#- k8s/admission.md
#- k8s/operators.md