mirror of https://github.com/jpetazzo/container.training.git (synced 2026-02-14 09:39:56 +00:00)
🏭️ Refactor stateful apps content
20
k8s/mounter.yaml
Normal file
@@ -0,0 +1,20 @@
kind: Pod
apiVersion: v1
metadata:
  generateName: mounter-
  labels:
    container.training/mounter: ""
spec:
  volumes:
  - name: pvc
    persistentVolumeClaim:
      claimName: my-pvc-XYZ45
  containers:
  - name: mounter
    image: alpine
    stdin: true
    tty: true
    volumeMounts:
    - name: pvc
      mountPath: /pvc
    workingDir: /pvc
20
k8s/pv.yaml
Normal file
@@ -0,0 +1,20 @@
kind: PersistentVolume
apiVersion: v1
metadata:
  generateName: my-pv-
  labels:
    container.training/pv: ""
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  capacity:
    storage: 1G
  hostPath:
    path: /tmp/my-pv
  #storageClassName: my-sc
  #claimRef:
  #  kind: PersistentVolumeClaim
  #  apiVersion: v1
  #  namespace: default
  #  name: my-pvc-XYZ45
13
k8s/pvc.yaml
Normal file
@@ -0,0 +1,13 @@
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  generateName: my-pvc-
  labels:
    container.training/pvc: ""
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
  #storageClassName: my-sc
228
slides/k8s/consul.md
Normal file
@@ -0,0 +1,228 @@
|
||||
# Running a Consul cluster
|
||||
|
||||
- Here is a good use-case for Stateful sets!
|
||||
|
||||
- We are going to deploy a Consul cluster with 3 nodes
|
||||
|
||||
- Consul is a highly-available key/value store
|
||||
|
||||
(like etcd or Zookeeper)
|
||||
|
||||
- One easy way to bootstrap a cluster is to tell each node:
|
||||
|
||||
- the addresses of other nodes
|
||||
|
||||
- how many nodes are expected (to know when quorum is reached)
|
||||
|
||||
---
|
||||
|
||||
## Bootstrapping a Consul cluster
|
||||
|
||||
*After reading the Consul documentation carefully (and/or asking around),
|
||||
we figure out the minimal command-line to run our Consul cluster.*
|
||||
|
||||
```
|
||||
consul agent -data-dir=/consul/data -client=0.0.0.0 -server -ui \
|
||||
-bootstrap-expect=3 \
|
||||
-retry-join=`X.X.X.X` \
|
||||
-retry-join=`Y.Y.Y.Y`
|
||||
```
|
||||
|
||||
- Replace X.X.X.X and Y.Y.Y.Y with the addresses of other nodes
|
||||
|
||||
- A node can add its own address (it will work fine)
|
||||
|
||||
- ... Which means that we can use the same command-line on all nodes (convenient!)
|
||||
|
||||
---
|
||||
|
||||
## Cloud Auto-join
|
||||
|
||||
- Since version 1.4.0, Consul can use the Kubernetes API to find its peers
|
||||
|
||||
- This is called [Cloud Auto-join]
|
||||
|
||||
- Instead of passing an IP address, we need to pass a parameter like this:
|
||||
|
||||
```
|
||||
consul agent -retry-join "provider=k8s label_selector=\"app=consul\""
|
||||
```
|
||||
|
||||
- Consul needs to be able to talk to the Kubernetes API
|
||||
|
||||
- We can provide a `kubeconfig` file
|
||||
|
||||
- If Consul runs in a pod, it will use the *service account* of the pod
|
||||
|
||||
[Cloud Auto-join]: https://www.consul.io/docs/agent/cloud-auto-join.html#kubernetes-k8s-
|
||||
|
||||
---
|
||||
|
||||
## Setting up Cloud auto-join
|
||||
|
||||
- We need to create a service account for Consul
|
||||
|
||||
- We need to create a role that can `list` and `get` pods
|
||||
|
||||
- We need to bind that role to the service account
|
||||
|
||||
- And of course, we need to make sure that Consul pods use that service account
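
For instance, the Role part could look like the following sketch (the name is illustrative; the actual definitions used in the labs are in `k8s/consul-1.yaml`):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: consul
rules:
- apiGroups: [ "" ]        # core API group, where Pods live
  resources: [ "pods" ]
  verbs: [ "list", "get" ]
```

A RoleBinding then ties this Role to the Consul ServiceAccount, and the pod template sets `serviceAccountName` accordingly.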
|
||||
|
||||
---
|
||||
|
||||
## Putting it all together
|
||||
|
||||
- The file `k8s/consul-1.yaml` defines the required resources
|
||||
|
||||
(service account, role, role binding, service, stateful set)
|
||||
|
||||
- Inspired by this [excellent tutorial](https://github.com/kelseyhightower/consul-on-kubernetes) by Kelsey Hightower
|
||||
|
||||
(many features from the original tutorial were removed for simplicity)
|
||||
|
||||
---
|
||||
|
||||
## Running our Consul cluster
|
||||
|
||||
- We'll use the provided YAML file
|
||||
|
||||
.exercise[
|
||||
|
||||
- Create the stateful set and associated service:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/consul-1.yaml
|
||||
```
|
||||
|
||||
- Check the logs as the pods come up one after another:
|
||||
```bash
|
||||
stern consul
|
||||
```
|
||||
|
||||
<!--
|
||||
```wait Synced node info```
|
||||
```key ^C```
|
||||
-->
|
||||
|
||||
- Check the health of the cluster:
|
||||
```bash
|
||||
kubectl exec consul-0 -- consul members
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Caveats
|
||||
|
||||
- The scheduler may place two Consul pods on the same node
|
||||
|
||||
- if that node fails, we lose two Consul pods at the same time
|
||||
- this will cause the cluster to fail
|
||||
|
||||
- Scaling down the cluster will cause it to fail
|
||||
|
||||
- when a Consul member leaves the cluster, it needs to inform the others
|
||||
- otherwise, the last remaining node doesn't have quorum and stops functioning
|
||||
|
||||
- This Consul cluster doesn't use real persistence yet
|
||||
|
||||
- data is stored in the containers' ephemeral filesystem
|
||||
- if a pod fails, its replacement starts from a blank slate
|
||||
|
||||
---
|
||||
|
||||
## Improving pod placement
|
||||
|
||||
- We need to tell the scheduler:
|
||||
|
||||
*do not put two of these pods on the same node!*
|
||||
|
||||
- This is done with an `affinity` section like the following one:
|
||||
```yaml
|
||||
affinity:
|
||||
podAntiAffinity:
|
||||
requiredDuringSchedulingIgnoredDuringExecution:
|
||||
- labelSelector:
|
||||
matchLabels:
|
||||
app: consul
|
||||
topologyKey: kubernetes.io/hostname
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Using a lifecycle hook
|
||||
|
||||
- When a Consul member leaves the cluster, it needs to execute:
|
||||
```bash
|
||||
consul leave
|
||||
```
|
||||
|
||||
- This is done with a `lifecycle` section like the following one:
|
||||
```yaml
|
||||
lifecycle:
|
||||
preStop:
|
||||
exec:
|
||||
command: [ "sh", "-c", "consul leave" ]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Running a better Consul cluster
|
||||
|
||||
- Let's try to add the scheduling constraint and lifecycle hook
|
||||
|
||||
- We can do that in the same namespace or another one (as we like)
|
||||
|
||||
- If we do that in the same namespace, we will see a rolling update
|
||||
|
||||
(pods will be replaced one by one)
|
||||
|
||||
.exercise[
|
||||
|
||||
- Deploy a better Consul cluster:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/consul-2.yaml
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Still no persistence, though
|
||||
|
||||
- We aren't using actual persistence yet
|
||||
|
||||
(no `volumeClaimTemplate`, Persistent Volume, etc.)
|
||||
|
||||
- What happens if we lose a pod?
|
||||
|
||||
- a new pod gets rescheduled (with an empty state)
|
||||
|
||||
- the new pod tries to connect to the two others
|
||||
|
||||
- it will be accepted (after 1-2 minutes of instability)
|
||||
|
||||
- and it will retrieve the data from the other pods
|
||||
|
||||
---
|
||||
|
||||
## Failure modes
|
||||
|
||||
- What happens if we lose two pods?
|
||||
|
||||
- manual repair will be required
|
||||
|
||||
- we will need to instruct the remaining one to act solo
|
||||
|
||||
- then rejoin new pods
|
||||
|
||||
- What happens if we lose three pods? (aka all of them)
|
||||
|
||||
- we lose all the data (ouch)
|
||||
|
||||
???
|
||||
|
||||
:EN:- Scheduling pods together or separately
|
||||
:EN:- Example: deploying a Consul cluster
|
||||
:FR:- Lancer des pods ensemble ou séparément
|
||||
:FR:- Exemple : lancer un cluster Consul
|
||||
@@ -1,251 +0,0 @@
|
||||
# Local Persistent Volumes
|
||||
|
||||
- We want to run that Consul cluster *and* actually persist data
|
||||
|
||||
- But we don't have a distributed storage system
|
||||
|
||||
- We are going to use local volumes instead
|
||||
|
||||
(similar conceptually to `hostPath` volumes)
|
||||
|
||||
- We can use local volumes without installing extra plugins
|
||||
|
||||
- However, they are tied to a node
|
||||
|
||||
- If that node goes down, the volume becomes unavailable
|
||||
|
||||
---
|
||||
|
||||
## With or without dynamic provisioning
|
||||
|
||||
- We will deploy a Consul cluster *with* persistence
|
||||
|
||||
- That cluster's StatefulSet will create PVCs
|
||||
|
||||
- These PVCs will remain unbound¹ until we create local volumes manually
|
||||
|
||||
(we will basically do the job of the dynamic provisioner)
|
||||
|
||||
- Then, we will see how to automate that with a dynamic provisioner
|
||||
|
||||
.footnote[¹Unbound = without an associated Persistent Volume.]
|
||||
|
||||
---
|
||||
|
||||
## If we have a dynamic provisioner ...
|
||||
|
||||
- The labs in this section assume that we *do not* have a dynamic provisioner
|
||||
|
||||
- If we do have one, we need to disable it
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check if we have a dynamic provisioner:
|
||||
```bash
|
||||
kubectl get storageclass
|
||||
```
|
||||
|
||||
- If the output contains a line with `(default)`, run this command:
|
||||
```bash
|
||||
kubectl annotate sc storageclass.kubernetes.io/is-default-class- --all
|
||||
```
|
||||
|
||||
- Check again that it is no longer marked as `(default)`
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Deploying Consul
|
||||
|
||||
- Let's use a new manifest for our Consul cluster
|
||||
|
||||
- The only differences between that file and the previous one are:
|
||||
|
||||
- `volumeClaimTemplate` defined in the Stateful Set spec
|
||||
|
||||
- the corresponding `volumeMounts` in the Pod spec
|
||||
|
||||
.exercise[
|
||||
|
||||
- Apply the persistent Consul YAML file:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/consul-3.yaml
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Observing the situation
|
||||
|
||||
- Let's look at Persistent Volume Claims and Pods
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check that we now have an unbound Persistent Volume Claim:
|
||||
```bash
|
||||
kubectl get pvc
|
||||
```
|
||||
|
||||
- We don't have any Persistent Volume:
|
||||
```bash
|
||||
kubectl get pv
|
||||
```
|
||||
|
||||
- The Pod `consul-0` is not scheduled yet:
|
||||
```bash
|
||||
kubectl get pods -o wide
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
*Hint: leave these commands running with `-w` in different windows.*
|
||||
|
||||
---
|
||||
|
||||
## Explanations
|
||||
|
||||
- In a Stateful Set, the Pods are started one by one
|
||||
|
||||
- `consul-1` won't be created until `consul-0` is running
|
||||
|
||||
- `consul-0` has a dependency on an unbound Persistent Volume Claim
|
||||
|
||||
- The scheduler won't schedule the Pod until the PVC is bound
|
||||
|
||||
(because the PVC might be bound to a volume that is only available on a subset of nodes; for instance EBS are tied to an availability zone)
|
||||
|
||||
---
|
||||
|
||||
## Creating Persistent Volumes
|
||||
|
||||
- Let's create 3 local directories (`/mnt/consul`) on node2, node3, node4
|
||||
|
||||
- Then create 3 Persistent Volumes corresponding to these directories
|
||||
|
||||
.exercise[
|
||||
|
||||
- Create the local directories:
|
||||
```bash
|
||||
for NODE in node2 node3 node4; do
|
||||
ssh $NODE sudo mkdir -p /mnt/consul
|
||||
done
|
||||
```
|
||||
|
||||
- Create the PV objects:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/volumes-for-consul.yaml
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Check our Consul cluster
|
||||
|
||||
- The PVs that we created will be automatically matched with the PVCs
|
||||
|
||||
- Once a PVC is bound, its pod can start normally
|
||||
|
||||
- Once the pod `consul-0` has started, `consul-1` can be created, etc.
|
||||
|
||||
- Eventually, our Consul cluster is up, and backed by "persistent" volumes
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check that our Consul cluster indeed has 3 members:
|
||||
```bash
|
||||
kubectl exec consul-0 -- consul members
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Devil is in the details (1/2)
|
||||
|
||||
- The size of the Persistent Volumes is bogus
|
||||
|
||||
(it is used when matching PVs and PVCs together, but there is no actual quota or limit)
|
||||
|
||||
---
|
||||
|
||||
## Devil is in the details (2/2)
|
||||
|
||||
- This specific example worked because we had exactly 1 free PV per node:
|
||||
|
||||
- if we had created multiple PVs per node ...
|
||||
|
||||
- we could have ended with two PVCs bound to PVs on the same node ...
|
||||
|
||||
- which would have required two pods to be on the same node ...
|
||||
|
||||
- which is forbidden by the anti-affinity constraints in the StatefulSet
|
||||
|
||||
- To avoid that, we need to associate the PVs with a Storage Class that has:
|
||||
```yaml
|
||||
volumeBindingMode: WaitForFirstConsumer
|
||||
```
|
||||
(this means that a PVC will be bound to a PV only after being used by a Pod)
|
||||
|
||||
- See [this blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) for more details
|
||||
|
||||
---
|
||||
|
||||
## Bulk provisioning
|
||||
|
||||
- It's not practical to manually create directories and PVs for each app
|
||||
|
||||
- We *could* pre-provision a number of PVs across our fleet
|
||||
|
||||
- We could even automate that with a Daemon Set:
|
||||
|
||||
- creating a number of directories on each node
|
||||
|
||||
- creating the corresponding PV objects
|
||||
|
||||
- We also need to recycle volumes
|
||||
|
||||
- ... This can quickly get out of hand
|
||||
|
||||
---
|
||||
|
||||
## Dynamic provisioning
|
||||
|
||||
- We could also write our own provisioner, which would:
|
||||
|
||||
- watch the PVCs across all namespaces
|
||||
|
||||
- when a PVC is created, create a corresponding PV on a node
|
||||
|
||||
- Or we could use one of the dynamic provisioners for local persistent volumes
|
||||
|
||||
(for instance the [Rancher local path provisioner](https://github.com/rancher/local-path-provisioner))
|
||||
|
||||
---
|
||||
|
||||
## Strategies for local persistent volumes
|
||||
|
||||
- Remember, when a node goes down, the volumes on that node become unavailable
|
||||
|
||||
- High availability will require another layer of replication
|
||||
|
||||
(like what we've just seen with Consul; or primary/secondary; etc)
|
||||
|
||||
- Pre-provisioning PVs makes sense for machines with local storage
|
||||
|
||||
(e.g. cloud instance storage; or storage directly attached to a physical machine)
|
||||
|
||||
- Dynamic provisioning makes sense for large numbers of applications
|
||||
|
||||
(when we can't or won't dedicate a whole disk to a volume)
|
||||
|
||||
- It's possible to mix both (using distinct Storage Classes)
|
||||
|
||||
???
|
||||
|
||||
:EN:- Static vs dynamic volume provisioning
|
||||
:EN:- Example: local persistent volume provisioner
|
||||
:FR:- Création statique ou dynamique de volumes
|
||||
:FR:- Exemple : création de volumes locaux
|
||||
@@ -321,207 +321,13 @@ EOF
|
||||
|
||||
---
|
||||
|
||||
## Creating a Pod using the Jiva class
|
||||
## We're ready now!
|
||||
|
||||
- We will create a Pod running PostgreSQL, using the default class
|
||||
- We have a StorageClass that can provision PersistentVolumes
|
||||
|
||||
.exercise[
|
||||
- These PersistentVolumes will be replicated across nodes
|
||||
|
||||
- Create the Pod:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/postgres.yaml
|
||||
```
|
||||
|
||||
- Wait for the PV, PVC, and Pod to be up:
|
||||
```bash
|
||||
watch kubectl get pv,pvc,pod
|
||||
```
|
||||
|
||||
- We can also check what's going on in the `openebs` namespace:
|
||||
```bash
|
||||
watch kubectl get pods --namespace openebs
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Node failover
|
||||
|
||||
⚠️ This will partially break your cluster!
|
||||
|
||||
- We are going to disconnect the node running PostgreSQL from the cluster
|
||||
|
||||
- We will see what happens, and how to recover
|
||||
|
||||
- We will not reconnect the node to the cluster
|
||||
|
||||
- This whole lab will take at least 10-15 minutes (due to various timeouts)
|
||||
|
||||
⚠️ Only do this lab at the very end, when you don't want to run anything else after!
|
||||
|
||||
---
|
||||
|
||||
## Disconnecting the node from the cluster
|
||||
|
||||
.exercise[
|
||||
|
||||
- Find out where the Pod is running, and SSH into that node:
|
||||
```bash
|
||||
kubectl get pod postgres-0 -o jsonpath={.spec.nodeName}
|
||||
ssh nodeX
|
||||
```
|
||||
|
||||
- Check the name of the network interface:
|
||||
```bash
|
||||
sudo ip route ls default
|
||||
```
|
||||
|
||||
- The output should look like this:
|
||||
```
|
||||
default via 10.10.0.1 `dev ensX` proto dhcp src 10.10.0.13 metric 100
|
||||
```
|
||||
|
||||
- Shutdown the network interface:
|
||||
```bash
|
||||
sudo ip link set ensX down
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Watch what's going on
|
||||
|
||||
- Let's look at the status of Nodes, Pods, and Events
|
||||
|
||||
.exercise[
|
||||
|
||||
- In a first pane/tab/window, check Nodes and Pods:
|
||||
```bash
|
||||
watch kubectl get nodes,pods -o wide
|
||||
```
|
||||
|
||||
- In another pane/tab/window, check Events:
|
||||
```bash
|
||||
kubectl get events --watch
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Node Ready → NotReady
|
||||
|
||||
- After \~30 seconds, the control plane stops receiving heartbeats from the Node
|
||||
|
||||
- The Node is marked NotReady
|
||||
|
||||
- It is not *schedulable* anymore
|
||||
|
||||
(the scheduler won't place new pods there, except some special cases)
|
||||
|
||||
- All Pods on that Node are also *not ready*
|
||||
|
||||
(they get removed from service Endpoints)
|
||||
|
||||
- ... But nothing else happens for now
|
||||
|
||||
(the control plane is waiting: maybe the Node will come back shortly?)
|
||||
|
||||
---
|
||||
|
||||
## Pod eviction
|
||||
|
||||
- After \~5 minutes, the control plane will evict most Pods from the Node
|
||||
|
||||
- These Pods are now `Terminating`
|
||||
|
||||
- The Pods controlled by e.g. ReplicaSets are automatically moved
|
||||
|
||||
(or rather: new Pods are created to replace them)
|
||||
|
||||
- But nothing happens to the Pods controlled by StatefulSets at this point
|
||||
|
||||
(they remain `Terminating` forever)
|
||||
|
||||
- Why? 🤔
|
||||
|
||||
--
|
||||
|
||||
- This is to avoid *split brain scenarios*
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## Split brain 🧠⚡️🧠
|
||||
|
||||
- Imagine that we create a replacement pod `postgres-0` on another Node
|
||||
|
||||
- And 15 minutes later, the Node is reconnected and the original `postgres-0` comes back
|
||||
|
||||
- Which one is the "right" one?
|
||||
|
||||
- What if they have conflicting data?
|
||||
|
||||
😱
|
||||
|
||||
- We *cannot* let that happen!
|
||||
|
||||
- Kubernetes won't do it
|
||||
|
||||
- ... Unless we tell it to
|
||||
|
||||
---
|
||||
|
||||
## The Node is gone
|
||||
|
||||
- One thing we can do, is tell Kubernetes "the Node won't come back"
|
||||
|
||||
(there are other methods; but this one is the simplest one here)
|
||||
|
||||
- This is done with a simple `kubectl delete node`
|
||||
|
||||
.exercise[
|
||||
|
||||
- `kubectl delete` the Node that we disconnected
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Pod rescheduling
|
||||
|
||||
- Kubernetes removes the Node
|
||||
|
||||
- After a brief period of time (\~1 minute) the "Terminating" Pods are removed
|
||||
|
||||
- A replacement Pod is created on another Node
|
||||
|
||||
- ... But it doesn't start yet!
|
||||
|
||||
- Why? 🤔
|
||||
|
||||
---
|
||||
|
||||
## Multiple attachment
|
||||
|
||||
- By default, a disk can only be attached to one Node at a time
|
||||
|
||||
(sometimes it's a hardware or API limitation; sometimes enforced in software)
|
||||
|
||||
- In our Events, we should see `FailedAttachVolume` and `FailedMount` messages
|
||||
|
||||
- After \~5 more minutes, the disk will be force-detached from the old Node
|
||||
|
||||
- ... Which will allow attaching it to the new Node!
|
||||
|
||||
🎉
|
||||
|
||||
- The Pod will then be able to start
|
||||
|
||||
- Failover is complete!
|
||||
- They should be able to withstand single-node failures
|
||||
|
||||
???
|
||||
|
||||
|
||||
@@ -1,42 +1,4 @@
|
||||
# Highly available Persistent Volumes
|
||||
|
||||
- How can we achieve true durability?
|
||||
|
||||
- How can we store data that would survive the loss of a node?
|
||||
|
||||
--
|
||||
|
||||
- We need to use Persistent Volumes backed by highly available storage systems
|
||||
|
||||
- There are many ways to achieve that:
|
||||
|
||||
- leveraging our cloud's storage APIs
|
||||
|
||||
- using NAS/SAN systems or file servers
|
||||
|
||||
- distributed storage systems
|
||||
|
||||
--
|
||||
|
||||
- We are going to see one distributed storage system in action
|
||||
|
||||
---
|
||||
|
||||
## Our test scenario
|
||||
|
||||
- We will set up a distributed storage system on our cluster
|
||||
|
||||
- We will use it to deploy a SQL database (PostgreSQL)
|
||||
|
||||
- We will insert some test data in the database
|
||||
|
||||
- We will disrupt the node running the database
|
||||
|
||||
- We will see how it recovers
|
||||
|
||||
---
|
||||
|
||||
## Portworx
|
||||
# Portworx
|
||||
|
||||
- Portworx is a *commercial* persistent storage solution for containers
|
||||
|
||||
@@ -60,7 +22,7 @@
|
||||
|
||||
- We're installing Portworx because we need a storage system
|
||||
|
||||
- If you are using AKS, EKS, GKE ... you already have a storage system
|
||||
- If you are using AKS, EKS, GKE, Kapsule ... you already have a storage system
|
||||
|
||||
(but you might want another one, e.g. to leverage local storage)
|
||||
|
||||
@@ -301,364 +263,6 @@ parameters:
|
||||
|
||||
---
|
||||
|
||||
## Our Postgres Stateful set
|
||||
|
||||
- The next slide shows `k8s/postgres.yaml`
|
||||
|
||||
- It defines a Stateful set
|
||||
|
||||
- With a `volumeClaimTemplate` requesting a 1 GB volume
|
||||
|
||||
- That volume will be mounted to `/var/lib/postgresql/data`
|
||||
|
||||
- There is another little detail: we enable the `stork` scheduler
|
||||
|
||||
- The `stork` scheduler is optional (it's specific to Portworx)
|
||||
|
||||
- It helps the Kubernetes scheduler to colocate the pod with its volume
|
||||
|
||||
(see [this blog post](https://portworx.com/stork-storage-orchestration-kubernetes/) for more details about that)
|
||||
|
||||
---
|
||||
|
||||
.small[
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: postgres
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
app: postgres
|
||||
serviceName: postgres
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: postgres
|
||||
spec:
|
||||
schedulerName: stork
|
||||
containers:
|
||||
- name: postgres
|
||||
image: postgres:12
|
||||
env:
|
||||
- name: POSTGRES_HOST_AUTH_METHOD
|
||||
value: trust
|
||||
volumeMounts:
|
||||
- mountPath: /var/lib/postgresql/data
|
||||
name: postgres
|
||||
volumeClaimTemplates:
|
||||
- metadata:
|
||||
name: postgres
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
resources:
|
||||
requests:
|
||||
storage: 1Gi
|
||||
```
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Creating the Stateful set
|
||||
|
||||
- Before applying the YAML, watch what's going on with `kubectl get events -w`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Apply that YAML:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/postgres.yaml
|
||||
```
|
||||
|
||||
<!-- ```hide kubectl wait pod postgres-0 --for condition=ready``` -->
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Testing our PostgreSQL pod
|
||||
|
||||
- We will use `kubectl exec` to get a shell in the pod
|
||||
|
||||
- Good to know: we need to use the `postgres` user in the pod
|
||||
|
||||
.exercise[
|
||||
|
||||
- Get a shell in the pod, as the `postgres` user:
|
||||
```bash
|
||||
kubectl exec -ti postgres-0 -- su postgres
|
||||
```
|
||||
|
||||
<!--
|
||||
autopilot prompt detection expects $ or # at the beginning of the line.
|
||||
```wait postgres@postgres```
|
||||
```keys PS1="\u@\h:\w\n\$ "```
|
||||
```key ^J```
|
||||
-->
|
||||
|
||||
- Check that default databases have been created correctly:
|
||||
```bash
|
||||
psql -l
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
(This should show us 3 lines: postgres, template0, and template1.)
|
||||
|
||||
---
|
||||
|
||||
## Inserting data in PostgreSQL
|
||||
|
||||
- We will create a database and populate it with `pgbench`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Create a database named `demo`:
|
||||
```bash
|
||||
createdb demo
|
||||
```
|
||||
|
||||
- Populate it with `pgbench`:
|
||||
```bash
|
||||
pgbench -i demo
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
- The `-i` flag means "create tables"
|
||||
|
||||
- If you want more data in the test tables, add e.g. `-s 10` (to get 10x more rows)
|
||||
|
||||
---
|
||||
|
||||
## Checking how much data we have now
|
||||
|
||||
- The `pgbench` tool inserts rows in table `pgbench_accounts`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check that the `demo` base exists:
|
||||
```bash
|
||||
psql -l
|
||||
```
|
||||
|
||||
- Check how many rows we have in `pgbench_accounts`:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_accounts"
|
||||
```
|
||||
|
||||
- Check that `pgbench_history` is currently empty:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_history"
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Testing the load generator
|
||||
|
||||
- Let's use `pgbench` to generate a few transactions
|
||||
|
||||
.exercise[
|
||||
|
||||
- Run `pgbench` for 10 seconds, reporting progress every second:
|
||||
```bash
|
||||
pgbench -P 1 -T 10 demo
|
||||
```
|
||||
|
||||
- Check the size of the history table now:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_history"
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
Note: on small cloud instances, a typical speed is about 100 transactions/second.
|
||||
|
||||
---
|
||||
|
||||
## Generating transactions
|
||||
|
||||
- Now let's use `pgbench` to generate more transactions
|
||||
|
||||
- While it's running, we will disrupt the database server
|
||||
|
||||
.exercise[
|
||||
|
||||
- Run `pgbench` for 10 minutes, reporting progress every second:
|
||||
```bash
|
||||
pgbench -P 1 -T 600 demo
|
||||
```
|
||||
|
||||
- You can use a longer time period if you need more time to run the next steps
|
||||
|
||||
<!-- ```tmux split-pane -h``` -->
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Find out which node is hosting the database
|
||||
|
||||
- We can find that information with `kubectl get pods -o wide`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check the node running the database:
|
||||
```bash
|
||||
kubectl get pod postgres-0 -o wide
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
We are going to disrupt that node.
|
||||
|
||||
--
|
||||
|
||||
By "disrupt" we mean: "disconnect it from the network".
|
||||
|
||||
---
|
||||
|
||||
## Disconnect the node
|
||||
|
||||
- We will use `iptables` to block all traffic exiting the node
|
||||
|
||||
(except SSH traffic, so we can repair the node later if needed)
|
||||
|
||||
.exercise[
|
||||
|
||||
- SSH to the node to disrupt:
|
||||
```bash
|
||||
ssh `nodeX`
|
||||
```
|
||||
|
||||
- Allow SSH traffic leaving the node, but block all other traffic:
|
||||
```bash
|
||||
sudo iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
|
||||
sudo iptables -I OUTPUT 2 -j DROP
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Check that the node is disconnected
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check that the node can't communicate with other nodes:
|
||||
```
|
||||
ping node1
|
||||
```
|
||||
|
||||
- Logout to go back on `node1`
|
||||
|
||||
<!-- ```key ^D``` -->
|
||||
|
||||
- Watch the events unfolding with `kubectl get events -w` and `kubectl get pods -w`
|
||||
|
||||
]
|
||||
|
||||
- It will take some time for Kubernetes to mark the node as unhealthy
|
||||
|
||||
- Then it will attempt to reschedule the pod to another node
|
||||
|
||||
- In about a minute, our pod should be up and running again
|
||||
|
||||
---
|
||||
|
||||
## Check that our data is still available
|
||||
|
||||
- We are going to reconnect to the (new) pod and check
|
||||
|
||||
.exercise[
|
||||
|
||||
- Get a shell on the pod:
|
||||
```bash
|
||||
kubectl exec -ti postgres-0 -- su postgres
|
||||
```
|
||||
|
||||
<!--
|
||||
```wait postgres@postgres```
|
||||
```keys PS1="\u@\h:\w\n\$ "```
|
||||
```key ^J```
|
||||
-->
|
||||
|
||||
- Check how many transactions are now in the `pgbench_history` table:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_history"
|
||||
```
|
||||
|
||||
<!-- ```key ^D``` -->
|
||||
|
||||
]
|
||||
|
||||
If the 10-second test that we ran earlier gave e.g. 80 transactions per second,
|
||||
and we failed the node after 30 seconds, we should have about 2,400 rows in that table.
|
||||
|
||||
---
|
||||
|
||||
## Double-check that the pod has really moved
|
||||
|
||||
- Just to make sure the system is not bluffing!
|
||||
|
||||
.exercise[
|
||||
|
||||
- Look at which node the pod is now running on
|
||||
```bash
|
||||
kubectl get pod postgres-0 -o wide
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Re-enable the node
|
||||
|
||||
- Let's fix the node that we disconnected from the network
|
||||
|
||||
.exercise[
|
||||
|
||||
- SSH to the node:
|
||||
```bash
|
||||
ssh `nodeX`
|
||||
```
|
||||
|
||||
- Remove the iptables rule blocking traffic:
|
||||
```bash
|
||||
sudo iptables -D OUTPUT 2
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## A few words about this PostgreSQL setup
|
||||
|
||||
- In a real deployment, you would want to set a password
|
||||
|
||||
- This can be done by creating a `secret`:
|
||||
```
|
||||
kubectl create secret generic postgres \
|
||||
--from-literal=password=$(base64 /dev/urandom | head -c16)
|
||||
```
|
||||
|
||||
- And then passing that secret to the container:
|
||||
```yaml
|
||||
env:
|
||||
- name: POSTGRES_PASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: postgres
|
||||
key: password
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## Troubleshooting Portworx
|
||||
@@ -666,7 +270,7 @@ class: extra-details
|
||||
- If we need to see what's going on with Portworx:
|
||||
```
|
||||
PXPOD=$(kubectl -n kube-system get pod -l name=portworx -o json |
|
||||
jq -r .items[0].metadata.name)
|
||||
jq -r .items[0].metadata.name)
|
||||
kubectl -n kube-system exec $PXPOD -- /opt/pwx/bin/pxctl status
|
||||
```
|
||||
|
||||
@@ -709,26 +313,6 @@ class: extra-details
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## Dynamic provisioning without a provider
|
||||
|
||||
- What if we want to use Stateful sets without a storage provider?
|
||||
|
||||
- We will have to create volumes manually
|
||||
|
||||
(by creating Persistent Volume objects)
|
||||
|
||||
- These volumes will be automatically bound with matching Persistent Volume Claims
|
||||
|
||||
- We can use local volumes (essentially bind mounts of host directories)
|
||||
|
||||
- Of course, these volumes won't be available in case of node failure
|
||||
|
||||
- Check [this blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) for more information and gotchas
|
||||
|
||||
---
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
The Portworx installation tutorial, and the PostgreSQL example,
|
||||
@@ -748,8 +332,5 @@ were inspired by [Portworx examples on Katacoda](https://katacoda.com/portworx/s
|
||||
|
||||
???
|
||||
|
||||
:EN:- Using highly available persistent volumes
|
||||
:EN:- Example: deploying a database that can withstand node outages
|
||||
|
||||
:FR:- Utilisation de volumes à haute disponibilité
|
||||
:FR:- Exemple : déployer une base de données survivant à la défaillance d'un nœud
|
||||
:EN:- Hyperconverged storage with Portworx
|
||||
:FR:- Stockage hyperconvergé avec Portworx
|
||||
|
||||
323
slides/k8s/pv-pvc-sc.md
Normal file
@@ -0,0 +1,323 @@
|
||||
# PV, PVC, and Storage Classes
|
||||
|
||||
- When an application needs storage, it creates a PersistentVolumeClaim
|
||||
|
||||
(either directly, or through a volume claim template in a Stateful Set)
|
||||
|
||||
- The PersistentVolumeClaim is initially `Pending`
|
||||
|
||||
- Kubernetes then looks for a suitable PersistentVolume
|
||||
|
||||
(maybe one is immediately available; maybe we need to wait for provisioning)
|
||||
|
||||
- Once a suitable PersistentVolume is found, the PVC becomes `Bound`
|
||||
|
||||
- The PVC can then be used by a Pod
|
||||
|
||||
(as long as the PVC is `Pending`, the Pod cannot run)
|
||||
|
||||
---
|
||||
|
||||
## Access modes
|
||||
|
||||
- PV and PVC have *access modes*:
|
||||
|
||||
- ReadWriteOnce (only one node can access the volume at a time)
|
||||
|
||||
- ReadWriteMany (multiple nodes can access the volume simultaneously)
|
||||
|
||||
- ReadOnlyMany (multiple nodes can access, but they can't write)
|
||||
|
||||
- ReadWriteOncePod (only one pod can access the volume; new in Kubernetes 1.22)
|
||||
|
||||
- A PV lists the access modes that it supports
|
||||
|
||||
- A PVC lists the access modes that it requires
|
||||
|
||||
⚠️ A PV with only ReadWriteMany won't satisfy a PVC with ReadWriteOnce!
|
||||
|
||||
---
|
||||
|
||||
## Capacity
|
||||
|
||||
- A PVC must express a storage size request
|
||||
|
||||
(field `spec.resources.requests.storage`, in bytes)
|
||||
|
||||
- A PV must express its size
|
||||
|
||||
(field `spec.capacity.storage`, in bytes)
|
||||
|
||||
- Kubernetes will only match a PV and PVC if the PV is big enough
|
||||
|
||||
- These fields are only used for "matchmaking" purposes:
|
||||
|
||||
- nothing prevents the Pod mounting the PVC from using more space
|
||||
|
||||
- nothing requires the PV to actually be that big
|
||||
|
||||
---
|
||||
|
||||
## Storage Class
|
||||
|
||||
- What if we have multiple storage systems available?
|
||||
|
||||
(e.g. NFS and iSCSI; or AzureFile and AzureDisk; or Cinder and Ceph...)
|
||||
|
||||
- What if we have a storage system with multiple tiers?
|
||||
|
||||
(e.g. SAN with RAID1 and RAID5; general-purpose vs. I/O-optimized EBS...)
|
||||
|
||||
- Kubernetes lets us define *storage classes* to represent these
|
||||
|
||||
(see if you have any available at the moment with `kubectl get storageclasses`)
|
||||
|
||||
---
|
||||
|
||||
## Using storage classes
|
||||
|
||||
- Optionally, each PV and each PVC can reference a StorageClass
|
||||
|
||||
(field `spec.storageClassName`)
|
||||
|
||||
- When creating a PVC, specifying a StorageClass means
|
||||
|
||||
“use that particular storage system to provision the volume!”
|
||||
|
||||
- Storage classes are necessary for [dynamic provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/)
|
||||
|
||||
(but we can also ignore them and perform manual provisioning)
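
For example, a PVC explicitly requesting a (hypothetical) StorageClass named `fast` would look like this sketch:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  storageClassName: fast       # must match the name of an existing StorageClass
  accessModes: [ ReadWriteOnce ]
  resources:
    requests:
      storage: 1G
```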
|
||||
|
||||
---
|
||||
|
||||
## Default storage class
|
||||
|
||||
- We can define a *default storage class*
|
||||
|
||||
(by annotating it with `storageclass.kubernetes.io/is-default-class=true`)
|
||||
|
||||
- When a PVC is created,
|
||||
|
||||
**IF** it doesn't indicate which storage class to use
|
||||
|
||||
**AND** there is a default storage class
|
||||
|
||||
**THEN** the PVC `storageClassName` is set to the default storage class
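
As an illustration, here is a sketch of a StorageClass marked as the default (using the Rancher local path provisioner as an example):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
```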
|
||||
|
||||
---
|
||||
|
||||
## Additional constraints
|
||||
|
||||
- A PersistentVolumeClaim can also specify a volume selector
|
||||
|
||||
(referring to labels on the PV)
|
||||
|
||||
- A PersistentVolume can also be created with a `claimRef`
|
||||
|
||||
(indicating to which PVC it should be bound)
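
For instance, a claim can restrict itself to volumes carrying a specific label (the label shown here is hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  accessModes: [ ReadWriteOnce ]
  resources:
    requests:
      storage: 1G
  selector:
    matchLabels:
      disk: ssd
```

(The `claimRef` side can be seen, commented out, in `k8s/pv.yaml` shown earlier.)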
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## Which PV gets associated to a PVC?
|
||||
|
||||
- The PV must be `Available`
|
||||
|
||||
- The PV must satisfy the PVC constraints
|
||||
|
||||
(access mode, size, optional selector, optional storage class)
|
||||
|
||||
- The PVs with the closest access mode are picked
|
||||
|
||||
- Then the PVs with the closest size
|
||||
|
||||
- It is possible to specify a `claimRef` when creating a PV
|
||||
|
||||
(this will associate it to the specified PVC, but only if the PV satisfies all the requirements of the PVC; otherwise another PV might end up being picked)
|
||||
|
||||
- For all the details about the PersistentVolumeClaimBinder, check [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/persistent-storage.md#matching-and-binding)
|
||||
|
||||
---
|
||||
|
||||
## Creating a PVC
|
||||
|
||||
- Let's create a standalone PVC and see what happens!
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check if we have a StorageClass:
|
||||
```bash
|
||||
kubectl get storageclasses
|
||||
```
|
||||
|
||||
- Create the PVC:
|
||||
```bash
|
||||
kubectl create -f ~/container.training/k8s/pvc.yaml
|
||||
```
|
||||
|
||||
- Check the PVC:
|
||||
```bash
|
||||
kubectl get pvc
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Four possibilities
|
||||
|
||||
1. If we have a default StorageClass with *immediate* binding:
|
||||
|
||||
*a PV was created and associated to the PVC*
|
||||
|
||||
2. If we have a default StorageClass that *waits for first consumer*:
|
||||
|
||||
*the PVC is still `Pending` but has a `STORAGECLASS`* ⚠️
|
||||
|
||||
3. If we don't have a default StorageClass:
|
||||
|
||||
*the PVC is still `Pending`, without a `STORAGECLASS`*
|
||||
|
||||
4. If we have a StorageClass, but it doesn't work:
|
||||
|
||||
*the PVC is still `Pending` but has a `STORAGECLASS`* ⚠️
|
||||
|
||||
---
|
||||
|
||||
## Immediate vs WaitForFirstConsumer
|
||||
|
||||
- Immediate = as soon as there is a `Pending` PVC, create a PV
|
||||
|
||||
- What if:
|
||||
|
||||
- the PV is only available on a node (e.g. local volume)
|
||||
|
||||
- ...or on a subset of nodes (e.g. SAN HBA, EBS AZ...)
|
||||
|
||||
- the Pod that will use the PVC has scheduling constraints
|
||||
|
||||
- these constraints turn out to be incompatible with the PV
|
||||
|
||||
- WaitForFirstConsumer = don't provision the PV until a Pod mounts the PVC
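
A StorageClass using that mode (for instance, one meant for manually created local volumes) is a one-field difference; here is a sketch:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```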
|
||||
|
||||
---
|
||||
|
||||
## Using the PVC
|
||||
|
||||
- Let's mount the PVC in a Pod
|
||||
|
||||
- We will use a stray Pod (no Deployment, StatefulSet, etc.)
|
||||
|
||||
- We will use @@LINK[k8s/mounter.yaml], shown on the next slide
|
||||
|
||||
- We'll need to update the `claimName`! ⚠️
|
||||
|
||||
---
|
||||
|
||||
```yaml
|
||||
@@INCLUDE[k8s/mounter.yaml]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Running the Pod
|
||||
|
||||
.exercise[
|
||||
|
||||
- Edit the `mounter.yaml` manifest
|
||||
|
||||
- Update the `claimName` to put the name of our PVC
|
||||
|
||||
- Create the Pod
|
||||
|
||||
- Check the status of the PV and PVC
|
||||
|
||||
]
|
||||
|
||||
Note: this "mounter" Pod can be useful to inspect the content of a PVC.
|
||||
|
||||
---
|
||||
|
||||
## Scenario 1 & 2
|
||||
|
||||
If we have a default Storage Class that can provision volumes dynamically...
|
||||
|
||||
- We should now have a new PV
|
||||
|
||||
- The PV and the PVC should be `Bound` together
|
||||
|
||||
---
|
||||
|
||||
## Scenario 3
|
||||
|
||||
If we don't have a default Storage Class, we must create the PV manually.
|
||||
|
||||
```bash
|
||||
kubectl create -f ~/container.training/k8s/pv.yaml
|
||||
```
|
||||
|
||||
After a few seconds, check that the PV and PVC are bound:
|
||||
|
||||
```bash
|
||||
kubectl get pv,pvc
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Scenario 4
|
||||
|
||||
If our default Storage Class can't provision a PV, let's do it manually.
|
||||
|
||||
The PV must specify the correct `storageClassName`.
|
||||
|
||||
```bash
|
||||
STORAGECLASS=$(kubectl get pvc --selector=container.training/pvc \
|
||||
-o jsonpath={..storageClassName})
|
||||
kubectl patch -f ~/container.training/k8s/pv.yaml --dry-run=client -o yaml \
|
||||
--patch '{"spec": {"storageClassName": "'$STORAGECLASS'"}}' \
|
||||
| kubectl create -f-
|
||||
```
|
||||
|
||||
Check that the PV and PVC are bound:
|
||||
|
||||
```bash
|
||||
kubectl get pv,pvc
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Checking the Pod
|
||||
|
||||
- If the PVC was `Pending`, then the Pod was `Pending` too
|
||||
|
||||
- Once the PVC is `Bound`, the Pod can be scheduled and can run
|
||||
|
||||
- Once the Pod is `Running`, check it out with `kubectl attach -ti`
|
||||
|
||||
---
|
||||
|
||||
## PV and PVC lifecycle
|
||||
|
||||
- We can't delete a PV if it's `Bound`
|
||||
|
||||
- If we `kubectl delete` it, it goes to `Terminating` state
|
||||
|
||||
- We can't delete a PVC if it's in use by a Pod
|
||||
|
||||
- Likewise, if we `kubectl delete` it, it goes to `Terminating` state
|
||||
|
||||
- Deletion is prevented by *finalizers*
|
||||
|
||||
(=like a post-it note saying “don't delete me!”)
|
||||
|
||||
- When the mounting Pods are deleted, their PVCs are freed up
|
||||
|
||||
- When PVCs are deleted, their PVs are freed up
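
For example, an in-use PVC that has been deleted looks roughly like this (abridged, illustrative values):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc-xyz45
  deletionTimestamp: "2024-01-01T00:00:00Z"
  finalizers:
  - kubernetes.io/pvc-protection    # the "post-it note" blocking deletion
spec:
  accessModes: [ ReadWriteOnce ]
  resources:
    requests:
      storage: 1G
```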
|
||||
|
||||
???
|
||||
|
||||
:EN:- Storage provisioning
|
||||
:EN:- PV, PVC, StorageClass
|
||||
:FR:- Création de volumes
|
||||
:FR:- PV, PVC, et StorageClass
|
||||
468
slides/k8s/stateful-failover.md
Normal file
@@ -0,0 +1,468 @@
|
||||
# Stateful failover
|
||||
|
||||
- How can we achieve true durability?
|
||||
|
||||
- How can we store data that would survive the loss of a node?
|
||||
|
||||
--
|
||||
|
||||
- We need to use Persistent Volumes backed by highly available storage systems
|
||||
|
||||
- There are many ways to achieve that:
|
||||
|
||||
- leveraging our cloud's storage APIs
|
||||
|
||||
- using NAS/SAN systems or file servers
|
||||
|
||||
- distributed storage systems
|
||||
|
||||
---
|
||||
|
||||
## Our test scenario
|
||||
|
||||
- We will use it to deploy a SQL database (PostgreSQL)
|
||||
|
||||
- We will insert some test data in the database
|
||||
|
||||
- We will disrupt the node running the database
|
||||
|
||||
- We will see how it recovers
|
||||
|
||||
---
|
||||
|
||||
## Our Postgres Stateful set
|
||||
|
||||
- The next slide shows `k8s/postgres.yaml`
|
||||
|
||||
- It defines a Stateful set
|
||||
|
||||
- With a `volumeClaimTemplate` requesting a 1 GB volume
|
||||
|
||||
- That volume will be mounted to `/var/lib/postgresql/data`
|
||||
|
||||
---
|
||||
|
||||
.small[.small[
|
||||
```yaml
|
||||
@@INCLUDE[k8s/postgres.yaml]
|
||||
```
|
||||
]]
|
||||
|
||||
---
|
||||
|
||||
## Creating the Stateful set
|
||||
|
||||
- Before applying the YAML, watch what's going on with `kubectl get events -w`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Apply that YAML:
|
||||
```bash
|
||||
kubectl apply -f ~/container.training/k8s/postgres.yaml
|
||||
```
|
||||
|
||||
<!-- ```hide kubectl wait pod postgres-0 --for condition=ready``` -->
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Testing our PostgreSQL pod
|
||||
|
||||
- We will use `kubectl exec` to get a shell in the pod
|
||||
|
||||
- Good to know: we need to use the `postgres` user in the pod
|
||||
|
||||
.exercise[
|
||||
|
||||
- Get a shell in the pod, as the `postgres` user:
|
||||
```bash
|
||||
kubectl exec -ti postgres-0 -- su postgres
|
||||
```
|
||||
|
||||
<!--
|
||||
autopilot prompt detection expects $ or # at the beginning of the line.
|
||||
```wait postgres@postgres```
|
||||
```keys PS1="\u@\h:\w\n\$ "```
|
||||
```key ^J```
|
||||
-->
|
||||
|
||||
- Check that default databases have been created correctly:
|
||||
```bash
|
||||
psql -l
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
(This should show us 3 lines: postgres, template0, and template1.)
|
||||
|
||||
---
|
||||
|
||||
## Inserting data in PostgreSQL
|
||||
|
||||
- We will create a database and populate it with `pgbench`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Create a database named `demo`:
|
||||
```bash
|
||||
createdb demo
|
||||
```
|
||||
|
||||
- Populate it with `pgbench`:
|
||||
```bash
|
||||
pgbench -i demo
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
- The `-i` flag means "create tables"
|
||||
|
||||
- If you want more data in the test tables, add e.g. `-s 10` (to get 10x more rows)
|
||||
|
||||
---
|
||||
|
||||
## Checking how much data we have now
|
||||
|
||||
- The `pgbench` tool inserts rows in table `pgbench_accounts`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check that the `demo` base exists:
|
||||
```bash
|
||||
psql -l
|
||||
```
|
||||
|
||||
- Check how many rows we have in `pgbench_accounts`:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_accounts"
|
||||
```
|
||||
|
||||
- Check that `pgbench_history` is currently empty:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_history"
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Testing the load generator
|
||||
|
||||
- Let's use `pgbench` to generate a few transactions
|
||||
|
||||
.exercise[
|
||||
|
||||
- Run `pgbench` for 10 seconds, reporting progress every second:
|
||||
```bash
|
||||
pgbench -P 1 -T 10 demo
|
||||
```
|
||||
|
||||
- Check the size of the history table now:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_history"
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
Note: on small cloud instances, a typical speed is about 100 transactions/second.
|
||||
|
||||
---
|
||||
|
||||
## Generating transactions
|
||||
|
||||
- Now let's use `pgbench` to generate more transactions
|
||||
|
||||
- While it's running, we will disrupt the database server
|
||||
|
||||
.exercise[
|
||||
|
||||
- Run `pgbench` for 10 minutes, reporting progress every second:
|
||||
```bash
|
||||
pgbench -P 1 -T 600 demo
|
||||
```
|
||||
|
||||
- You can use a longer time period if you need more time to run the next steps
|
||||
|
||||
<!-- ```tmux split-pane -h``` -->
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Find out which node is hosting the database
|
||||
|
||||
- We can find that information with `kubectl get pods -o wide`
|
||||
|
||||
.exercise[
|
||||
|
||||
- Check the node running the database:
|
||||
```bash
|
||||
kubectl get pod postgres-0 -o wide
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
We are going to disrupt that node.
|
||||
|
||||
--
|
||||
|
||||
By "disrupt" we mean: "disconnect it from the network".
|
||||
|
||||
---
|
||||
|
||||
## Node failover
|
||||
|
||||
⚠️ This will partially break your cluster!
|
||||
|
||||
- We are going to disconnect the node running PostgreSQL from the cluster
|
||||
|
||||
- We will see what happens, and how to recover
|
||||
|
||||
- We will not reconnect the node to the cluster
|
||||
|
||||
- This whole lab will take at least 10-15 minutes (due to various timeouts)
|
||||
|
||||
⚠️ Only do this lab at the very end, when you don't want to run anything else after!
|
||||
|
||||
---
|
||||
|
||||
## Disconnecting the node from the cluster
|
||||
|
||||
.exercise[
|
||||
|
||||
- Find out where the Pod is running, and SSH into that node:
|
||||
```bash
|
||||
kubectl get pod postgres-0 -o jsonpath={.spec.nodeName}
|
||||
ssh nodeX
|
||||
```
|
||||
|
||||
- Check the name of the network interface:
|
||||
```bash
|
||||
sudo ip route ls default
|
||||
```
|
||||
|
||||
- The output should look like this:
|
||||
```
|
||||
default via 10.10.0.1 `dev ensX` proto dhcp src 10.10.0.13 metric 100
|
||||
```
|
||||
|
||||
- Shutdown the network interface:
|
||||
```bash
|
||||
sudo ip link set ensX down
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## Another way to disconnect the node
|
||||
|
||||
- We can also use `iptables` to block all traffic exiting the node
|
||||
|
||||
(except SSH traffic, so we can repair the node later if needed)
|
||||
|
||||
.exercise[
|
||||
|
||||
- SSH to the node to disrupt:
|
||||
```bash
|
||||
ssh `nodeX`
|
||||
```
|
||||
|
||||
- Allow SSH traffic leaving the node, but block all other traffic:
|
||||
```bash
|
||||
sudo iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
|
||||
sudo iptables -I OUTPUT 2 -j DROP
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Watch what's going on
|
||||
|
||||
- Let's look at the status of Nodes, Pods, and Events
|
||||
|
||||
.exercise[
|
||||
|
||||
- In a first pane/tab/window, check Nodes and Pods:
|
||||
```bash
|
||||
watch kubectl get nodes,pods -o wide
|
||||
```
|
||||
|
||||
- In another pane/tab/window, check Events:
|
||||
```bash
|
||||
kubectl get events --watch
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Node Ready → NotReady
|
||||
|
||||
- After \~30 seconds, the control plane stops receiving heartbeats from the Node
|
||||
|
||||
- The Node is marked NotReady
|
||||
|
||||
- It is not *schedulable* anymore
|
||||
|
||||
(the scheduler won't place new pods there, except some special cases)
|
||||
|
||||
- All Pods on that Node are also *not ready*
|
||||
|
||||
(they get removed from service Endpoints)
|
||||
|
||||
- ... But nothing else happens for now
|
||||
|
||||
(the control plane is waiting: maybe the Node will come back shortly?)
|
||||
|
||||
---
|
||||
|
||||
## Pod eviction
|
||||
|
||||
- After \~5 minutes, the control plane will evict most Pods from the Node
|
||||
|
||||
- These Pods are now `Terminating`
|
||||
|
||||
- The Pods controlled by e.g. ReplicaSets are automatically moved
|
||||
|
||||
(or rather: new Pods are created to replace them)
|
||||
|
||||
- But nothing happens to the Pods controlled by StatefulSets at this point
|
||||
|
||||
(they remain `Terminating` forever)
|
||||
|
||||
- Why? 🤔
|
||||
|
||||
--
|
||||
|
||||
- This is to avoid *split brain scenarios*
|
||||
|
||||
---
|
||||
|
||||
class: extra-details
|
||||
|
||||
## Split brain 🧠⚡️🧠
|
||||
|
||||
- Imagine that we create a replacement pod `postgres-0` on another Node
|
||||
|
||||
- And 15 minutes later, the Node is reconnected and the original `postgres-0` comes back
|
||||
|
||||
- Which one is the "right" one?
|
||||
|
||||
- What if they have conflicting data?
|
||||
|
||||
😱
|
||||
|
||||
- We *cannot* let that happen!
|
||||
|
||||
- Kubernetes won't do it
|
||||
|
||||
- ... Unless we tell it to
|
||||
|
||||
---
|
||||
|
||||
## The Node is gone
|
||||
|
||||
- One thing we can do, is tell Kubernetes "the Node won't come back"
|
||||
|
||||
(there are other methods; but this one is the simplest one here)
|
||||
|
||||
- This is done with a simple `kubectl delete node`
|
||||
|
||||
.exercise[
|
||||
|
||||
- `kubectl delete` the Node that we disconnected
|
||||
|
||||
]
|
||||
|
||||
---
|
||||
|
||||
## Pod rescheduling
|
||||
|
||||
- Kubernetes removes the Node
|
||||
|
||||
- After a brief period of time (\~1 minute) the "Terminating" Pods are removed
|
||||
|
||||
- A replacement Pod is created on another Node
|
||||
|
||||
- ... But it doesn't start yet!
|
||||
|
||||
- Why? 🤔
|
||||
|
||||
---
|
||||
|
||||
## Multiple attachment
|
||||
|
||||
- By default, a disk can only be attached to one Node at a time
|
||||
|
||||
(sometimes it's a hardware or API limitation; sometimes enforced in software)
|
||||
|
||||
- In our Events, we should see `FailedAttachVolume` and `FailedMount` messages
|
||||
|
||||
- After \~5 more minutes, the disk will be force-detached from the old Node
|
||||
|
||||
- ... Which will allow attaching it to the new Node!
|
||||
|
||||
🎉
|
||||
|
||||
- The Pod will then be able to start
|
||||
|
||||
- Failover is complete!
|
||||
|
||||
---
|
||||
|
||||
## Check that our data is still available
|
||||
|
||||
- We are going to reconnect to the (new) pod and check
|
||||
|
||||
.exercise[
|
||||
|
||||
- Get a shell on the pod:
|
||||
```bash
|
||||
kubectl exec -ti postgres-0 -- su postgres
|
||||
```
|
||||
|
||||
<!--
|
||||
```wait postgres@postgres```
|
||||
```keys PS1="\u@\h:\w\n\$ "```
|
||||
```key ^J```
|
||||
-->
|
||||
|
||||
- Check how many transactions are now in the `pgbench_history` table:
|
||||
```bash
|
||||
psql demo -c "select count(*) from pgbench_history"
|
||||
```
|
||||
|
||||
<!-- ```key ^D``` -->
|
||||
|
||||
]
|
||||
|
||||
If the 10-second test that we ran earlier gave e.g. 80 transactions per second,
|
||||
and we failed the node after 30 seconds, we should have about 2,400 rows in that table.
|
||||
|
||||
---
|
||||
|
||||
## Double-check that the pod has really moved
|
||||
|
||||
- Just to make sure the system is not bluffing!
|
||||
|
||||
.exercise[
|
||||
|
||||
- Look at which node the pod is now running on
|
||||
```bash
|
||||
kubectl get pod postgres-0 -o wide
|
||||
```
|
||||
|
||||
]
|
||||
|
||||
???
|
||||
|
||||
:EN:- Using highly available persistent volumes
|
||||
:EN:- Example: deploying a database that can withstand node outages
|
||||
|
||||
:FR:- Utilisation de volumes à haute disponibilité
|
||||
:FR:- Exemple : déployer une base de données survivant à la défaillance d'un nœud
|
||||
@@ -6,7 +6,7 @@
|
||||
|
||||
- They offer mechanisms to deploy scaled stateful applications
|
||||
|
||||
- At a first glance, they look like *deployments*:
|
||||
- At a first glance, they look like Deployments:
|
||||
|
||||
- a stateful set defines a pod spec and a number of replicas *R*
|
||||
|
||||
@@ -182,503 +182,30 @@ spec:
|
||||
|
||||
- These pods can each have their own persistent storage
|
||||
|
||||
(Deployments cannot do that)
|
||||
|
||||
---
|
||||
|
||||
# Running a Consul cluster
|
||||
## Obtaining per-pod storage
|
||||
|
||||
- Here is a good use-case for Stateful sets!
|
||||
- Stateful Sets can have *persistent volume claim templates*
|
||||
|
||||
- We are going to deploy a Consul cluster with 3 nodes
|
||||
(declared in `spec.volumeClaimTemplates` in the Stateful set manifest)
|
||||
|
||||
- Consul is a highly-available key/value store
|
||||
- A claim template will create one Persistent Volume Claim per pod
|
||||
|
||||
(like etcd or Zookeeper)
|
||||
(the PVC will be named `<claim-name>-<stateful-set-name>-<pod-index>`)
|
||||
|
||||
- One easy way to bootstrap a cluster is to tell each node:
|
||||
- Persistent Volume Claims are matched 1-to-1 with Persistent Volumes
|
||||
|
||||
- the addresses of other nodes
|
||||
- Persistent Volume provisioning can be done:
|
||||
|
||||
- how many nodes are expected (to know when quorum is reached)
|
||||
- automatically (by leveraging *dynamic provisioning* with a Storage Class)
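
As a sketch (claim and set names are illustrative), the relevant part of a StatefulSet manifest, and the PVC names it produces, look like this:

```yaml
# Excerpt of a StatefulSet spec:
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: [ ReadWriteOnce ]
    resources:
      requests:
        storage: 1G
# For a StatefulSet named "consul" with 3 replicas, this yields PVCs
# named data-consul-0, data-consul-1, data-consul-2.
```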
|
||||
|
||||
---
|
||||
|
||||
## Bootstrapping a Consul cluster
|
||||
|
||||
*After reading the Consul documentation carefully (and/or asking around),
|
||||
we figure out the minimal command-line to run our Consul cluster.*
|
||||
|
||||
```
|
||||
consul agent -data-dir=/consul/data -client=0.0.0.0 -server -ui \
|
||||
-bootstrap-expect=3 \
|
||||
-retry-join=`X.X.X.X` \
|
||||
-retry-join=`Y.Y.Y.Y`
|
||||
```
|
||||
|
||||
- Replace X.X.X.X and Y.Y.Y.Y with the addresses of other nodes
|
||||
|
||||
- A node can add its own address (it will work fine)
|
||||
|
||||
- ... Which means that we can use the same command-line on all nodes (convenient!)
|
||||
|
||||
---
|
||||
|
||||
## Cloud Auto-join
|
||||
|
||||
- Since version 1.4.0, Consul can use the Kubernetes API to find its peers
|
||||
|
||||
- This is called [Cloud Auto-join]
|
||||
|
||||
- Instead of passing an IP address, we need to pass a parameter like this:
|
||||
|
||||
```
|
||||
consul agent -retry-join "provider=k8s label_selector=\"app=consul\""
|
||||
```
|
||||
|
||||
- Consul needs to be able to talk to the Kubernetes API
|
||||
|
||||
- We can provide a `kubeconfig` file
|
||||
|
||||
- If Consul runs in a pod, it will use the *service account* of the pod
|
||||
|
||||
[Cloud Auto-join]: https://www.consul.io/docs/agent/cloud-auto-join.html#kubernetes-k8s-
|
||||
|
||||
---

## Setting up Cloud auto-join

- We need to create a service account for Consul

- We need to create a role that can `list` and `get` pods

- We need to bind that role to the service account

- And of course, we need to make sure that Consul pods use that service account
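
- As a sketch, this could be done imperatively with the commands below

  (the name `consul` and the `default` namespace are assumptions; the training
  actually uses declarative manifests, in `k8s/consul-1.yaml`)

  ```bash
  kubectl create serviceaccount consul
  kubectl create role consul --verb=get --verb=list --resource=pods
  kubectl create rolebinding consul --role=consul --serviceaccount=default:consul
  ```

  (the pod template then needs to specify `serviceAccountName: consul`)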

---

## Putting it all together

- The file `k8s/consul-1.yaml` defines the required resources

  (service account, role, role binding, service, stateful set)

- Inspired by this [excellent tutorial](https://github.com/kelseyhightower/consul-on-kubernetes) by Kelsey Hightower

  (many features from the original tutorial were removed for simplicity)

---

## Running our Consul cluster

- We'll use the provided YAML file

.exercise[

- Create the stateful set and associated service:
  ```bash
  kubectl apply -f ~/container.training/k8s/consul-1.yaml
  ```

- Check the logs as the pods come up one after another:
  ```bash
  stern consul
  ```

<!--
```wait Synced node info```
```key ^C```
-->

- Check the health of the cluster:
  ```bash
  kubectl exec consul-0 -- consul members
  ```

]

---

## Caveats

- The scheduler may place two Consul pods on the same node

  - if that node fails, we lose two Consul pods at the same time

  - this will cause the cluster to fail

- Scaling down the cluster will cause it to fail

  - when a Consul member leaves the cluster, it needs to inform the others

  - otherwise, the last remaining node doesn't have quorum and stops functioning

- This Consul cluster doesn't use real persistence yet

  - data is stored in the containers' ephemeral filesystem

  - if a pod fails, its replacement starts from a blank slate

---

## Improving pod placement

- We need to tell the scheduler:

  *do not put two of these pods on the same node!*

- This is done with an `affinity` section like the following one:
  ```yaml
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: consul
        topologyKey: kubernetes.io/hostname
  ```

---

## Using a lifecycle hook

- When a Consul member leaves the cluster, it needs to execute:
  ```bash
  consul leave
  ```

- This is done with a `lifecycle` section like the following one:
  ```yaml
  lifecycle:
    preStop:
      exec:
        command: [ "sh", "-c", "consul leave" ]
  ```

---

## Running a better Consul cluster

- Let's try to add the scheduling constraint and lifecycle hook

- We can do that in the same namespace or another one (as we like)

- If we do that in the same namespace, we will see a rolling update

  (pods will be replaced one by one)

.exercise[

- Deploy a better Consul cluster:
  ```bash
  kubectl apply -f ~/container.training/k8s/consul-2.yaml
  ```

]

---

## Still no persistence, though

- We aren't using actual persistence yet

  (no `volumeClaimTemplate`, Persistent Volume, etc.)

- What happens if we lose a pod?

  - a new pod gets rescheduled (with an empty state)

  - the new pod tries to connect to the two others

  - it will be accepted (after 1-2 minutes of instability)

  - and it will retrieve the data from the other pods

---

## Failure modes

- What happens if we lose two pods?

  - manual repair will be required

  - we will need to instruct the remaining one to act solo

  - then rejoin new pods

- What happens if we lose three pods? (aka all of them)

  - we lose all the data (ouch)

- If we run Consul without persistent storage, backups are a good idea!
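
- For instance, a backup could be taken with `consul snapshot` and copied out of a pod

  (a sketch; paths and file names are just examples)

  ```bash
  kubectl exec consul-0 -- consul snapshot save /tmp/backup.snap
  kubectl cp consul-0:/tmp/backup.snap ./consul-backup.snap
  ```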

---

# Persistent Volume Claims

- Our Pods can use a special volume type: a *Persistent Volume Claim*

- A Persistent Volume Claim (PVC) is also a Kubernetes resource

  (visible with `kubectl get persistentvolumeclaims` or `kubectl get pvc`)

- A PVC is not a volume; it is a *request for a volume*

- It should indicate at least:

  - the size of the volume (e.g. "5 GiB")

  - the access mode (e.g. "read-write by a single pod")

---

## What's in a PVC?

- A PVC contains at least:

  - a list of *access modes* (ReadWriteOnce, ReadOnlyMany, ReadWriteMany)

  - a size (interpreted as the minimal storage space needed)

- It can also contain optional elements:

  - a selector (to restrict which actual volumes it can use)

  - a *storage class* (used by dynamic provisioning, more on that later)

---

## What does a PVC look like?

Here is a manifest for a basic PVC:

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

---

## Using a Persistent Volume Claim

Here is a Pod definition like the ones shown earlier, but using a PVC:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-a-claim
spec:
  containers:
  - image: ...
    name: container-using-a-claim
    volumeMounts:
    - mountPath: /my-vol
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-claim
```

---

## Creating and using Persistent Volume Claims

- PVCs can be created manually and used explicitly

  (as shown on the previous slides)

- They can also be created and used through Stateful Sets

  (this will be shown later)

---

## Lifecycle of Persistent Volume Claims

- When a PVC is created, it starts in "Unbound" state (its status shows as `Pending`)

  (without an associated volume)

- A Pod referencing an unbound PVC will not start

  (the scheduler will wait until the PVC is bound to place it)

- A special controller continuously monitors PVCs to associate them with PVs

- If no PV is available, one must be created:

  - manually (by operator intervention)

  - using a *dynamic provisioner* (more on that later)
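
- To observe this, we can watch the status of a PVC while it gets bound

  (the name below is just an example)

  ```bash
  kubectl get pvc my-claim --watch
  ```

  (the status shows as `Pending` until a matching PV is bound, then `Bound`)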

---

class: extra-details

## Which PV gets associated to a PVC?

- The PV must satisfy the PVC constraints

  (access mode, size, optional selector, optional storage class)

- The PVs with the closest access mode are picked

- Then the PVs with the closest size

- It is possible to specify a `claimRef` when creating a PV

  (this will associate it to the specified PVC, but only if the PV satisfies all the requirements of the PVC; otherwise another PV might end up being picked)

- For all the details about the PersistentVolumeClaimBinder, check [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/persistent-storage.md#matching-and-binding)
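
---

class: extra-details

## Reserving a PV with `claimRef` (sketch)

The manifest below is only an illustration (the names, capacity, and `hostPath` are assumptions): it pre-binds a PV to a PVC named `my-claim` in the `default` namespace.

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-reserved-for-my-claim
spec:
  accessModes: [ ReadWriteOnce ]
  capacity:
    storage: 1Gi
  hostPath:
    path: /data/pv-reserved-for-my-claim
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    namespace: default
    name: my-claim
```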

---

## Persistent Volume Claims and Stateful sets

- A Stateful set can define one (or more) `volumeClaimTemplate`

- Each `volumeClaimTemplate` will create one Persistent Volume Claim per pod

- Each pod will therefore have its own individual volume

- These volumes are numbered (like the pods)

- Example:

  - a Stateful set is named `db`

  - it is scaled to 3 replicas

  - it has a `volumeClaimTemplate` named `data`

  - then it will create pods `db-0`, `db-1`, `db-2`

  - these pods will have volumes named `data-db-0`, `data-db-1`, `data-db-2`

---

## Persistent Volume Claims are sticky

- When updating the stateful set (e.g. image upgrade), each pod keeps its volume

- When pods get rescheduled (e.g. node failure), they keep their volume

  (this requires a storage system that is not node-local)

- These volumes are not automatically deleted

  (when the stateful set is scaled down or deleted)

- If a stateful set is scaled back up later, the pods get their data back

---

## Dynamic provisioners

- A *dynamic provisioner* monitors unbound PVCs

- It can create volumes (and the corresponding PV) on the fly

- This requires the PVCs to have a *storage class*

  (annotation `volume.beta.kubernetes.io/storage-provisioner`)

- A dynamic provisioner only acts on PVCs with the right storage class

  (it ignores the other ones)

- Just like `LoadBalancer` services, dynamic provisioners are optional

  (i.e. our cluster may or may not have one pre-installed)

---

## What's a Storage Class?

- A Storage Class is yet another Kubernetes API resource

  (visible with e.g. `kubectl get storageclass` or `kubectl get sc`)

- It indicates which *provisioner* to use

  (which controller will create the actual volume)

- And arbitrary parameters for that provisioner

  (replication levels, type of disk ... anything relevant!)

- Storage Classes are required if we want to use [dynamic provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/)

  (but we can also create volumes manually, and ignore Storage Classes)
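
- For illustration, a Storage Class manifest can be as simple as this

  (a sketch assuming the AWS EBS provisioner; the name and parameters are arbitrary)

  ```yaml
  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: my-sc
  provisioner: kubernetes.io/aws-ebs
  parameters:
    type: gp2
  ```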

---

## The default storage class

- At most one storage class can be marked as the default class

  (by annotating it with `storageclass.kubernetes.io/is-default-class=true`)

- When a PVC is created, it will be annotated with the default storage class

  (unless it specifies an explicit storage class)

- This only happens at PVC creation

  (existing PVCs are not updated when we mark a class as the default one)
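
- For example, assuming that a class named `my-sc` exists, we could mark it as the default one like this:
  ```bash
  kubectl annotate storageclass my-sc \
          storageclass.kubernetes.io/is-default-class=true
  ```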

---

## Dynamic provisioning setup

This is how we can achieve fully automated provisioning of persistent storage.

1. Configure a storage system.

   (It needs to have an API, or be capable of automated provisioning of volumes.)

2. Install a dynamic provisioner for this storage system.

   (This is some specific controller code.)

3. Create a Storage Class for this system.

   (It has to match what the dynamic provisioner is expecting.)

4. Annotate the Storage Class to be the default one.

---

## Dynamic provisioning usage

After setting up the system (previous slide), all we need to do is:

*Create a Stateful Set that makes use of a `volumeClaimTemplate`.*

This will trigger the following actions.

1. The Stateful Set creates PVCs according to the `volumeClaimTemplate`.

2. The Stateful Set creates Pods using these PVCs.

3. The PVCs are automatically annotated with our Storage Class.

4. The dynamic provisioner provisions volumes and creates the corresponding PVs.

5. The PersistentVolumeClaimBinder associates the PVs and the PVCs together.

6. PVCs are now bound, the Pods can start.

???

:EN:- Deploying apps with Stateful Sets
:EN:- Example: deploying a Consul cluster
:EN:- Understanding Persistent Volume Claims and Storage Classes
:FR:- Déployer une application avec un *Stateful Set*
:FR:- Exemple : lancer un cluster Consul
:FR:- Comprendre les *Persistent Volume Claims* et *Storage Classes*

314
slides/k8s/volume-claim-templates.md
Normal file
@@ -0,0 +1,314 @@

## Putting it all together

- We want to run that Consul cluster *and* actually persist data

- We'll use a StatefulSet that will leverage PV and PVC

- If we have a dynamic provisioner:

  *the cluster will come up right away*

- If we don't have a dynamic provisioner:

  *we will need to create Persistent Volumes manually*

---

## Persistent Volume Claims and Stateful sets

- A Stateful set can define one (or more) `volumeClaimTemplate`

- Each `volumeClaimTemplate` will create one Persistent Volume Claim per Pod

- Each Pod will therefore have its own individual volume

- These volumes are numbered (like the Pods)

- Example:

  - a Stateful set is named `consul`

  - it is scaled to 3 replicas

  - it has a `volumeClaimTemplate` named `data`

  - then it will create pods `consul-0`, `consul-1`, `consul-2`

  - these pods will have volumes named `data`, referencing PersistentVolumeClaims
    named `data-consul-0`, `data-consul-1`, `data-consul-2`
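
---

## What this looks like in YAML (sketch)

Here is a rough sketch of a Stateful set with a `volumeClaimTemplates` section

(only an illustration: the image, size, and mount path are assumptions, not the exact content of the manifests used in this training)

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: consul
spec:
  serviceName: consul
  replicas: 3
  selector:
    matchLabels:
      app: consul
  template:
    metadata:
      labels:
        app: consul
    spec:
      containers:
      - name: consul
        image: consul:1.6
        volumeMounts:
        - name: data
          mountPath: /consul/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ ReadWriteOnce ]
      resources:
        requests:
          storage: 1Gi
```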

---

## Persistent Volume Claims are sticky

- When updating the stateful set (e.g. image upgrade), each pod keeps its volume

- When pods get rescheduled (e.g. node failure), they keep their volume

  (this requires a storage system that is not node-local)

- These volumes are not automatically deleted

  (when the stateful set is scaled down or deleted)

- If a stateful set is scaled back up later, the pods get their data back

---

## Deploying Consul

- Let's use a new manifest for our Consul cluster

- The only differences between that file and the previous one are:

  - `volumeClaimTemplate` defined in the Stateful Set spec

  - the corresponding `volumeMounts` in the Pod spec

.exercise[

- Apply the persistent Consul YAML file:
  ```bash
  kubectl apply -f ~/container.training/k8s/consul-3.yaml
  ```

]

---

## No dynamic provisioner

- If we don't have a dynamic provisioner, we need to create the PVs

- We are going to use local volumes

  (similar conceptually to `hostPath` volumes)

- We can use local volumes without installing extra plugins

- However, they are tied to a node

- If that node goes down, the volume becomes unavailable
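
---

## What such a local volume looks like (sketch)

The manifest below is only an illustration (capacity, path, and node name are assumptions; the actual objects used here are defined in `k8s/volumes-for-consul.yaml`):

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: consul-pv-node2
spec:
  capacity:
    storage: 1Gi
  accessModes: [ ReadWriteOnce ]
  local:
    path: /mnt/consul
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: [ node2 ]
```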

---

## Observing the situation

- Let's look at Persistent Volume Claims and Pods

.exercise[

- Check that we now have an unbound Persistent Volume Claim:
  ```bash
  kubectl get pvc
  ```

- We don't have any Persistent Volume:
  ```bash
  kubectl get pv
  ```

- The Pod `consul-0` is not scheduled yet:
  ```bash
  kubectl get pods -o wide
  ```

]

*Hint: leave these commands running with `-w` in different windows.*

---

## Explanations

- In a Stateful Set, the Pods are started one by one

- `consul-1` won't be created until `consul-0` is running

- `consul-0` has a dependency on an unbound Persistent Volume Claim

- The scheduler won't schedule the Pod until the PVC is bound

  (because the PVC might be bound to a volume that is only available on a subset of nodes; for instance, EBS volumes are tied to an availability zone)

---

## Creating Persistent Volumes

- Let's create 3 local directories (`/mnt/consul`) on node2, node3, node4

- Then create 3 Persistent Volumes corresponding to these directories

.exercise[

- Create the local directories:
  ```bash
  for NODE in node2 node3 node4; do
    ssh $NODE sudo mkdir -p /mnt/consul
  done
  ```

- Create the PV objects:
  ```bash
  kubectl apply -f ~/container.training/k8s/volumes-for-consul.yaml
  ```

]

---

## Check our Consul cluster

- The PVs that we created will be automatically matched with the PVCs

- Once a PVC is bound, its pod can start normally

- Once the pod `consul-0` has started, `consul-1` can be created, etc.

- Eventually, our Consul cluster is up, and backed by "persistent" volumes

.exercise[

- Check that our Consul cluster indeed has 3 members:
  ```bash
  kubectl exec consul-0 -- consul members
  ```

]

---

## Devil is in the details (1/2)

- The size of the Persistent Volumes is bogus

  (it is used when matching PVs and PVCs together, but there is no actual quota or limit)

- The Pod might end up using more than the requested size

- The PV may or may not have the capacity that it's advertising

- It works well with dynamically provisioned block volumes

- ...Less so in other scenarios!

---

## Devil is in the details (2/2)

- This specific example worked because we had exactly 1 free PV per node:

  - if we had created multiple PVs per node ...

  - we could have ended up with two PVCs bound to PVs on the same node ...

  - which would have required two pods to be on the same node ...

  - which is forbidden by the anti-affinity constraints in the StatefulSet

- To avoid that, we need to associate the PVs with a Storage Class that has:
  ```yaml
  volumeBindingMode: WaitForFirstConsumer
  ```
  (this means that a PVC will be bound to a PV only after being used by a Pod)

- See [this blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) for more details
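
- A sketch of such a Storage Class for manually created local volumes

  (the name is an example; the PVs and the `volumeClaimTemplates` would then both specify `storageClassName: local-storage`)

  ```yaml
  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: local-storage
  provisioner: kubernetes.io/no-provisioner
  volumeBindingMode: WaitForFirstConsumer
  ```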

---

## If we have a dynamic provisioner

These are the steps when dynamic provisioning happens:

1. The Stateful Set creates PVCs according to the `volumeClaimTemplate`.

2. The Stateful Set creates Pods using these PVCs.

3. The PVCs are automatically annotated with our Storage Class.

4. The dynamic provisioner provisions volumes and creates the corresponding PVs.

5. The PersistentVolumeClaimBinder associates the PVs and the PVCs together.

6. PVCs are now bound, the Pods can start.

---

## Validating persistence (1)

- When the StatefulSet is deleted, the PVC and PV still exist

- And if we recreate an identical StatefulSet, the PVC and PV are reused

- Let's see that!

.exercise[

- Put some data in Consul:
  ```bash
  kubectl exec consul-0 -- consul kv put answer 42
  ```

- Delete the Consul cluster:
  ```bash
  kubectl delete -f ~/container.training/k8s/consul-3.yaml
  ```

]

---

## Validating persistence (2)

.exercise[

- Wait until the last Pod is deleted:
  ```bash
  kubectl wait pod consul-0 --for=delete
  ```

- Check that PV and PVC are still here:
  ```bash
  kubectl get pv,pvc
  ```

]

---

## Validating persistence (3)

.exercise[

- Re-create the cluster:
  ```bash
  kubectl apply -f ~/container.training/k8s/consul-3.yaml
  ```

- Wait until it's up

- Then access the key that we set earlier:
  ```bash
  kubectl exec consul-0 -- consul kv get answer
  ```

]

---

## Cleaning up

- PV and PVC don't get deleted automatically

- This is great (less risk of accidental data loss)

- This is not great (storage usage increases)

- Managing PVC lifecycle:

  - remove them manually

  - add their StatefulSet to their `ownerReferences`

  - delete the Namespace that they belong to
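
- For instance, removing them manually for our Consul example would look like this

  (careful: depending on the reclaim policy, this can delete the underlying data!)

  ```bash
  kubectl delete pvc data-consul-0 data-consul-1 data-consul-2
  ```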

???

:EN:- Defining volumeClaimTemplates
:FR:- Définir des volumeClaimTemplates

@@ -84,5 +84,9 @@ content:
- k8s/configuration.md
- k8s/secrets.md
- k8s/statefulsets.md
- k8s/local-persistent-volumes.md
- k8s/portworx.md
- k8s/consul.md
- k8s/pv-pvc-sc.md
- k8s/volume-claim-templates.md
#- k8s/portworx.md
- k8s/openebs.md
- k8s/stateful-failover.md

@@ -110,8 +110,12 @@ content:
#- k8s/prometheus.md
#- k8s/prometheus-stack.md
#- k8s/statefulsets.md
#- k8s/local-persistent-volumes.md
#- k8s/consul.md
#- k8s/pv-pvc-sc.md
#- k8s/volume-claim-templates.md
#- k8s/portworx.md
#- k8s/openebs.md
#- k8s/stateful-failover.md
#- k8s/extending-api.md
#- k8s/crd.md
#- k8s/admission.md

@@ -112,9 +112,12 @@ content:
- k8s/configuration.md
- k8s/secrets.md
- k8s/statefulsets.md
- k8s/local-persistent-volumes.md
- k8s/consul.md
- k8s/pv-pvc-sc.md
- k8s/volume-claim-templates.md
- k8s/portworx.md
- k8s/openebs.md
- k8s/stateful-failover.md
-
- k8s/logs-centralized.md
- k8s/prometheus.md

@@ -110,9 +110,12 @@ content:
#- k8s/prometheus-stack.md
-
- k8s/statefulsets.md
- k8s/local-persistent-volumes.md
- k8s/portworx.md
#- k8s/openebs.md
- k8s/consul.md
- k8s/pv-pvc-sc.md
- k8s/volume-claim-templates.md
#- k8s/portworx.md
- k8s/openebs.md
- k8s/stateful-failover.md
#- k8s/extending-api.md
#- k8s/admission.md
#- k8s/operators.md