diff --git a/content/runbooks/alertmanager/AlertmanagerClusterDown.md b/content/runbooks/alertmanager/AlertmanagerClusterDown.md new file mode 100644 index 0000000..6048601 --- /dev/null +++ b/content/runbooks/alertmanager/AlertmanagerClusterDown.md @@ -0,0 +1,28 @@ +--- +title: Alertmanager Cluster Down +weight: 20 +--- + +# AlertmanagerClusterDown + +## Meaning + +Half or more of the Alertmanager instances within the same cluster are down. + +## Impact + +You have an unstable cluster, if everything goes wrong you will lose the whole cluster. + +## Diagnosis + +Verify why pods are not running. +You can get a big picture with `events`. + +```bash +$ kubectl get events --field-selector involvedObject.kind=Pod | grep alertmanager +``` + +## Mitigation + +There are no cheap options to mitigate this risk. +Verifying any new changes in preprod before production environment should improve stability.