merge

2026-05-21 14:22:46 +00:00 · 2021-11-26 10:25:54 +01:00
parent d175bbb21e b0619006b8
commit 21aadc19c0
21 changed files with 878 additions and 1 deletions
--- a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md
+++ b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md
@@ -0,0 +1,22 @@
+---
+title: Alertmanager Cluster Failed To Send Alerts
+weight: 20
+---
+
+# AlertmanagerClusterFailedToSendAlerts
+
+## Meaning
+
+All Alertmanager instances in the cluster failed to send notifications to an integration.
+
+## Impact
+
+You will not receive a notification when an alert is raised.
+
+## Diagnosis
+
+No alerts are received at the integration level from the cluster. 
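+
+To see which integration is affected, a query along these lines can help. It is
+only a sketch: it assumes the standard Alertmanager metrics
+`alertmanager_notifications_failed_total` and `alertmanager_notifications_total`
+are scraped, and you may need to add label matchers for your own job or namespace.
+
+```console
+sum by (integration) (rate(alertmanager_notifications_failed_total[5m]))
+/
+sum by (integration) (rate(alertmanager_notifications_total[5m]))
+```
+
+A value close to 1 means that nearly every notification to that integration fails.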
+
+## Mitigation
+
+Depending on the integration, fix whatever prevents the Alertmanager instances from reaching it (network, authorization token, firewall, ...).
--- a/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md
+++ b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md
@@ -0,0 +1,24 @@
+---
+title: Alertmanager Config Inconsistent
+weight: 20
+---
+
+# AlertmanagerConfigInconsistent
+
+## Meaning
+
+The configuration between instances inside a cluster is inconsistent.
+
+## Impact
+
+Configuration inconsistencies can take many forms, so the impact is hard to predict.
+Nevertheless, in most cases alerts might be lost or routed to the wrong integration.
+
+## Diagnosis
+
+Run a `diff` tool across all deployed `alertmanager.yml` files to find what differs.
+You could run a job within your CI to avoid this issue in the future.
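+
+Before diffing files, it can help to check how many distinct configurations are
+loaded. The query below is a sketch that assumes the `alertmanager_config_hash`
+metric is scraped from every instance; it returns one series per distinct hash,
+with the number of instances running that configuration as its value.
+
+```console
+count_values("config_hash", alertmanager_config_hash)
+```
+
+More than one returned series means that the instances disagree on their configuration.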
+
+## Mitigation
+
+Delete the incorrect secret and deploy the correct one.
--- a/content/runbooks/alertmanager/AlertmanagerFailedReload.md
+++ b/content/runbooks/alertmanager/AlertmanagerFailedReload.md
@@ -7,7 +7,9 @@ weight: 20

 ## Meaning

-The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance for the cluster monitoring stack has consistently failed to reload its configuration for a certain period.
+The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance
+for the cluster monitoring stack has consistently failed to reload its
+configuration for a certain period.

 ## Impact

--- a/content/runbooks/etcd/_index.md
+++ b/content/runbooks/etcd/_index.md
@@ -0,0 +1,7 @@
+---
+title: etcd
+bookCollapseSection: true
+bookFlatSection: true
+weight: 10
+---
+
--- a/content/runbooks/etcd/etcdBackendQuotaLowSpace.md
+++ b/content/runbooks/etcd/etcdBackendQuotaLowSpace.md
@@ -0,0 +1,81 @@
+# etcdBackendQuotaLowSpace
+
+## Meaning
+
+This alert fires when the total existing DB size exceeds 95% of the maximum
+DB quota. In Prometheus, the consumed space is represented by the metric
+`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by
+`etcd_server_quota_backend_bytes`.
+
+## Impact
+
+In case the DB size exceeds the DB quota, no writes can be performed anymore on
+the etcd cluster. This further prevents any updates in the cluster, such as the
+creation of pods.
+
+## Diagnosis
+
+The following two approaches can be used for the diagnosis.
+
+### CLI Checks
+
+To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
+etcd pod.
+
+```console
+$ NAMESPACE="kube-etcd"
+$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh
+```
+
+Validate that the `etcdctl` command is available:
+
+```console
+$ etcdctl version
+```
+
+`etcdctl` can be used to fetch the DB size of the etcd endpoints.
+
+```console
+$ etcdctl endpoint status -w table
+```
+
+### PromQL queries
+
+Check the percentage consumption of etcd DB with the following query in the
+metrics console:
+
+```console
+(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100
+```
+
+Check the DB size in MB that can be reduced after defragmentation:
+
+```console
+(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
+```
+
+## Mitigation
+
+### Capacity planning
+
+If `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to
+`etcd_server_quota_backend_bytes`, etcd has almost reached its maximum capacity
+and it is time to start planning for a new cluster.
+
+In the meantime before migration happens, you can use defrag to gain some time.
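+
+To estimate how much time is left, the growth can be extrapolated with
+`predict_linear`. This is a sketch, not the alert's own rule: the 4 hour range
+and the 4 hour horizon are assumptions that you should tune to your environment.
+
+```console
+predict_linear(etcd_mvcc_db_total_size_in_bytes[4h], 4 * 60 * 60) > etcd_server_quota_backend_bytes
+```
+
+Any series returned belongs to an etcd member that is expected to exceed its
+quota within the chosen horizon if growth continues at the current rate.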
+
+### Defrag
+
+When the etcd DB size increases, we can defragment the existing etcd DB to
+optimize DB consumption, as described [here][etcdDefragmentation]. Run the
+following command in all etcd pods.
+
+```console
+$ etcdctl defrag
+```
+
+To validate, check the endpoint status of the etcd members to see the reduced
+size of the etcd DB, using the same diagnostic approaches as listed above. More
+space should be available now.
+
+[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
--- a/content/runbooks/etcd/etcdGRPCRequestsSlow.md
+++ b/content/runbooks/etcd/etcdGRPCRequestsSlow.md
@@ -0,0 +1,96 @@
+# etcdGRPCRequestsSlow
+
+## Meaning
+
+This alert fires when the 99th percentile of etcd gRPC request latency is too high.
+
+## Impact
+
+When requests are too slow, they can lead to various scenarios like leader
+election failure, slow reads and writes.
+
+## Diagnosis
+
+This could be the result of a slow disk (due to fragmented state) or CPU contention.
+
+### Slow disk
+
+One of the most common reasons for slow gRPC requests is a slow disk. Checking
+disk-related metrics and dashboards should provide a clearer picture.
+
+#### PromQL queries used to troubleshoot
+
+Verify the value of how slow the etcd gRPC requests are by using the following
+query in the metrics console:
+
+```console
+histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
+```
+
+That result should give a rough timeline of when the issue started.
+
+`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
+duration, and `etcd_server_leader_changes_seen_total` reports the leader
+changes. To rule out a slow disk and confirm that the disk is reasonably fast,
+the 99th percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be
+less than 10ms. Query in the metrics UI:
+
+```console
+histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
+```
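+
+The leader changes mentioned above can be inspected in the same way. The query
+below is a sketch: the 1 hour window is an assumption, and the `job` matcher
+simply mirrors the one used in the gRPC latency query.
+
+```console
+increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[1h])
+```
+
+A value that keeps rising alongside high fsync durations suggests that the slow
+disk is causing leader elections.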
+
+#### Console dashboards
+
+In the OpenShift dashboard console, under the Observe section, select the etcd
+dashboard. It contains both RPC rate and Disk Sync Duration panels, which
+assist with further diagnosis.
+
+### Resource exhaustion
+
+It can happen that etcd responds slower due to CPU resource exhaustion.
+This was seen in some cases when one application was requesting too much CPU
+which led to this alert firing for multiple methods.
+
+If this is the case, `etcd_disk_wal_fsync_duration_seconds_bucket` often
+reports slower fsyncs as well.
+
+To confirm this is the cause of the slow requests, either:
+
+1. In the OpenShift console, on the primary page under "Cluster utilization",
+   view the requested CPU vs. available.
+
+2. Use the following PromQL query to see the top consumers of CPU:
+
+```console
+      topk(25, sort_desc(
+        sum by (namespace) (
+          (
+            sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod)
+            *
+            on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
+          )
+          *
+          on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
+        )
+      ))
+```
+
+## Mitigation
+
+### Fragmented state
+
+In the case of a slow disk or when the etcd DB size increases, we can defragment
+the existing etcd DB to optimize DB consumption, as described
+[here][etcdDefragmentation]. Run the following command in all etcd pods.
+
+```console
+$ etcdctl defrag
+```
+
+To validate, check the endpoint status of the etcd members to see the reduced
+size of the etcd DB, using the same diagnostic approaches as listed above. More
+space should be available now.
+
+Further info on etcd best practices can be found in the [OpenShift docs
+here][etcdPractices].
+
+[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
+[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
--- a/content/runbooks/etcd/etcdHighFsyncDurations.md
+++ b/content/runbooks/etcd/etcdHighFsyncDurations.md
@@ -0,0 +1,55 @@
+# etcdHighFsyncDurations
+
+## Meaning
+
+This alert fires when the 99th percentile of etcd disk fsync duration is too
+high for 10 minutes.
+
+## Impact
+
+When this happens it can lead to various scenarios like leader election failure,
+frequent leader elections, slow reads and writes.
+
+## Diagnosis
+
+This could be the result of a slow disk, possibly due to fragmented state in
+etcd.
+
+### Slow disk
+
+Checking disk-related metrics and dashboards should provide a clearer
+picture.
+
+#### PromQL queries used to troubleshoot
+
+`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
+duration, and `etcd_server_leader_changes_seen_total` reports the leader
+changes. To rule out a slow disk and confirm that the disk is reasonably fast,
+the 99th percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be
+less than 10ms. Query in the metrics UI:
+
+```console
+histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
+```
+
+## Mitigation
+
+### Fragmented state
+
+In the case of a slow disk or when the etcd DB size increases, we can defragment
+the existing etcd DB to optimize DB consumption, as described
+[here][etcdDefragmentation]. Run the following command in all etcd pods.
+
+```console
+$ etcdctl defrag
+```
+
+To validate, check the endpoint status of the etcd members to see the reduced
+size of the etcd DB, using the same diagnostic approaches as listed above. More
+space should be available now.
+
+Further info on etcd best practices can be found in the [OpenShift docs
+here][etcdPractices].
+
+[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
+[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
--- a/content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md
+++ b/content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md
@@ -0,0 +1,41 @@
+# etcdHighNumberOfFailedGRPCRequests
+
+## Meaning
+
+This alert fires when at least 50% of etcd gRPC requests failed in the past 10
+minutes.
+
+## Impact
+
+First establish which gRPC method is failing; this will be visible in the alert.
+If it's not part of the alert, the following query will display the method and
+etcd instance that have failing requests:
+
+```sh
+100 * sum without(grpc_type, grpc_code)
+(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m]))
+/ sum without(grpc_type, grpc_code)
+(rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on()
+(sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1))
+```
+
+## Diagnosis
+
+All the gRPC errors should also be logged in the logs of the respective etcd instance.
+You can get the instance name from the alert that is firing or by running the
+query detailed above. Those etcd instance logs should serve as further insight
+into what is wrong.
+
+To get logs of etcd containers either check the instance from the alert and
+check logs directly or run the following:
+
+```sh
+NAMESPACE="kube-etcd"
+kubectl logs -n $NAMESPACE -l app=etcd -c etcd
+```
+
+## Mitigation
+
+Depending on the above diagnosis, the issue will most likely be described in the
+error log line of either etcd or openshift-etcd-operator. Most likely causes
+tend to be networking issues.
--- a/content/runbooks/etcd/etcdInsufficientMembers.md
+++ b/content/runbooks/etcd/etcdInsufficientMembers.md
@@ -0,0 +1,65 @@
+# etcdInsufficientMembers
+
+## Meaning
+
+This alert fires when there are fewer instances available than are needed by
+etcd to be healthy.
+
+## Impact
+
+When etcd does not have a majority of instances available, the Kubernetes and
+OpenShift APIs will reject read and write requests, and operations that preserve
+the health of workloads cannot be performed.
+
+## Diagnosis
+
+This can occur when multiple control plane nodes are powered off or are unable
+to connect to each other via the network. Check that all control plane nodes are
+powered and that network connections between each machine are functional.
+
+Check any other critical, warning or info alerts firing that can assist with the
+diagnosis.
+
+Log in to the cluster and check whether any of the master nodes is in the
+`NotReady` state.
+
+```console
+$ kubectl get nodes -l node-role.kubernetes.io/master=
+```
+
+### General etcd health
+
+To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
+etcd pod.
+
+```console
+$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh
+```
+
+Validate that the `etcdctl` command is available:
+
+```console
+$ etcdctl version
+```
+
+Run the following command to get the health of etcd:
+
+```console
+$ etcdctl endpoint health -w table
+```
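+
+The share of reachable members can also be checked from Prometheus. This is a
+sketch that assumes the etcd scrape job matches `.*etcd.*`, as in the other
+etcd runbooks; adjust the matcher to your setup.
+
+```console
+count(up{job=~".*etcd.*"} == 1) / count(up{job=~".*etcd.*"})
+```
+
+etcd needs a strict majority, so the result has to stay above 0.5. If no member
+is reachable at all, the query returns no data.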
+
+## Mitigation
+
+### Disaster and recovery
+
+If an upgrade is in progress, the alert may resolve automatically after some
+time, when the master node comes up again. If MCO is not working on the master
+node, check the cloud provider to verify whether the master node instances are
+running.
+
+If you are running on AWS, an AWS instance retirement might require a manual
+reboot of the master node.
+
+As a last resort, if none of the above fixes the issue and the alert is still
+firing, follow the steps described in the [disaster and recovery docs][docs]
+for etcd-specific issues.
+
+[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html
--- a/content/runbooks/etcd/etcdMembersDown.md
+++ b/content/runbooks/etcd/etcdMembersDown.md
@@ -0,0 +1,68 @@
+# etcdMembersDown
+
+## Meaning
+
+This alert fires when one or more etcd members go down and evaluates the
+number of etcd members that are currently down. This alert is often observed
+as part of a cluster upgrade when a master node is being upgraded and requires a
+reboot.
+
+## Impact
+
+In etcd, a majority of (n/2)+1 members has to agree on membership changes or
+key-value upgrade proposals. With this approach, a split-brain inconsistency can
+be avoided. If only one member is down in a 3-member cluster, the cluster can
+still make forward progress, because the quorum is 2 and 2 members are still
+alive. However, when more members are down, the cluster loses quorum and
+becomes unrecoverable.
+
+## Diagnosis
+
+Log in to the cluster and check whether any of the master nodes is in the
+`NotReady` state.
+
+```console
+$ kubectl get nodes -l node-role.kubernetes.io/master=
+```
+
+In case there is no upgrade going on, but there is a change in the
+`machineconfig` for the master pool causing a rolling reboot of each master
+node, this alert can be triggered as well. We can check if the
+`machineconfiguration.openshift.io/state : Working` annotation is set for any of
+the master nodes. This is the case when the [machine-config-operator
+(MCO)](https://github.com/openshift/machine-config-operator) is working on it.
+
+```console
+$ kubectl get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
+```
+
+### General etcd health
+
+To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
+etcd pod.
+
+```console
+$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh
+```
+
+Validate that the `etcdctl` command is available:
+
+```console
+$ etcdctl version
+```
+
+Run the following command to get the health of etcd:
+
+```console
+$ etcdctl endpoint health -w table
+```
+
+## Mitigation
+
+If an upgrade is in progress, the alert may resolve automatically after some
+time, when the master node comes up again. If MCO is not working on the master
+node, check the cloud provider to verify whether the master node instances are
+running.
+
+If you are running on AWS, an AWS instance retirement might require a manual
+reboot of the master node.
+
--- a/content/runbooks/etcd/etcdNoLeader.md
+++ b/content/runbooks/etcd/etcdNoLeader.md
@@ -0,0 +1,42 @@
+# etcdNoLeader
+
+## Meaning
+
+This alert is triggered when the etcd cluster does not have a leader for more
+than 1 minute.
+
+## Impact
+
+When there is no leader, the Kubernetes API will not be able to work as
+expected and the cluster cannot process any writes or reads. Any write requests
+are queued for processing until a new leader is elected. Operations that
+preserve the health of the workloads cannot be performed.
+
+## Diagnosis
+
+### Control plane nodes issue
+
+This can occur when multiple control plane nodes are powered off or are unable
+to connect to each other via the network. Check that all control plane nodes are
+powered and that network connections between each machine are functional.
+
+### Slow disk issue
+
+Another potential cause is a slow disk. Inspect the `Disk Sync Duration`
+dashboard, as well as `Total Leader Elections Per Day`, to get more insight and
+help with diagnosis.
+
+### Other
+
+Check the logs of etcd containers to see any further information and to verify
+that etcd does not have a leader. Logs should contain something like `etcdserver:
+no leader`. 
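+
+The same can be verified from Prometheus. The query below is a sketch that
+assumes the `etcd_server_has_leader` metric is scraped and that the etcd job
+matches `.*etcd.*`; it lists the members that currently see no leader.
+
+```console
+etcd_server_has_leader{job=~".*etcd.*"} == 0
+```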
+
+## Mitigation
+
+### Disaster and recovery
+
+Follow the steps described in the [disaster and recovery docs][docs].
+
+[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html
--- a/content/runbooks/kubernetes/KubeAPIDown.md
+++ b/content/runbooks/kubernetes/KubeAPIDown.md
@@ -0,0 +1,35 @@
+# KubeAPIDown
+
+## Meaning
+
+The `KubeAPIDown` alert is triggered when all Kubernetes API servers have not
+been reachable by the monitoring system for more than 15 minutes.
+
+## Impact
+
+This is a critical alert. The Kubernetes API is not responding. The
+cluster may be partially or fully non-functional.
+
+## Diagnosis
+
+Check the status of the API server targets in the Prometheus UI.
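+
+The target status can also be queried directly. This sketch assumes the API
+server scrape job is named `apiserver`, which depends on your Prometheus
+configuration.
+
+```console
+up{job="apiserver"}
+```
+
+A value of 0, or no result at all, means Prometheus cannot scrape the API server.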
+
+Then, confirm whether the API is also unresponsive for you:
+
+```console
+$ kubectl cluster-info
+```
+
+If you can still reach the API server, there may be a network issue between the
+Prometheus instances and the API server pods. Check the status of the API server
+pods.
+
+```console
+$ kubectl -n kube-system get pods
+$ kubectl -n kube-system logs -l 'app=kube-apiserver'
+```
+
+## Mitigation
+
+If you can still reach the API server intermittently, you may be able to treat
+this like any other failing deployment. If not, you may have to refer to the
+disaster recovery documentation.
--- a/content/runbooks/kubernetes/KubeNodeNotReady.md
+++ b/content/runbooks/kubernetes/KubeNodeNotReady.md
@@ -0,0 +1,39 @@
+# KubeNodeNotReady
+
+## Meaning
+
+The `KubeNodeNotReady` alert fires when a Kubernetes node is not in the `Ready`
+state for a certain period. In this case, the node is not able to host any new
+pods, as described [here][KubeNode].
+
+## Impact
+
+The performance of the cluster deployments is affected, depending on the overall
+workload and the type of the node.
+
+## Diagnosis
+
+The notification details should list the node that's not ready. For example:
+
+```txt
+ - alertname = KubeNodeNotReady
+...
+ - node = node1.example.com
+...
+```
+
+Log in to the cluster. Check the status of that node:
+
+```console
+$ kubectl get node $NODE -o yaml
+```
+
+The output should describe why the node isn't ready (e.g. timeouts reaching the
+API or kubelet).
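+
+To list every node that is currently not ready, a query like the following can
+be used. It is a sketch that assumes kube-state-metrics is deployed and exposes
+`kube_node_status_condition`.
+
+```console
+kube_node_status_condition{condition="Ready", status="true"} == 0
+```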
+
+## Mitigation
+
+Once the problem that prevented the node from being replaced is resolved,
+the instance should be terminated.
+
+[KubeNode]: https://kubernetes.io/docs/concepts/architecture/nodes/#condition
--- a/content/runbooks/kubernetes/KubeletDown.md
+++ b/content/runbooks/kubernetes/KubeletDown.md
@@ -0,0 +1,38 @@
+# KubeletDown
+
+## Meaning
+
+This alert is triggered when the monitoring system has not been able to reach
+any of the cluster's Kubelets for more than 15 minutes.
+
+## Impact
+
+This alert represents a critical threat to the cluster's stability. Excluding
+the possibility of a network issue preventing the monitoring system from
+scraping Kubelet metrics, multiple nodes in the cluster are likely unable to
+respond to configuration changes for pods and other resources, and some
+debugging tools are likely not functional, e.g. `kubectl exec` and `kubectl logs`.
+
+## Diagnosis
+
+Check the status of nodes and for recent events on `Node` objects, or for recent
+events in general:
+
+```console
+$ kubectl get nodes
+$ kubectl describe node $NODE_NAME
+$ kubectl get events --field-selector 'involvedObject.kind=Node'
+$ kubectl get events
+```
+
+If you have SSH access to the nodes, access the logs for the Kubelet directly:
+
+```console
+$ journalctl -b -f -u kubelet.service
+```
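+
+To see which Kubelets the monitoring system cannot reach, the scrape status can
+be queried. This sketch assumes the Kubelet scrape job is named `kubelet`; the
+actual name depends on your Prometheus configuration.
+
+```console
+up{job="kubelet"} == 0
+```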
+
+## Mitigation
+
+The mitigation depends on what is causing the Kubelets to become
+unresponsive. Check for widespread networking issues or node-level
+configuration issues.
--- a/content/runbooks/node/NodeFileDescriptorLimit.md
+++ b/content/runbooks/node/NodeFileDescriptorLimit.md
@@ -0,0 +1,33 @@
+# NodeFileDescriptorLimit
+
+## Meaning
+
+This alert is triggered when a node's kernel is found to be running out of
+available file descriptors -- a `warning` level alert at greater than 70% usage
+and a `critical` level alert at greater than 90% usage.
+
+## Impact
+
+Applications on the node may no longer be able to open and operate on
+files. This is likely to have severe consequences for anything scheduled on this
+node.
+
+## Diagnosis
+
+You can open a shell on the node and use the standard Linux utilities to
+diagnose the issue:
+
+```console
+$ NODE_NAME='<value of instance label from alert>'
+
+$ oc debug "node/$NODE_NAME"
+# sysctl -a | grep 'fs.file-'
+fs.file-max = 1597016
+fs.file-nr = 7104       0       1597016
+# lsof -n
+```
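+
+The usage percentage that the thresholds refer to can be graphed in the metrics
+console. This sketch assumes the node_exporter metrics `node_filefd_allocated`
+and `node_filefd_maximum` are available.
+
+```console
+node_filefd_allocated{instance="<value of instance label from alert>"}
+/
+node_filefd_maximum{instance="<value of instance label from alert>"}
+* 100
+```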
+
+## Mitigation
+
+Reduce the number of files opened simultaneously by either adjusting application
+configuration or by moving some applications to other nodes.
--- a/content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md
+++ b/content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md
@@ -0,0 +1,27 @@
+# NodeFilesystemAlmostOutOfFiles
+
+## Meaning
+
+This alert is similar to the [NodeFilesystemFilesFillingUp][1] alert, but rather
+than being based on a prediction that a filesystem will run out of inodes in a
+certain amount of time, it uses simple static thresholds. The alert fires at
+a `warning` level with 5% of available inodes left, and at a `critical` level
+with 3% of available inodes left.
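+
+The percentage of free inodes can be checked with a query like the one below.
+It is a sketch that assumes the node_exporter metrics
+`node_filesystem_files_free` and `node_filesystem_files` are available.
+
+```console
+node_filesystem_files_free{mountpoint="<value of mountpoint label from alert>"}
+/
+node_filesystem_files{mountpoint="<value of mountpoint label from alert>"}
+* 100
+```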
+
+## Impact
+
+A node's filesystem becoming full can have a far-reaching impact, as it may
+cause any or all of the applications scheduled to that node to experience
+anything from performance degradation to full inoperability. Depending on the
+node and filesystem involved, this could pose a critical threat to the stability
+of the cluster.
+
+## Diagnosis
+
+Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
+
+## Mitigation
+
+Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
+
+[1]: ./NodeFilesystemFilesFillingUp.md
--- a/content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md
+++ b/content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md
@@ -0,0 +1,26 @@
+# NodeFilesystemAlmostOutOfSpace
+
+## Meaning
+
+This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
+than being based on a prediction that a filesystem will become full in a certain
+amount of time, it uses simple static thresholds. The alert fires at a
+`warning` level at 5% space left, and at a `critical` level with 3% space left.
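+
+The percentage of free space can be checked with a query like the one below. It
+is a sketch that assumes the node_exporter metrics `node_filesystem_avail_bytes`
+and `node_filesystem_size_bytes` are available.
+
+```console
+node_filesystem_avail_bytes{mountpoint="<value of mountpoint label from alert>"}
+/
+node_filesystem_size_bytes{mountpoint="<value of mountpoint label from alert>"}
+* 100
+```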
+
+## Impact
+
+A node's filesystem becoming full can have a far-reaching impact, as it may
+cause any or all of the applications scheduled to that node to experience
+anything from performance degradation to full inoperability. Depending on the
+node and filesystem involved, this could pose a critical threat to the stability
+of the cluster.
+
+## Diagnosis
+
+Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.
+
+## Mitigation
+
+Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.
+
+[1]: ./NodeFilesystemSpaceFillingUp.md
--- a/content/runbooks/node/NodeFilesystemFilesFillingUp.md
+++ b/content/runbooks/node/NodeFilesystemFilesFillingUp.md
@@ -0,0 +1,53 @@
+# NodeFilesystemFilesFillingUp
+
+## Meaning
+
+This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but
+predicts the filesystem will run out of inodes rather than bytes of storage
+space. The alert fires at a `critical` level when the filesystem is predicted to
+run out of available inodes within four hours.
+
+## Impact
+
+A node's filesystem becoming full can have a far-reaching impact, as it may
+cause any or all of the applications scheduled to that node to experience
+anything from performance degradation to full inoperability. Depending on the
+node and filesystem involved, this could pose a critical threat to the stability
+of the cluster.
+
+## Diagnosis
+
+Note the `instance` and `mountpoint` labels from the alert. You can graph the
+usage history of this filesystem with the following query in the OpenShift web
+console:
+
+```text
+node_filesystem_files_free{
+  instance="<value of instance label from alert>",
+  mountpoint="<value of mountpoint label from alert>"
+}
+```
+
+You can also open a debug session on the node and use the standard Linux
+utilities to locate the source of the usage:
+
+```console
+$ MOUNT_POINT='<value of mountpoint label from alert>'
+$ NODE_NAME='<value of instance label from alert>'
+
+$ oc debug "node/$NODE_NAME"
+$ df -hi "/host/$MOUNT_POINT"
+```
+
+Note that in many cases a filesystem running out of inodes will still have
+available storage. Running out of inodes is often caused by many small
+files being created by an application.
+
+## Mitigation
+
+The number of inodes allocated to a filesystem is usually based on the storage
+size. You may be able to solve the problem, or buy time, by increasing size of
+the storage volume. Otherwise, determine the application that is creating large
+numbers of files and adjust its configuration or provide it dedicated storage.
+
+[1]: ./NodeFilesystemSpaceFillingUp.md
--- a/content/runbooks/node/NodeFilesystemSpaceFillingUp.md
+++ b/content/runbooks/node/NodeFilesystemSpaceFillingUp.md
@@ -0,0 +1,62 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that the file system will run out of space within a
+certain time. This is a warning-level alert if that time is less than 24h. It's
+a critical alert if that time is less than 4h.
+
+## Impact
+
+A filesystem running full is very bad for any process that needs to write to
+the filesystem. But even before a filesystem runs full, performance usually
+degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert. Use the usual OS tools to investigate what directories are the worst
+and/or recent offenders. Is this some irregular condition, e.g. a process fails
+to clean up behind itself or is this organic growth? If monitoring is enabled,
+the following metric can be watched in PromQL.
+
+```console
+node_filesystem_free_bytes
+```
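+
+The linear extrapolation itself can be reproduced with `predict_linear`. This is
+a sketch rather than the exact alert rule: the 6 hour range is an assumption,
+and the 4 hour horizon corresponds to the critical level described above.
+
+```console
+predict_linear(node_filesystem_free_bytes{mountpoint="<value of mountpoint label from alert>"}[6h], 4 * 60 * 60) < 0
+```
+
+A returned series means that the filesystem is predicted to be full within the
+horizon if the recent trend continues.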
+
+Check the alert's `mountpoint` label.
+
+## Mitigation
+
+If the `mountpoint` label is `/`, `/sysroot` or `/var`, removing unused images
+solves the issue.
+
+Debug the node by accessing the node filesystem:
+
+```console
+$ NODE_NAME='<instance label from alert>'
+$ kubectl -n default debug node/$NODE_NAME
+$ chroot /host
+```
+
+Remove dangling images:
+
+```console
+# TODO: Command needed
+```
+
+Remove unused images:
+
+```console
+# TODO: Command needed
+```
+
+Exit debug:
+
+```console
+$ exit
+$ exit
+```
--- a/content/runbooks/node/NodeRAIDDegraded.md
+++ b/content/runbooks/node/NodeRAIDDegraded.md
@@ -0,0 +1,31 @@
+# NodeRAIDDegraded
+
+## Meaning
+
+This alert is triggered when a node has a storage configuration with a RAID array,
+and the array is reporting as being in a degraded state due to one or more disk
+failures.
+
+## Impact
+
+The affected node could go offline at any moment if the RAID array fully fails
+due to further issues with disks.
+
+## Diagnosis
+
+You can open a shell on the node and use the standard Linux utilities to
+diagnose the issue, but you may need to install additional software in the debug
+container:
+
+```console
+$ NODE_NAME='<value of instance label from alert>'
+
+$ oc debug "node/$NODE_NAME"
+$ cat /proc/mdstat
+```
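+
+The failed disks can also be looked up in Prometheus. This sketch assumes the
+node_exporter mdadm collector is enabled and exposes `node_md_disks` with a
+`state` label.
+
+```console
+node_md_disks{state="failed"} > 0
+```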
+
+## Mitigation
+
+See the Red Hat Enterprise Linux [documentation][1] for potential steps.
+
+[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices
--- a/content/runbooks/prometheus/PrometheusTargetSyncFailure.md
+++ b/content/runbooks/prometheus/PrometheusTargetSyncFailure.md
@@ -0,0 +1,30 @@
+# PrometheusTargetSyncFailure
+
+## Meaning
+
+This alert is triggered when at least one of the Prometheus instances has
+consistently failed to sync its configuration.
+
+## Impact
+
+Metrics and alerts may be missing or inaccurate.
+
+## Diagnosis
+
+Determine whether the alert is for the cluster or user workload Prometheus by
+inspecting the alert's `namespace` label.
+
+Check the logs for the appropriate Prometheus instance:
+
+```console
+$ NAMESPACE='<value of namespace label from alert>'
+
+$ oc -n $NAMESPACE logs -l 'app=prometheus'
+level=error ... msg="Creating target failed" ...
+```
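+
+To find out which scrape job is affected, the failure counter can be queried
+directly. This is a sketch that assumes the `prometheus_target_sync_failed_total`
+metric is scraped; the 30 minute window is an assumption.
+
+```console
+increase(prometheus_target_sync_failed_total[30m]) > 0
+```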
+
+## Mitigation
+
+If the logs indicate a syntax or other configuration error, correct the
+corresponding `ServiceMonitor`, `PodMonitor`, or other configuration
+resource. In almost all cases, the operator should prevent this from happening.