From 8d8b22196b7ac4de8a1bcd22f551873d3b826f0d Mon Sep 17 00:00:00 2001 From: paulfantom Date: Fri, 5 Nov 2021 12:22:36 +0100 Subject: [PATCH 1/9] port runbooks from openshift Co-Authored-By: Brad Ison Co-Authored-By: Lili Cosic Co-Authored-By: Rick Rackow --- .../alertmanager/AlertmanagerFailedReload.md | 23 +++++ .../runbooks/etcd/etcdBackendQuotaLowSpace.md | 81 ++++++++++++++++ content/runbooks/etcd/etcdGRPCRequestsSlow.md | 96 +++++++++++++++++++ .../runbooks/etcd/etcdHighFsyncDurations.md | 55 +++++++++++ .../etcdHighNumberOfFailedGRPCRequests.md | 41 ++++++++ .../runbooks/etcd/etcdInsufficientMembers.md | 65 +++++++++++++ content/runbooks/etcd/etcdMembersDown.md | 68 +++++++++++++ content/runbooks/etcd/etcdNoLeader.md | 42 ++++++++ content/runbooks/kubernetes/KubeAPIDown.md | 35 +++++++ .../runbooks/kubernetes/KubeNodeNotReady.md | 39 ++++++++ content/runbooks/kubernetes/KubeletDown.md | 38 ++++++++ .../runbooks/node/NodeFileDescriptorLimit.md | 33 +++++++ .../node/NodeFilesystemAlmostOutOfFiles.md | 27 ++++++ .../node/NodeFilesystemAlmostOutOfSpace.md | 26 +++++ .../node/NodeFilesystemFilesFillingUp.md | 53 ++++++++++ .../node/NodeFilesystemSpaceFillingUp.md | 62 ++++++++++++ content/runbooks/node/NodeRAIDDegraded.md | 31 ++++++ .../prometheus/PrometheusTargetSyncFailure.md | 30 ++++++ 18 files changed, 845 insertions(+) create mode 100644 content/runbooks/alertmanager/AlertmanagerFailedReload.md create mode 100644 content/runbooks/etcd/etcdBackendQuotaLowSpace.md create mode 100644 content/runbooks/etcd/etcdGRPCRequestsSlow.md create mode 100644 content/runbooks/etcd/etcdHighFsyncDurations.md create mode 100644 content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md create mode 100644 content/runbooks/etcd/etcdInsufficientMembers.md create mode 100644 content/runbooks/etcd/etcdMembersDown.md create mode 100644 content/runbooks/etcd/etcdNoLeader.md create mode 100644 content/runbooks/kubernetes/KubeAPIDown.md create mode 100644 
content/runbooks/kubernetes/KubeNodeNotReady.md create mode 100644 content/runbooks/kubernetes/KubeletDown.md create mode 100644 content/runbooks/node/NodeFileDescriptorLimit.md create mode 100644 content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md create mode 100644 content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md create mode 100644 content/runbooks/node/NodeFilesystemFilesFillingUp.md create mode 100644 content/runbooks/node/NodeFilesystemSpaceFillingUp.md create mode 100644 content/runbooks/node/NodeRAIDDegraded.md create mode 100644 content/runbooks/prometheus/PrometheusTargetSyncFailure.md diff --git a/content/runbooks/alertmanager/AlertmanagerFailedReload.md b/content/runbooks/alertmanager/AlertmanagerFailedReload.md new file mode 100644 index 0000000..55b9f0b --- /dev/null +++ b/content/runbooks/alertmanager/AlertmanagerFailedReload.md @@ -0,0 +1,23 @@ +# AlertmanagerFailedReload + +## Meaning + +The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance +for the cluster monitoring stack has consistently failed to reload its +configuration for a certain period. + +## Impact + +Alerts for cluster components may not be delivered as expected. + +## Diagnosis + +Check the logs for the `alertmanager-main` pods in the `monitoring` namespace: + +```console +$ kubectl -n monitoring logs -l 'alertmanager=main' +``` + +## Mitigation + +The resolution depends on the particular issue reported in the logs. diff --git a/content/runbooks/etcd/etcdBackendQuotaLowSpace.md b/content/runbooks/etcd/etcdBackendQuotaLowSpace.md new file mode 100644 index 0000000..86f0358 --- /dev/null +++ b/content/runbooks/etcd/etcdBackendQuotaLowSpace.md @@ -0,0 +1,81 @@ +# etcdBackendQuotaLowSpace + +## Meaning + +This alert fires when the total existing DB size exceeds 95% of the maximum +DB quota. 
In Prometheus, the consumed space is represented by the metric +`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by +`etcd_server_quota_backend_bytes`. + +## Impact + +If the DB size exceeds the DB quota, no more writes can be performed on +the etcd cluster. This further prevents any updates in the cluster, such as the +creation of pods. + +## Diagnosis + +The following two approaches can be used for the diagnosis. + +### CLI Checks + +To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any +etcd pod. + +```console +$ NAMESPACE="kube-etcd" +$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh +``` + +Validate that the `etcdctl` command is available: + +```console +$ etcdctl version +``` + +`etcdctl` can be used to fetch the DB size of the etcd endpoints. + +```console +$ etcdctl endpoint status -w table +``` + +### PromQL queries + +Check the percentage consumption of etcd DB with the following query in the +metrics console: + +```console +(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100 +``` + +Check the DB size in MB that can be reduced after defragmentation: + +```console +(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024 +``` + +## Mitigation + +### Capacity planning + +If the `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to +the `etcd_server_quota_backend_bytes`, etcd has almost reached max capacity and it's +time to start planning for a new cluster. + +In the meantime, before the migration happens, you can use defrag to gain some time. + +### Defrag + +When the etcd DB size increases, we can defragment the existing etcd DB to optimize +DB consumption as described [here][etcdDefragmentation]. Run the following +command in all etcd pods. + +```console +$ etcdctl defrag +``` + +To validate, check the endpoint status of the etcd members to see the reduced +size of the etcd DB. 
Use the same diagnostic approaches as listed +above for this purpose. More space should be available now. + +[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/ diff --git a/content/runbooks/etcd/etcdGRPCRequestsSlow.md b/content/runbooks/etcd/etcdGRPCRequestsSlow.md new file mode 100644 index 0000000..8c8884d --- /dev/null +++ b/content/runbooks/etcd/etcdGRPCRequestsSlow.md @@ -0,0 +1,96 @@ +# etcdGRPCRequestsSlow + +## Meaning + +This alert fires when the 99th percentile of etcd gRPC requests is too slow. + +## Impact + +When requests are too slow, they can lead to various scenarios like leader +election failure, slow reads and writes. + +## Diagnosis + +This could be the result of a slow disk (due to fragmented state) or CPU contention. + +### Slow disk + +One of the most common reasons for slow gRPC requests is a slow disk. Checking disk +related metrics and dashboards should provide a clearer picture. + +#### PromQL queries used to troubleshoot + +Verify how slow the etcd gRPC requests are by using the following +query in the metrics console: + +```console +histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type)) +``` +That result should give a rough timeline of when the issue started. + +`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync +duration, `etcd_server_leader_changes_seen_total` reports the leader changes. To +rule out a slow disk and confirm that the disk is reasonably fast, the 99th +percentile of the `etcd_disk_wal_fsync_duration_seconds_bucket` should be less +than 10ms. Query in metrics UI: + +```console +histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))) +``` +#### Console dashboards + +In the OpenShift dashboard console under the Observe section, select the etcd +dashboard. 
There are both RPC rate and Disk Sync Duration dashboards +which will assist with further investigation. + +### Resource exhaustion + +It can happen that etcd responds slower due to CPU resource exhaustion. +This was seen in some cases when one application was requesting too much CPU, +which led to this alert firing for multiple methods. + +Often if this is the case, we also see +`etcd_disk_wal_fsync_duration_seconds_bucket` slower as well. + +To confirm this is the cause of the slow requests, either: + +1. In the OpenShift console, on the primary page under "Cluster utilization", view the + requested CPU vs available. + +2. Use the following PromQL query to see the top consumers of CPU: + +```console + topk(25, sort_desc( + sum by (namespace) ( + ( + sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod) + * + on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:) + ) + * + on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"})) + ) + )) +``` + +## Mitigation + +### Fragmented state + +In the case of a slow disk or when the etcd DB size increases, we can defragment +the existing etcd DB to optimize DB consumption as described +[here][etcdDefragmentation]. Run the following command in all etcd pods. + +```console +$ etcdctl defrag +``` + +To validate, check the endpoint status of the etcd members to see the reduced +size of the etcd DB. Use the same diagnostic approaches as listed +above for this purpose. More space should be available now. + +Further info on etcd best practices can be found in the [OpenShift docs +here][etcdPractices]. 
+ +[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/ +[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_ diff --git a/content/runbooks/etcd/etcdHighFsyncDurations.md b/content/runbooks/etcd/etcdHighFsyncDurations.md new file mode 100644 index 0000000..5691c05 --- /dev/null +++ b/content/runbooks/etcd/etcdHighFsyncDurations.md @@ -0,0 +1,55 @@ +# etcdHighFsyncDurations + +## Meaning + +This alert fires when the 99th percentile of etcd disk fsync duration is too +high for 10 minutes. + +## Impact + +When this happens it can lead to various scenarios like leader election failure, +frequent leader elections, slow reads and writes. + +## Diagnosis + +This could be the result of a fragmented state in etcd or +simply of a slow disk. + +### Slow disk + +Checking disk related metrics and dashboards should provide a clearer +picture. + +#### PromQL queries used to troubleshoot + +`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync +duration, `etcd_server_leader_changes_seen_total` reports the leader changes. To +rule out a slow disk and confirm that the disk is reasonably fast, the 99th +percentile of the `etcd_disk_wal_fsync_duration_seconds_bucket` should be less +than 10ms. Query in metrics UI: + +```console +histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))) +``` + +## Mitigation + +### Fragmented state + +In the case of a slow disk or when the etcd DB size increases, we can defragment +the existing etcd DB to optimize DB consumption as described +[here][etcdDefragmentation]. Run the following command in all etcd pods. + +```console +$ etcdctl defrag +``` + +To validate, check the endpoint status of the etcd members to see the reduced +size of the etcd DB. Use the same diagnostic approaches as listed +above for this purpose. 
More space should be available now. + +Further info on etcd best practices can be found in the [OpenShift docs +here][etcdPractices]. + +[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/ +[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_ diff --git a/content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md b/content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md new file mode 100644 index 0000000..ea3eea9 --- /dev/null +++ b/content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md @@ -0,0 +1,41 @@ +# etcdHighNumberOfFailedGRPCRequests + +## Meaning + +This alert fires when at least 50% of etcd gRPC requests failed in the past 10 +minutes. + +## Impact + +First establish which gRPC method is failing, this will be visible in the alert. +If it's not part of the alert, the following query will display method and etcd +instance that has failing requests: + +```sh +100 * sum without(grpc_type, grpc_code) +(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m])) +/ sum without(grpc_type, grpc_code) +(rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on() +(sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1)) +``` + +## Diagnosis + +All the gRPC errors should also be logged in each respective etcd instance logs. +You can get the instance name from the alert that is firing or by running the +query detailed above. Those etcd instance logs should serve as further insight +into what is wrong. 
+ +To get logs of etcd containers either check the instance from the alert and +check logs directly or run the following: + +```sh +NAMESPACE="kube-etcd" +kubectl logs -n $NAMESPACE -l app=etcd -c etcd +``` + +## Mitigation + +Depending on the above diagnosis, the issue will most likely be described in the +error log line of either etcd or openshift-etcd-operator. The most likely causes +tend to be networking issues. diff --git a/content/runbooks/etcd/etcdInsufficientMembers.md b/content/runbooks/etcd/etcdInsufficientMembers.md new file mode 100644 index 0000000..bb4f91d --- /dev/null +++ b/content/runbooks/etcd/etcdInsufficientMembers.md @@ -0,0 +1,65 @@ +# etcdInsufficientMembers + +## Meaning + +This alert fires when there are fewer instances available than are needed by +etcd to be healthy. + +## Impact + +When etcd does not have a majority of instances available, the Kubernetes and +OpenShift APIs will reject read and write requests, and operations that preserve +the health of workloads cannot be performed. + +## Diagnosis + +This can occur when multiple control plane nodes are powered off or are unable to +connect to each other via the network. Check that all control plane nodes are +powered and that network connections between each machine are functional. + +Check any other critical, warning or info alerts firing that can assist with the +diagnosis. + +Log in to the cluster. Check the health of the master nodes to see whether any of them is in the +`NotReady` state. + +```console +$ kubectl get nodes -l node-role.kubernetes.io/master= +``` + +### General etcd health + +To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any +etcd pod. 
+ +```console +$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh +``` + +Validate that the `etcdctl` command is available: + +```console +$ etcdctl version +``` + +Run the following command to get the health of etcd: + +```console +$ etcdctl endpoint health -w table +``` +## Mitigation + +### Disaster and recovery + +If an upgrade is in progress, the alert may automatically resolve after some time +when the master node comes up again. If MCO is not working on the master node, +check the cloud provider to verify whether the master node instances are running. + +If you are running on AWS, an AWS instance retirement might need +a manual reboot of the master node. + +As a last resort, if none of the above fixes the issue and the alert is still +firing, for etcd specific issues follow the steps described in the [disaster and +recovery docs][docs]. + +[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html diff --git a/content/runbooks/etcd/etcdMembersDown.md b/content/runbooks/etcd/etcdMembersDown.md new file mode 100644 index 0000000..f472e9e --- /dev/null +++ b/content/runbooks/etcd/etcdMembersDown.md @@ -0,0 +1,68 @@ +# etcdMembersDown + +## Meaning + +This alert fires when one or more etcd members go down and evaluates the +number of etcd members that are currently down. Often, this alert is observed +as part of a cluster upgrade when a master node is being upgraded and requires a +reboot. + +## Impact + +In etcd a majority of (n/2)+1 has to agree on membership changes or key-value +upgrade proposals. With this approach, a split-brain inconsistency can be +avoided. In the case that only one member is down in a 3-member cluster, it +can still make forward progress, because the quorum is 2 and 2 +members are still alive. 
However, when more members are down, the cluster +becomes unrecoverable. + +## Diagnosis + +Log in to the cluster. Check the health of the master nodes to see whether any of them is in the +`NotReady` state. + +```console +$ kubectl get nodes -l node-role.kubernetes.io/master= +``` + +In case there is no upgrade going on, but there is a change in the +`machineconfig` for the master pool causing a rolling reboot of each master +node, this alert can be triggered as well. We can check if the +`machineconfiguration.openshift.io/state : Working` annotation is set for any of +the master nodes. This is the case when the [machine-config-operator +(MCO)](https://github.com/openshift/machine-config-operator) is working on it. + +```console +$ kubectl get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}' +``` + +### General etcd health + +To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any +etcd pod. + +```console +$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- /bin/sh +``` + +Validate that the `etcdctl` command is available: + +```console +$ etcdctl version +``` + +Run the following command to get the health of etcd: + +```console +$ etcdctl endpoint health -w table +``` + +## Mitigation + +If an upgrade is in progress, the alert may automatically resolve after some time +when the master node comes up again. If MCO is not working on the master node, +check the cloud provider to verify whether the master node instances are running. + +If you are running on AWS, an AWS instance retirement might need +a manual reboot of the master node. 
+ diff --git a/content/runbooks/etcd/etcdNoLeader.md b/content/runbooks/etcd/etcdNoLeader.md new file mode 100644 index 0000000..8d05a33 --- /dev/null +++ b/content/runbooks/etcd/etcdNoLeader.md @@ -0,0 +1,42 @@ +# etcdNoLeader + +## Meaning + +This alert is triggered when the etcd cluster does not have a leader for more than 1 +minute. + +## Impact + +When there is no leader, the Kubernetes API will not be able to work +as expected and the cluster cannot process any writes or reads, and any write +requests are queued for processing until a new leader is elected. Operations +that preserve the health of the workloads cannot be performed. + +## Diagnosis + +### Control plane nodes issue + +This can occur when multiple control plane nodes are powered off or are unable to +connect to each other via the network. Check that all control plane nodes are +powered and that network connections between each machine are functional. + +### Slow disk issue + +Another potential cause could be a slow disk. Inspect the `Disk Sync +Duration` dashboard, as well as the `Total Leader Elections Per Day` dashboard, to get more +insight and help with diagnosis. + +### Other + +Check the logs of etcd containers to see any further information and to verify +that etcd does not have a leader. Logs should contain something like `etcdserver: +no leader`. + +## Mitigation + +### Disaster and recovery + +Follow the steps described in the [disaster and recovery docs][docs]. + + +[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html diff --git a/content/runbooks/kubernetes/KubeAPIDown.md b/content/runbooks/kubernetes/KubeAPIDown.md new file mode 100644 index 0000000..958c115 --- /dev/null +++ b/content/runbooks/kubernetes/KubeAPIDown.md @@ -0,0 +1,35 @@ +# KubeAPIDown + +## Meaning + +The `KubeAPIDown` alert is triggered when all Kubernetes API servers have not +been reachable by the monitoring system for more than 15 minutes. 
+ +## Impact + +This is a critical alert. The Kubernetes API is not responding. The +cluster may be partially or fully non-functional. + +## Diagnosis + +Check the status of the API server targets in the Prometheus UI. + +Then, confirm whether the API is also unresponsive for you: + +```console +$ kubectl cluster-info +``` + +If you can still reach the API server, there may be a network issue between the +Prometheus instances and the API server pods. Check the status of the API server +pods. + +```console +$ kubectl -n kube-system get pods +$ kubectl -n kube-system logs -l 'app=kube-apiserver' +``` +## Mitigation + +If you can still reach the API server intermittently, you may be able to treat this +like any other failing deployment. If not, it's possible you may have to refer +to the disaster recovery documentation. diff --git a/content/runbooks/kubernetes/KubeNodeNotReady.md b/content/runbooks/kubernetes/KubeNodeNotReady.md new file mode 100644 index 0000000..5d8e9bf --- /dev/null +++ b/content/runbooks/kubernetes/KubeNodeNotReady.md @@ -0,0 +1,39 @@ +# KubeNodeNotReady + +## Meaning + +The `KubeNodeNotReady` alert fires when a Kubernetes node is not in the `Ready` +state for a certain period. In this case, the node is not able to host any new +pods as described [here][KubeNode]. + +## Impact + +The performance of the cluster deployments is affected, depending on the overall +workload and the type of the node. + +## Diagnosis + +The notification details should list the node that's not ready. For example: + +```txt + - alertname = KubeNodeNotReady +... + - node = node1.example.com +... +``` + +Log in to the cluster. Check the status of that node: + +```console +$ kubectl get node $NODE -o yaml +``` + +The output should describe why the node isn't ready (e.g. timeouts reaching the +API or kubelet). + +## Mitigation + +Once the problem that prevented the node from being replaced is resolved, +the instance should be terminated. 
+ +[KubeNode]: https://kubernetes.io/docs/concepts/architecture/nodes/#condition diff --git a/content/runbooks/kubernetes/KubeletDown.md b/content/runbooks/kubernetes/KubeletDown.md new file mode 100644 index 0000000..a3bd13a --- /dev/null +++ b/content/runbooks/kubernetes/KubeletDown.md @@ -0,0 +1,38 @@ +# KubeletDown + +## Meaning + +This alert is triggered when the monitoring system has not been able to reach +any of the cluster's Kubelets for more than 15 minutes. + +## Impact + +This alert represents a critical threat to the cluster's stability. Excluding +the possibility of a network issue preventing the monitoring system from +scraping Kubelet metrics, multiple nodes in the cluster are likely unable to +respond to configuration changes for pods and other resources, and some +debugging tools are likely not functional, e.g. `kubectl exec` and `kubectl logs`. + +## Diagnosis + +Check the status of nodes and for recent events on `Node` objects, or for recent +events in general: + +```console +$ kubectl get nodes +$ kubectl describe node $NODE_NAME +$ kubectl get events --field-selector 'involvedObject.kind=Node' +$ kubectl get events +``` + +If you have SSH access to the nodes, access the logs for the Kubelet directly: + +```console +$ journalctl -b -f -u kubelet.service +``` + +## Mitigation + +The mitigation depends on what is causing the Kubelets to become +unresponsive. Check for wide-spread networking issues, or node level +configuration issues. diff --git a/content/runbooks/node/NodeFileDescriptorLimit.md b/content/runbooks/node/NodeFileDescriptorLimit.md new file mode 100644 index 0000000..9a1d0b5 --- /dev/null +++ b/content/runbooks/node/NodeFileDescriptorLimit.md @@ -0,0 +1,33 @@ +# NodeFileDescriptorLimit + +## Meaning + +This alert is triggered when a node's kernel is found to be running out of +available file descriptors -- a `warning` level alert at greater than 70% usage +and a `critical` level alert at greater than 90% usage. 
+ +## Impact + +Applications on the node may no longer be able to open and operate on +files. This is likely to have severe consequences for anything scheduled on this +node. + +## Diagnosis + +You can open a shell on the node and use the standard Linux utilities to +diagnose the issue: + +```console +$ NODE_NAME='' + +$ oc debug "node/$NODE_NAME" +# sysctl -a | grep 'fs.file-' +fs.file-max = 1597016 +fs.file-nr = 7104 0 1597016 +# lsof -n +``` + +## Mitigation + +Reduce the number of files opened simultaneously by either adjusting application +configuration or by moving some applications to other nodes. diff --git a/content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md b/content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md new file mode 100644 index 0000000..ff564da --- /dev/null +++ b/content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md @@ -0,0 +1,27 @@ +# NodeFilesystemAlmostOutOfFiles + +## Meaning + +This alert is similar to the [NodeFilesystemFilesFillingUp][1] alert, but rather +than being based on a prediction that a filesystem will run out of inodes in a +certain amount of time, it uses simple static thresholds. The alert will fire +at a `warning` level at 5% of available inodes left, and at a `critical` level +with 3% of available inodes left. + +## Impact + +A node's filesystem becoming full can have a far-reaching impact, as it may +cause any or all of the applications scheduled to that node to experience +anything from performance degradation to full inoperability. Depending on the +node and filesystem involved, this could pose a critical threat to the stability +of the cluster. + +## Diagnosis + +Refer to the [NodeFilesystemFilesFillingUp][1] runbook. + +## Mitigation + +Refer to the [NodeFilesystemFilesFillingUp][1] runbook. 
+ +[1]: ./NodeFilesystemFilesFillingUp.md diff --git a/content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md b/content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md new file mode 100644 index 0000000..5f241cf --- /dev/null +++ b/content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md @@ -0,0 +1,26 @@ +# NodeFilesystemAlmostOutOfSpace + +## Meaning + +This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather +than being based on a prediction that a filesystem will become full in a certain +amount of time, it uses simple static thresholds. The alert will fire at a +`warning` level at 5% space left, and at a `critical` level with 3% space left. + +## Impact + +A node's filesystem becoming full can have a far-reaching impact, as it may +cause any or all of the applications scheduled to that node to experience +anything from performance degradation to full inoperability. Depending on the +node and filesystem involved, this could pose a critical threat to the stability +of the cluster. + +## Diagnosis + +Refer to the [NodeFilesystemSpaceFillingUp][1] runbook. + +## Mitigation + +Refer to the [NodeFilesystemSpaceFillingUp][1] runbook. + +[1]: ./NodeFilesystemSpaceFillingUp.md diff --git a/content/runbooks/node/NodeFilesystemFilesFillingUp.md b/content/runbooks/node/NodeFilesystemFilesFillingUp.md new file mode 100644 index 0000000..62de979 --- /dev/null +++ b/content/runbooks/node/NodeFilesystemFilesFillingUp.md @@ -0,0 +1,53 @@ +# NodeFilesystemFilesFillingUp + +## Meaning + +This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but +predicts the filesystem will run out of inodes rather than bytes of storage +space. The alert fires at a `critical` level when the filesystem is predicted to +run out of available inodes within four hours. 
+ +## Impact + +A node's filesystem becoming full can have a far-reaching impact, as it may +cause any or all of the applications scheduled to that node to experience +anything from performance degradation to full inoperability. Depending on the +node and filesystem involved, this could pose a critical threat to the stability +of the cluster. + +## Diagnosis + +Note the `instance` and `mountpoint` labels from the alert. You can graph the +usage history of this filesystem with the following query in the OpenShift web +console: + +```text +node_filesystem_files_free{ + instance="", + mountpoint="" +} +``` + +You can also open a debug session on the node and use the standard Linux +utilities to locate the source of the usage: + +```console +$ MOUNT_POINT='' +$ NODE_NAME='' + +$ oc debug "node/$NODE_NAME" +$ df -hi "/host/$MOUNT_POINT" +``` + +Note that in many cases a filesystem running out of inodes will still have +available storage. Running out of inodes is often caused by many small +files being created by an application. + +## Mitigation + +The number of inodes allocated to a filesystem is usually based on the storage +size. You may be able to solve the problem, or buy time, by increasing the size of +the storage volume. Otherwise, determine the application that is creating large +numbers of files and adjust its configuration or provide it dedicated storage. + +[1]: ./NodeFilesystemSpaceFillingUp.md diff --git a/content/runbooks/node/NodeFilesystemSpaceFillingUp.md b/content/runbooks/node/NodeFilesystemSpaceFillingUp.md new file mode 100644 index 0000000..54f7ed2 --- /dev/null +++ b/content/runbooks/node/NodeFilesystemSpaceFillingUp.md @@ -0,0 +1,62 @@ +# NodeFilesystemSpaceFillingUp + +## Meaning + +This alert is based on an extrapolation of the space used in a file system. It +fires if both the current usage is above a certain threshold _and_ the +extrapolation predicts that the filesystem will run out of space within a certain time. 
This is a +warning-level alert if that time is less than 24h. It's a critical alert if that +time is less than 4h. + +## Impact + +A filesystem running full is very bad for any process that needs to write to the +filesystem. But even before a filesystem runs full, performance usually +degrades. + +## Diagnosis + +Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic +pattern of writing and cleaning up can trick the linear prediction into a false +alert. Use the usual OS tools to investigate what directories are the worst +and/or recent offenders. Is this some irregular condition, e.g. a process that fails +to clean up after itself, or is this organic growth? If monitoring is enabled, +the following metric can be watched in PromQL. + +```console +node_filesystem_free_bytes +``` + +Check the alert's `mountpoint` label. + +## Mitigation + +If the `mountpoint` label is `/`, `/sysroot` or `/var`, +removing unused images solves the issue: + +Debug the node by accessing the node filesystem: + +```console +$ NODE_NAME= +$ kubectl -n default debug node/$NODE_NAME +$ chroot /host +``` + +Remove dangling images: + +```console +# TODO: Command needed +``` + +Remove unused images: + +```console +# TODO: Command needed +``` + +Exit debug: + +```console +$ exit +$ exit +``` diff --git a/content/runbooks/node/NodeRAIDDegraded.md b/content/runbooks/node/NodeRAIDDegraded.md new file mode 100644 index 0000000..4adef9c --- /dev/null +++ b/content/runbooks/node/NodeRAIDDegraded.md @@ -0,0 +1,31 @@ +# NodeRAIDDegraded + +## Meaning + +This alert is triggered when a node has a storage configuration with a RAID array, +and the array is reporting as being in a degraded state due to one or more disk +failures. + +## Impact + +The affected node could go offline at any moment if the RAID array fully fails +due to further issues with disks. 
+ +## Diagnosis + +You can open a shell on the node and use the standard Linux utilities to +diagnose the issue, but you may need to install additional software in the debug +container: + +```console +$ NODE_NAME='' + +$ oc debug "node/$NODE_NAME" +$ cat /proc/mdstat +``` + +## Mitigation + +See the Red Hat Enterprise Linux [documentation][1] for potential steps. + +[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices diff --git a/content/runbooks/prometheus/PrometheusTargetSyncFailure.md b/content/runbooks/prometheus/PrometheusTargetSyncFailure.md new file mode 100644 index 0000000..e38062d --- /dev/null +++ b/content/runbooks/prometheus/PrometheusTargetSyncFailure.md @@ -0,0 +1,30 @@ +# PrometheusTargetSyncFailure + +## Meaning + +This alert is triggered when at least one of the Prometheus instances has +consistently failed to sync its configuration. + +## Impact + +Metrics and alerts may be missing or inaccurate. + +## Diagnosis + +Determine whether the alert is for the cluster or user workload Prometheus by +inspecting the alert's `namespace` label. + +Check the logs for the appropriate Prometheus instance: + +```console +$ NAMESPACE='' + +$ oc -n $NAMESPACE logs -l 'app=prometheus' +level=error ... msg="Creating target failed" ... +``` + +## Mitigation + +If the logs indicate a syntax or other configuration error, correct the +corresponding `ServiceMonitor`, `PodMonitor`, or other configuration +resource. In almost all cases, the operator should prevent this from happening. 
From 89112c890c308a1bc28b7b3dc25f38df1c4f8878 Mon Sep 17 00:00:00 2001 From: paulfantom Date: Fri, 5 Nov 2021 12:31:57 +0100 Subject: [PATCH 2/9] group etcd alerts in one section --- content/runbooks/etcd/_index.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 content/runbooks/etcd/_index.md diff --git a/content/runbooks/etcd/_index.md b/content/runbooks/etcd/_index.md new file mode 100644 index 0000000..7ea2710 --- /dev/null +++ b/content/runbooks/etcd/_index.md @@ -0,0 +1,7 @@ +--- +title: etcd +bookCollapseSection: true +bookFlatSection: true +weight: 10 +--- + From f6f31eba186e4eb48b1caf9ce52c64956200e2d3 Mon Sep 17 00:00:00 2001 From: Tigran Tch Date: Thu, 11 Nov 2021 14:28:45 +0100 Subject: [PATCH 3/9] add doc for AlertmanagerClusterFailedToSendAlerts --- .../AlertmanagerClusterFailedToSendAlerts.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md diff --git a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md new file mode 100644 index 0000000..f70b63a --- /dev/null +++ b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md @@ -0,0 +1,22 @@ +--- +title: Alertmanager Cluster Failed To Send Alerts +weight: 20 +--- + +# Alertmanager Cluster Failed To Send Alerts + +## Meaning + +All instances failed to send notification to a critical integration. + +## Impact + +You will not receive notification when an alert is raised. + +## Diagnosis + +No alerts are received at the integration level from the cluster. + +## Mitigation + +Depending on the integration, correct the integration with the faulty instance (network, authorization token, firewall...) 
\ No newline at end of file From a35d153274a124e2d90cd34f500e4e548e6e4ee0 Mon Sep 17 00:00:00 2001 From: Tigran Tch Date: Fri, 12 Nov 2021 09:01:57 +0100 Subject: [PATCH 4/9] same alert exists for non critical --- .../alertmanager/AlertmanagerClusterFailedToSendAlerts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md index f70b63a..de831e7 100644 --- a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md +++ b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md @@ -7,7 +7,7 @@ weight: 20 ## Meaning -All instances failed to send notification to a critical integration. +All instances failed to send notification to an integration. ## Impact From 2f88d7d32980686f7aed12a188dfe7e2bf7ebe93 Mon Sep 17 00:00:00 2001 From: Tigran Tch Date: Fri, 12 Nov 2021 12:08:47 +0100 Subject: [PATCH 5/9] add doc for AlertmanagerConfigInconsistent --- .../AlertmanagerConfigInconsistent.md | 24 +++++++++++++++++++ 1 file changed, 24 insertions(+) create mode 100644 content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md diff --git a/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md new file mode 100644 index 0000000..ce9cf86 --- /dev/null +++ b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md @@ -0,0 +1,24 @@ +--- +title: Alertmanager ConfigInconsistent +weight: 20 +--- + +# Alertmanager Config Inconsistent + +## Meaning + +The configuration between instances inside a cluster is inconsistent. + +## Impact + +Configuration inconsistency can be multiple and impact is hard to predict. +Nevertheless, most of the case the alert might be lost or routed to the incorrect integration. + +## Diagnosis + +Run a `diff` tool between all `alertmanager.yml` that are deployed to find what is wrong. 
+You could run a job within your CI to avoid this issue in the future. + +## Mitigation + +Delete the incorrect secret and deploy the correct one. From 7fb999b4bcbe92337336ab265096f0941ad1a2fe Mon Sep 17 00:00:00 2001 From: Tigran Tch <3153333+NargiT@users.noreply.github.com> Date: Fri, 12 Nov 2021 13:29:48 +0100 Subject: [PATCH 6/9] Update AlertmanagerClusterFailedToSendAlerts.md --- .../alertmanager/AlertmanagerClusterFailedToSendAlerts.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md index de831e7..2ecbb5c 100644 --- a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md +++ b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md @@ -3,7 +3,7 @@ title: Alertmanager Cluster Failed To Send Alerts weight: 20 --- -# Alertmanager Cluster Failed To Send Alerts +# AlertmanagerClusterFailedToSendAlerts ## Meaning @@ -19,4 +19,4 @@ No alerts are received at the integration level from the cluster. ## Mitigation -Depending on the integration, correct the integration with the faulty instance (network, authorization token, firewall...) \ No newline at end of file +Depending on the integration, correct the integration with the faulty instance (network, authorization token, firewall...) 
From 71e2271213225667afe9cb404ebae156c8ec2b9b Mon Sep 17 00:00:00 2001 From: Tigran Tch <3153333+NargiT@users.noreply.github.com> Date: Fri, 12 Nov 2021 13:30:09 +0100 Subject: [PATCH 7/9] Update AlertmanagerConfigInconsistent.md --- content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md index ce9cf86..3a507da 100644 --- a/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md +++ b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md @@ -3,7 +3,7 @@ title: Alertmanager ConfigInconsistent weight: 20 --- -# Alertmanager Config Inconsistent +# AlertmanagerConfigInconsistent ## Meaning From 3ac613a118593d8c1ea7c8e6c65e011b9c6a9613 Mon Sep 17 00:00:00 2001 From: Tigran Tch <3153333+NargiT@users.noreply.github.com> Date: Thu, 25 Nov 2021 16:27:21 +0100 Subject: [PATCH 8/9] Update content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md Co-authored-by: Drew Boswell --- content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md index 3a507da..3cc5e6b 100644 --- a/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md +++ b/content/runbooks/alertmanager/AlertmanagerConfigInconsistent.md @@ -12,7 +12,7 @@ The configuration between instances inside a cluster is inconsistent. ## Impact Configuration inconsistency can be multiple and impact is hard to predict. -Nevertheless, most of the case the alert might be lost or routed to the incorrect integration. +Nevertheless, in most cases the alert might be lost or routed to the incorrect integration. 
## Diagnosis From d9c25a2c6146340af15d88f7917346cabdd9a733 Mon Sep 17 00:00:00 2001 From: Tigran Tch <3153333+NargiT@users.noreply.github.com> Date: Thu, 25 Nov 2021 16:27:36 +0100 Subject: [PATCH 9/9] Update content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md Co-authored-by: Drew Boswell --- .../alertmanager/AlertmanagerClusterFailedToSendAlerts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md index 2ecbb5c..7e04b7e 100644 --- a/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md +++ b/content/runbooks/alertmanager/AlertmanagerClusterFailedToSendAlerts.md @@ -11,7 +11,7 @@ All instances failed to send notification to an integration. ## Impact -You will not receive notification when an alert is raised. +You will not receive a notification when an alert is raised. ## Diagnosis