mirror of
https://github.com/prometheus-operator/runbooks.git
synced 2026-05-21 14:22:46 +00:00
merge
@@ -0,0 +1,22 @@
|
||||
---
|
||||
title: Alertmanager Cluster Failed To Send Alerts
|
||||
weight: 20
|
||||
---
|
||||
|
||||
# AlertmanagerClusterFailedToSendAlerts
|
||||
|
||||
## Meaning
|
||||
|
||||
All Alertmanager instances in the cluster failed to send notifications to an integration.
|
||||
|
||||
## Impact
|
||||
|
||||
You will not receive a notification when an alert is raised.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
No alerts are received at the integration level from the cluster.
|
||||
|
||||
## Mitigation
|
||||
|
||||
Depending on the integration, fix whatever prevents the faulty instances from reaching it (network, authorization token, firewall, ...).
|
||||
@@ -0,0 +1,24 @@
|
||||
---
|
||||
title: Alertmanager ConfigInconsistent
|
||||
weight: 20
|
||||
---
|
||||
|
||||
# AlertmanagerConfigInconsistent
|
||||
|
||||
## Meaning
|
||||
|
||||
The configuration between instances inside a cluster is inconsistent.
|
||||
|
||||
## Impact
|
||||
|
||||
Configuration inconsistencies can take many forms, so the impact is hard to predict.
Nevertheless, in most cases alerts might be lost or routed to the incorrect integration.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Run a `diff` tool across all deployed `alertmanager.yml` files to find where they differ.
You could also run such a check as a job in your CI to avoid this issue in the future.
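
As a minimal sketch of such a check, assuming the configuration files have already been exported to local files, the comparison could look like the following. The function name `configs_consistent` and the file names in the usage note are illustrative and not part of Alertmanager.

```shell
# Compare every given config file against the first one.
# Prints each file that differs; returns 1 if any file differs, 0 otherwise.
configs_consistent() {
  reference="$1"
  shift
  status=0
  for candidate in "$@"; do
    if ! diff -q "$reference" "$candidate" >/dev/null 2>&1; then
      echo "differs from ${reference}: ${candidate}"
      status=1
    fi
  done
  return "$status"
}
```

For example, `configs_consistent am-0.yml am-1.yml am-2.yml` exits non-zero as soon as one instance carries a different configuration, which makes it usable as a CI gate.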
|
||||
|
||||
## Mitigation
|
||||
|
||||
Delete the incorrect secret and deploy the correct one.
|
||||
@@ -7,7 +7,9 @@ weight: 20
|
||||
|
||||
## Meaning
|
||||
|
||||
The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance for the cluster monitoring stack has consistently failed to reload its configuration for a certain period.
|
||||
The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance
|
||||
for the cluster monitoring stack has consistently failed to reload its
|
||||
configuration for a certain period.
|
||||
|
||||
## Impact
|
||||
|
||||
|
||||
7
content/runbooks/etcd/_index.md
Normal file
@@ -0,0 +1,7 @@
|
||||
---
|
||||
title: etcd
|
||||
bookCollapseSection: true
|
||||
bookFlatSection: true
|
||||
weight: 10
|
||||
---
|
||||
|
||||
81
content/runbooks/etcd/etcdBackendQuotaLowSpace.md
Normal file
@@ -0,0 +1,81 @@
|
||||
# etcdBackendQuotaLowSpace
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert fires when the total existing DB size exceeds 95% of the maximum
DB quota. In Prometheus, the consumed space is represented by the metric
`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by
`etcd_server_quota_backend_bytes`.
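
To make the threshold concrete, here is a small sketch of the same arithmetic in shell. The function names are illustrative, and the byte values in the usage note are made-up sample numbers rather than readings from a real cluster.

```shell
# Integer percentage of the quota that the DB currently uses.
# $1: DB size in bytes, $2: quota in bytes.
db_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Succeeds when usage is strictly above the 95% threshold from the alert.
quota_low_space() {
  [ "$(( $1 * 100 ))" -gt "$(( $2 * 95 ))" ]
}
```

With a quota of 2147483648 bytes (2 GiB), a DB size of 1900000000 bytes gives `db_usage_percent` of 88 and does not cross the threshold, while 2100000000 bytes does.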
|
||||
|
||||
## Impact
|
||||
|
||||
In case the DB size exceeds the DB quota, no writes can be performed anymore on
|
||||
the etcd cluster. This further prevents any updates in the cluster, such as the
|
||||
creation of pods.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
The following two approaches can be used for the diagnosis.
|
||||
|
||||
### CLI Checks
|
||||
|
||||
To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ NAMESPACE="kube-etcd"
$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- sh
```
|
||||
|
||||
Validate that the `etcdctl` command is available:
|
||||
|
||||
```console
|
||||
$ etcdctl version
|
||||
```
|
||||
|
||||
`etcdctl` can be used to fetch the DB size of the etcd endpoints.
|
||||
|
||||
```console
|
||||
$ etcdctl endpoint status -w table
|
||||
```
|
||||
|
||||
### PromQL queries
|
||||
|
||||
Check the percentage consumption of etcd DB with the following query in the
|
||||
metrics console:
|
||||
|
||||
```console
|
||||
(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100
|
||||
```
|
||||
|
||||
Check the DB size in MB that can be reduced after defragmentation:
|
||||
|
||||
```console
|
||||
(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Capacity planning
|
||||
|
||||
If `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to
`etcd_server_quota_backend_bytes`, etcd has almost reached its maximum capacity
and it's time to start planning for a new cluster.
|
||||
|
||||
In the meantime before migration happens, you can use defrag to gain some time.
|
||||
|
||||
### Defrag
|
||||
|
||||
When the etcd DB size increases, we can defragment the existing etcd DB to optimize
DB consumption as described [here][etcdDefragmentation]. Run the following
command in all etcd pods.
|
||||
|
||||
```console
|
||||
$ etcdctl defrag
|
||||
```
|
||||
|
||||
As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.
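
Running the command in every pod by hand is error-prone, so the loop below is a minimal sketch of how it could be scripted from outside the pods. It assumes the `kube-etcd` namespace, the `app=etcd` label, and the `etcdctl` container name used in the diagnosis steps; `defrag_all_members` and `DEFRAG_PAUSE_SECONDS` are illustrative names, not part of any tool.

```shell
# Namespace of the etcd pods (assumption: same as in the diagnosis steps).
NAMESPACE="${NAMESPACE:-kube-etcd}"

# Defragment the etcd members one pod at a time and stop at the first failure.
defrag_all_members() {
  for pod in $(kubectl get po -l app=etcd -o name -n "$NAMESPACE"); do
    echo "defragmenting ${pod}"
    kubectl exec -n "$NAMESPACE" -c etcdctl "$pod" -- etcdctl defrag || return 1
    # Optional pause between members so that only one is busy at a time.
    sleep "${DEFRAG_PAUSE_SECONDS:-0}"
  done
}
```

Going through the members sequentially keeps the rest of the cluster responsive, since a member is blocked while it is being defragmented.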
|
||||
|
||||
[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
|
||||
96
content/runbooks/etcd/etcdGRPCRequestsSlow.md
Normal file
@@ -0,0 +1,96 @@
|
||||
# etcdGRPCRequestsSlow
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert fires when the 99th percentile latency of etcd gRPC requests is too high.
|
||||
|
||||
## Impact
|
||||
|
||||
When requests are too slow, they can lead to various scenarios like leader
|
||||
election failure, slow reads and writes.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
This could be the result of a slow disk (due to fragmented state) or CPU contention.
|
||||
|
||||
### Slow disk
|
||||
|
||||
One of the most common reasons for slow gRPC requests is the disk. Checking
disk-related metrics and dashboards should provide a clearer picture.
|
||||
|
||||
#### PromQL queries used to troubleshoot
|
||||
|
||||
Check how slow the etcd gRPC requests are by using the following
query in the metrics console:
|
||||
|
||||
```console
|
||||
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
|
||||
```
|
||||
That result should give a rough timeline of when the issue started.
|
||||
|
||||
`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
|
||||
duration, `etcd_server_leader_changes_seen_total` reports the leader changes. To
|
||||
rule out a slow disk and confirm that the disk is reasonably fast, 99th
|
||||
percentile of the `etcd_disk_wal_fsync_duration_seconds_bucket` should be less
|
||||
than 10ms. Query in metrics UI:
|
||||
|
||||
```console
|
||||
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
|
||||
```
|
||||
#### Console dashboards
|
||||
|
||||
In the OpenShift dashboard console under Observe section, select the etcd
|
||||
dashboard. There are both RPC rate as well as Disk Sync Duration dashboards
|
||||
which will assist with further issues.
|
||||
|
||||
### Resource exhaustion
|
||||
|
||||
It can happen that etcd responds slower due to CPU resource exhaustion.
|
||||
This was seen in some cases when one application was requesting too much CPU
|
||||
which led to this alert firing for multiple methods.
|
||||
|
||||
Often if this is the case, we also see
|
||||
`etcd_disk_wal_fsync_duration_seconds_bucket` slower as well.
|
||||
|
||||
To confirm this is the cause of the slow requests either:
|
||||
|
||||
1. In OpenShift console on primary page under "Cluster utilization" view the
|
||||
requested CPU vs available.
|
||||
|
||||
2. PromQL query is the following to see top consumers of CPU:
|
||||
|
||||
```console
|
||||
topk(25, sort_desc(
|
||||
sum by (namespace) (
|
||||
(
|
||||
sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod)
|
||||
*
|
||||
on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
|
||||
)
|
||||
*
|
||||
on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
|
||||
)
|
||||
))
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Fragmented state
|
||||
|
||||
In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.
|
||||
|
||||
```console
|
||||
$ etcdctl defrag
|
||||
```
|
||||
|
||||
As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.
|
||||
|
||||
Further info on etcd best practices can be found in the [OpenShift docs
|
||||
here][etcdPractices].
|
||||
|
||||
[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
|
||||
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
|
||||
55
content/runbooks/etcd/etcdHighFsyncDurations.md
Normal file
@@ -0,0 +1,55 @@
|
||||
# etcdHighFsyncDurations
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert fires when the 99th percentile of etcd disk fsync duration is too
|
||||
high for 10 minutes.
|
||||
|
||||
## Impact
|
||||
|
||||
When this happens it can lead to various scenarios like leader election failure,
|
||||
frequent leader elections, slow reads and writes.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
This could be the result of a slow disk, possibly due to fragmented state in
etcd or simply because the underlying disk is slow.
|
||||
|
||||
### Slow disk
|
||||
|
||||
Checking disk-related metrics and dashboards should provide a clearer
picture.
|
||||
|
||||
#### PromQL queries used to troubleshoot
|
||||
|
||||
`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
|
||||
duration, `etcd_server_leader_changes_seen_total` reports the leader changes. To
|
||||
rule out a slow disk and confirm that the disk is reasonably fast, 99th
|
||||
percentile of the `etcd_disk_wal_fsync_duration_seconds_bucket` should be less
|
||||
than 10ms. Query in metrics UI:
|
||||
|
||||
```console
|
||||
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Fragmented state
|
||||
|
||||
In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.
|
||||
|
||||
```console
|
||||
$ etcdctl defrag
|
||||
```
|
||||
|
||||
As validation, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed above. More
space should be available now.
|
||||
|
||||
Further info on etcd best practices can be found in the [OpenShift docs
|
||||
here][etcdPractices].
|
||||
|
||||
[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
|
||||
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_
|
||||
41
content/runbooks/etcd/etcdHighNumberOfFailedGRPCRequests.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# etcdHighNumberOfFailedGRPCRequests
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert fires when at least 50% of etcd gRPC requests failed in the past 10
|
||||
minutes.
|
||||
|
||||
## Impact
|
||||
|
||||
First establish which gRPC method is failing; this will be visible in the alert.
|
||||
If it's not part of the alert, the following query will display method and etcd
|
||||
instance that has failing requests:
|
||||
|
||||
```sh
|
||||
100 * sum without(grpc_type, grpc_code)
|
||||
(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m]))
|
||||
/ sum without(grpc_type, grpc_code)
|
||||
(rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on()
|
||||
(sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1))
|
||||
```
|
||||
|
||||
## Diagnosis
|
||||
|
||||
All the gRPC errors should also be logged in the logs of each respective etcd instance.
|
||||
You can get the instance name from the alert that is firing or by running the
|
||||
query detailed above. Those etcd instance logs should serve as further insight
|
||||
into what is wrong.
|
||||
|
||||
To get logs of etcd containers either check the instance from the alert and
|
||||
check logs directly or run the following:
|
||||
|
||||
```sh
|
||||
NAMESPACE="kube-etcd"
|
||||
kubectl logs -n $NAMESPACE -l app=etcd -c etcd
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
Depending on the above diagnosis, the issue will most likely be described in the
|
||||
error log line of either etcd or openshift-etcd-operator. Most likely causes
|
||||
tend to be networking issues.
|
||||
65
content/runbooks/etcd/etcdInsufficientMembers.md
Normal file
@@ -0,0 +1,65 @@
|
||||
# etcdInsufficientMembers
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert fires when there are fewer instances available than are needed by
|
||||
etcd to be healthy.
|
||||
|
||||
## Impact
|
||||
|
||||
When etcd does not have a majority of instances available the Kubernetes and
|
||||
OpenShift APIs will reject read and write requests and operations that preserve
|
||||
the health of workloads cannot be performed.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
This can occur when multiple control plane nodes are powered off or are unable to
connect to each other via the network. Check that all control plane nodes are
powered and that network connections between each machine are functional.
|
||||
|
||||
Check any other critical, warning or info alerts firing that can assist with the
|
||||
diagnosis.
|
||||
|
||||
Log in to the cluster. Check the health of the master nodes to see whether any of
them is in the `NotReady` state.
|
||||
|
||||
```console
|
||||
$ kubectl get nodes -l node-role.kubernetes.io/master=
|
||||
```
|
||||
|
||||
### General etcd health
|
||||
|
||||
To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- sh
```
|
||||
|
||||
Validate that the `etcdctl` command is available:
|
||||
|
||||
```console
|
||||
$ etcdctl version
|
||||
```
|
||||
|
||||
Run the following command to get the health of etcd:
|
||||
|
||||
```console
|
||||
$ etcdctl endpoint health -w table
|
||||
```
|
||||
## Mitigation
|
||||
|
||||
### Disaster and recovery
|
||||
|
||||
If an upgrade is in progress, the alert may automatically resolve in some time
|
||||
when the master node comes up again. If MCO is not working on the master node,
|
||||
check the cloud provider to verify if the master node instances are running or not.
|
||||
|
||||
In the case when you are running on AWS, the AWS instance retirement might need
|
||||
a manual reboot of the master node.
|
||||
|
||||
As a last resort if none of the above fix the issue and the alert is still
firing, for etcd specific issues follow the steps described in the [disaster and
recovery docs][docs].

[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html
|
||||
68
content/runbooks/etcd/etcdMembersDown.md
Normal file
@@ -0,0 +1,68 @@
|
||||
# etcdMembersDown
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert fires when one or more etcd members go down and evaluates the
|
||||
number of etcd members that are currently down. Often, this alert was observed
|
||||
as part of a cluster upgrade when a master node is being upgraded and requires a
|
||||
reboot.
|
||||
|
||||
## Impact
|
||||
|
||||
In etcd, a majority of (n/2)+1 members has to agree on membership changes or
key-value upgrade proposals. With this approach, a split-brain inconsistency can
be avoided. If only one member is down in a 3-member cluster, the cluster can
still make forward progress because the quorum is 2 and 2 members are still
alive. However, when more members are down, the cluster loses quorum and
becomes unrecoverable.
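
The quorum arithmetic can be sketched in a few lines of shell. The function names are illustrative; the division is integer division, which matches the (n/2)+1 rule above.

```shell
# Number of members that must agree: floor(n/2) + 1.
# $1: total number of etcd members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can be down while the cluster still has quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For a 3-member cluster this gives a quorum of 2 and one tolerated failure. A 4-member cluster also tolerates only one failure, which is why odd cluster sizes are generally preferred.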
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Log in to the cluster. Check the health of the master nodes to see whether any of
them is in the `NotReady` state.
|
||||
|
||||
```console
|
||||
$ kubectl get nodes -l node-role.kubernetes.io/master=
|
||||
```
|
||||
|
||||
In case there is no upgrade going on, but there is a change in the
|
||||
`machineconfig` for the master pool causing a rolling reboot of each master
|
||||
node, this alert can be triggered as well. We can check if the
|
||||
`machineconfiguration.openshift.io/state : Working` annotation is set for any of
|
||||
the master nodes. This is the case when the [machine-config-operator
|
||||
(MCO)](https://github.com/openshift/machine-config-operator) is working on it.
|
||||
|
||||
```console
|
||||
$ kubectl get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
|
||||
```
|
||||
|
||||
### General etcd health
|
||||
|
||||
To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.

```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- sh
```
|
||||
|
||||
Validate that the `etcdctl` command is available:
|
||||
|
||||
```console
|
||||
$ etcdctl version
|
||||
```
|
||||
|
||||
Run the following command to get the health of etcd:
|
||||
|
||||
```console
|
||||
$ etcdctl endpoint health -w table
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
If an upgrade is in progress, the alert may automatically resolve in some time
|
||||
when the master node comes up again. If MCO is not working on the master node,
|
||||
check the cloud provider to verify if the master node instances are running or not.
|
||||
|
||||
In the case when you are running on AWS, the AWS instance retirement might need
|
||||
a manual reboot of the master node.
|
||||
|
||||
42
content/runbooks/etcd/etcdNoLeader.md
Normal file
@@ -0,0 +1,42 @@
|
||||
# etcdNoLeader
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is triggered when the etcd cluster does not have a leader for more than 1
minute.
|
||||
|
||||
## Impact
|
||||
|
||||
When there is no leader, the Kubernetes API will not be able to work as
expected. The cluster cannot process any reads or writes, and any write
requests are queued for processing until a new leader is elected. Operations
that preserve the health of the workloads cannot be performed.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Control plane nodes issue
|
||||
|
||||
This can occur when multiple control plane nodes are powered off or are unable to
connect to each other via the network. Check that all control plane nodes are
powered and that network connections between each machine are functional.
|
||||
|
||||
### Slow disk issue
|
||||
|
||||
Another potential cause could be a slow disk. Inspect the `Disk Sync
Duration` dashboard, as well as `Total Leader Elections Per Day`, to get more
insight and help with diagnosis.
|
||||
|
||||
### Other
|
||||
|
||||
Check the logs of the etcd containers for further information and to verify
that etcd does not have a leader. Logs should contain something like `etcdserver:
no leader`.
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Disaster and recovery
|
||||
|
||||
Follow the steps described in the [disaster and recovery docs][docs].

[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html
|
||||
35
content/runbooks/kubernetes/KubeAPIDown.md
Normal file
@@ -0,0 +1,35 @@
|
||||
# KubeAPIDown
|
||||
|
||||
## Meaning
|
||||
|
||||
The `KubeAPIDown` alert is triggered when all Kubernetes API servers have not
|
||||
been reachable by the monitoring system for more than 15 minutes.
|
||||
|
||||
## Impact
|
||||
|
||||
This is a critical alert. The Kubernetes API is not responding. The
cluster may be partially or fully non-functional.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Check the status of the API server targets in the Prometheus UI.
|
||||
|
||||
Then, confirm whether the API is also unresponsive for you:
|
||||
|
||||
```console
|
||||
$ kubectl cluster-info
|
||||
```
|
||||
|
||||
If you can still reach the API server, there may be a network issue between the
|
||||
Prometheus instances and the API server pods. Check the status of the API server
|
||||
pods.
|
||||
|
||||
```console
|
||||
$ kubectl -n kube-system get pods
|
||||
$ kubectl -n kube-system logs -l 'app=kube-apiserver'
|
||||
```
|
||||
## Mitigation
|
||||
|
||||
If you can still reach the API server intermittently, you may be able to treat this
like any other failing deployment. If not, it's possible you may have to refer
to the disaster recovery documentation.
|
||||
39
content/runbooks/kubernetes/KubeNodeNotReady.md
Normal file
@@ -0,0 +1,39 @@
|
||||
# KubeNodeNotReady
|
||||
|
||||
## Meaning
|
||||
|
||||
The `KubeNodeNotReady` alert fires when a Kubernetes node is not in the `Ready`
state for a certain period. In this case, the node is not able to host any new
pods as described [here][KubeNode].
|
||||
|
||||
## Impact
|
||||
|
||||
The performance of the cluster deployments is affected, depending on the overall
|
||||
workload and the type of the node.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
The notification details should list the node that's not ready. For example:
|
||||
|
||||
```txt
|
||||
- alertname = KubeNodeNotReady
|
||||
...
|
||||
- node = node1.example.com
|
||||
...
|
||||
```
|
||||
|
||||
Log in to the cluster. Check the status of that node:
|
||||
|
||||
```console
|
||||
$ kubectl get node $NODE -o yaml
|
||||
```
|
||||
|
||||
The output should describe why the node isn't ready (e.g.: timeouts reaching the
|
||||
API or kubelet).
|
||||
|
||||
## Mitigation
|
||||
|
||||
Once the problem that prevented the node from being replaced has been resolved,
the instance should be terminated.
|
||||
|
||||
[KubeNode]: https://kubernetes.io/docs/concepts/architecture/nodes/#condition
|
||||
38
content/runbooks/kubernetes/KubeletDown.md
Normal file
@@ -0,0 +1,38 @@
|
||||
# KubeletDown
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is triggered when the monitoring system has not been able to reach
|
||||
any of the cluster's Kubelets for more than 15 minutes.
|
||||
|
||||
## Impact
|
||||
|
||||
This alert represents a critical threat to the cluster's stability. Excluding
|
||||
the possibility of a network issue preventing the monitoring system from
|
||||
scraping Kubelet metrics, multiple nodes in the cluster are likely unable to
|
||||
respond to configuration changes for pods and other resources, and some
|
||||
debugging tools are likely not functional, e.g. `kubectl exec` and `kubectl logs`.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Check the status of nodes and for recent events on `Node` objects, or for recent
|
||||
events in general:
|
||||
|
||||
```console
|
||||
$ kubectl get nodes
|
||||
$ kubectl describe node $NODE_NAME
|
||||
$ kubectl get events --field-selector 'involvedObject.kind=Node'
|
||||
$ kubectl get events
|
||||
```
|
||||
|
||||
If you have SSH access to the nodes, access the logs for the Kubelet directly:
|
||||
|
||||
```console
|
||||
$ journalctl -b -f -u kubelet.service
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
The mitigation depends on what is causing the Kubelets to become
|
||||
unresponsive. Check for widespread networking issues, or node-level
|
||||
configuration issues.
|
||||
33
content/runbooks/node/NodeFileDescriptorLimit.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# NodeFileDescriptorLimit
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is triggered when a node's kernel is found to be running out of
|
||||
available file descriptors -- a `warning` level alert at greater than 70% usage
|
||||
and a `critical` level alert at greater than 90% usage.
|
||||
|
||||
## Impact
|
||||
|
||||
Applications on the node may no longer be able to open and operate on
|
||||
files. This is likely to have severe consequences for anything scheduled on this
|
||||
node.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
You can open a shell on the node and use the standard Linux utilities to
|
||||
diagnose the issue:
|
||||
|
||||
```console
|
||||
$ NODE_NAME='<value of instance label from alert>'
|
||||
|
||||
$ oc debug "node/$NODE_NAME"
|
||||
# sysctl -a | grep 'fs.file-'
|
||||
fs.file-max = 1597016
|
||||
fs.file-nr = 7104 0 1597016
|
||||
# lsof -n
|
||||
```
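
To relate the `fs.file-nr` output to the alert thresholds, the following sketch turns it into a percentage. It assumes the usage is the first field (allocated handles) divided by the third field (the maximum); the function name `fd_usage_percent` is illustrative.

```shell
# Percentage of the system-wide file handle limit that is allocated.
# $1: contents of fs.file-nr ("<allocated> <unused> <max>");
#     defaults to reading /proc/sys/fs/file-nr on the current host.
fd_usage_percent() {
  file_nr="${1:-$(cat /proc/sys/fs/file-nr)}"
  echo "$file_nr" | awk '{ printf "%.2f\n", $1 * 100 / $3 }'
}
```

Called as `fd_usage_percent '7104 0 1597016'` with the sample values above, this reports well under 1%, far from the 70% `warning` threshold.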
|
||||
|
||||
## Mitigation
|
||||
|
||||
Reduce the number of files opened simultaneously by either adjusting application
|
||||
configuration or by moving some applications to other nodes.
|
||||
27
content/runbooks/node/NodeFilesystemAlmostOutOfFiles.md
Normal file
@@ -0,0 +1,27 @@
|
||||
# NodeFilesystemAlmostOutOfFiles
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is similar to the [NodeFilesystemFilesFillingUp][1] alert, but rather
than being based on a prediction that a filesystem will run out of inodes in a
certain amount of time, it uses simple static thresholds. The alert will fire
at a `warning` level at 5% of available inodes left, and at a `critical` level
with 3% of available inodes left.
|
||||
|
||||
## Impact
|
||||
|
||||
A node's filesystem becoming full can have a far reaching impact, as it may
|
||||
cause any or all of the applications scheduled to that node to experience
|
||||
anything from performance degradation to full inoperability. Depending on the
|
||||
node and filesystem involved, this could pose a critical threat to the stability
|
||||
of the cluster.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
|
||||
|
||||
## Mitigation
|
||||
|
||||
Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
|
||||
|
||||
[1]: ./NodeFilesystemFilesFillingUp.md
|
||||
26
content/runbooks/node/NodeFilesystemAlmostOutOfSpace.md
Normal file
@@ -0,0 +1,26 @@
|
||||
# NodeFilesystemAlmostOutOfSpace
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
|
||||
than being based on a prediction that a filesystem will become full in a certain
|
||||
amount of time, it uses simple static thresholds. The alert will fire at a
|
||||
`warning` level at 5% space left, and at a `critical` level with 3% space left.
|
||||
|
||||
## Impact
|
||||
|
||||
A node's filesystem becoming full can have a far reaching impact, as it may
|
||||
cause any or all of the applications scheduled to that node to experience
|
||||
anything from performance degradation to full inoperability. Depending on the
|
||||
node and filesystem involved, this could pose a critical threat to the stability
|
||||
of the cluster.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.
|
||||
|
||||
## Mitigation
|
||||
|
||||
Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.
|
||||
|
||||
[1]: ./NodeFilesystemSpaceFillingUp.md
|
||||
53
content/runbooks/node/NodeFilesystemFilesFillingUp.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# NodeFilesystemFilesFillingUp
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but
|
||||
predicts the filesystem will run out of inodes rather than bytes of storage
|
||||
space. The alert fires at a `critical` level when the filesystem is predicted to
|
||||
run out of available inodes within four hours.
|
||||
|
||||
## Impact
|
||||
|
||||
A node's filesystem becoming full can have a far reaching impact, as it may
|
||||
cause any or all of the applications scheduled to that node to experience
|
||||
anything from performance degradation to full inoperability. Depending on the
|
||||
node and filesystem involved, this could pose a critical threat to the stability
|
||||
of the cluster.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Note the `instance` and `mountpoint` labels from the alert. You can graph the
|
||||
usage history of this filesystem with the following query in the OpenShift web
|
||||
console:
|
||||
|
||||
```text
|
||||
node_filesystem_files_free{
|
||||
instance="<value of instance label from alert>",
|
||||
mountpoint="<value of mountpoint label from alert>"
|
||||
}
|
||||
```
|
||||
|
||||
You can also open a debug session on the node and use the standard Linux
|
||||
utilities to locate the source of the usage:
|
||||
|
||||
```console
|
||||
$ MOUNT_POINT='<value of mountpoint label from alert>'
|
||||
$ NODE_NAME='<value of instance label from alert>'
|
||||
|
||||
$ oc debug "node/$NODE_NAME"
|
||||
$ df -hi "/host/$MOUNT_POINT"
|
||||
```
|
||||
|
||||
Note that in many cases a filesystem running out of inodes will still have
available storage. Running out of inodes is often caused by an application
creating a very large number of small files.
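
To find where the inodes are going, one option is to count the entries below each top-level directory of the mount point. This is only a sketch using `find`, `wc`, and `sort`; the function name `inode_usage_by_dir` is illustrative, and on a large filesystem the scan can take a while.

```shell
# Print "<entry count> <directory>" for each immediate subdirectory of $1,
# largest first. -xdev keeps find on the same filesystem.
inode_usage_by_dir() {
  root="${1%/}"
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    count="$(find "$dir" -xdev | wc -l | tr -d ' ')"
    printf '%s %s\n' "$count" "$dir"
  done | sort -rn
}
```

Inside the debug session this could be invoked as `inode_usage_by_dir "/host/$MOUNT_POINT"`, then repeated on the largest directory to drill down.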
|
||||
|
||||
## Mitigation
|
||||
|
||||
The number of inodes allocated to a filesystem is usually based on the storage
|
||||
size. You may be able to solve the problem, or buy time, by increasing the size of
|
||||
the storage volume. Otherwise, determine the application that is creating large
|
||||
numbers of files and adjust its configuration or provide it dedicated storage.
|
||||
|
||||
[1]: ./NodeFilesystemSpaceFillingUp.md
|
||||
62
content/runbooks/node/NodeFilesystemSpaceFillingUp.md
Normal file
@@ -0,0 +1,62 @@
|
||||
# NodeFilesystemSpaceFillingUp
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts that the filesystem will run out of space within a
certain time. This is a warning-level alert if that time is less than 24h. It's
a critical alert if that time is less than 4h.
|
||||
|
||||
## Impact
|
||||
|
||||
A full filesystem is very bad for any process that needs to write to the
filesystem. But even before a filesystem becomes full, performance usually
degrades.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
|
||||
pattern of writing and cleaning up can trick the linear prediction into a false
|
||||
alert. Use the usual OS tools to investigate what directories are the worst
|
||||
and/or recent offenders. Is this some irregular condition, e.g. a process fails
|
||||
to clean up behind itself or is this organic growth? If monitoring is enabled,
|
||||
the following metric can be watched in PromQL.
|
||||
|
||||
```console
|
||||
node_filesystem_free_bytes
|
||||
```
|
||||
|
||||
Check the alert's `mountpoint` label.
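
As one example of the "usual OS tools", the sketch below lists the largest directories directly under a given path using `du`, `sort`, and `head`. The function name `largest_dirs` is illustrative; the first line of the output is the total for the path itself.

```shell
# Print the largest immediate subdirectories of $1 (sizes in KiB), biggest first.
# $2: how many lines to show (default 10). -x stays on one filesystem.
largest_dirs() {
  du -x -k -d 1 "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

Running it on the path from the `mountpoint` label, and then again on the biggest entry, usually narrows the growth down to a specific directory.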
|
||||
|
||||
## Mitigation
|
||||
|
||||
If the `mountpoint` label is `/`, `/sysroot` or `/var`, removing unused images
may solve the issue.
|
||||
|
||||
Debug the node by accessing the node filesystem:
|
||||
|
||||
```console
|
||||
$ NODE_NAME=<instance label from alert>
|
||||
$ kubectl -n default debug node/$NODE_NAME
|
||||
$ chroot /host
|
||||
```
|
||||
|
||||
Remove dangling images:
|
||||
|
||||
```console
|
||||
# TODO: Command needed
|
||||
```
|
||||
|
||||
Remove unused images:
|
||||
|
||||
```console
|
||||
# TODO: Command needed
|
||||
```
|
||||
|
||||
Exit debug:
|
||||
|
||||
```console
|
||||
$ exit
|
||||
$ exit
|
||||
```
|
||||
31
content/runbooks/node/NodeRAIDDegraded.md
Normal file
@@ -0,0 +1,31 @@
|
||||
# NodeRAIDDegraded
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is triggered when a node has a storage configuration with a RAID array,
|
||||
and the array is reporting as being in a degraded state due to one or more disk
|
||||
failures.
|
||||
|
||||
## Impact
|
||||
|
||||
The affected node could go offline at any moment if the RAID array fully fails
|
||||
due to further issues with disks.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
You can open a shell on the node and use the standard Linux utilities to
|
||||
diagnose the issue, but you may need to install additional software in the debug
|
||||
container:
|
||||
|
||||
```console
|
||||
$ NODE_NAME='<value of instance label from alert>'
|
||||
|
||||
$ oc debug "node/$NODE_NAME"
|
||||
$ cat /proc/mdstat
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
See the Red Hat Enterprise Linux [documentation][1] for potential steps.
|
||||
|
||||
[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices
|
||||
30
content/runbooks/prometheus/PrometheusTargetSyncFailure.md
Normal file
@@ -0,0 +1,30 @@
|
||||
# PrometheusTargetSyncFailure
|
||||
|
||||
## Meaning
|
||||
|
||||
This alert is triggered when at least one of the Prometheus instances has
|
||||
consistently failed to sync its configuration.
|
||||
|
||||
## Impact
|
||||
|
||||
Metrics and alerts may be missing or inaccurate.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Determine whether the alert is for the cluster or user workload Prometheus by
|
||||
inspecting the alert's `namespace` label.
|
||||
|
||||
Check the logs for the appropriate Prometheus instance:
|
||||
|
||||
```console
|
||||
$ NAMESPACE='<value of namespace label from alert>'
|
||||
|
||||
$ oc -n $NAMESPACE logs -l 'app=prometheus'
|
||||
level=error ... msg="Creating target failed" ...
|
||||
```
|
||||
|
||||
## Mitigation
|
||||
|
||||
If the logs indicate a syntax or other configuration error, correct the
|
||||
corresponding `ServiceMonitor`, `PodMonitor`, or other configuration
|
||||
resource. In almost all cases, the operator should prevent this from happening.
|
||||