Tigran Tchougourian
2021-11-26 10:25:54 +01:00
21 changed files with 878 additions and 1 deletions

@@ -0,0 +1,22 @@
---
title: Alertmanager Cluster Failed To Send Alerts
weight: 20
---
# AlertmanagerClusterFailedToSendAlerts
## Meaning
All Alertmanager instances in the cluster failed to send notifications to an integration.
## Impact
You will not receive a notification when an alert is raised.
## Diagnosis
No alerts are received at the integration level from the cluster.
## Mitigation
Depending on the integration, correct the integration with the faulty instance (network, authorization token, firewall...)

@@ -0,0 +1,24 @@
---
title: Alertmanager ConfigInconsistent
weight: 20
---
# AlertmanagerConfigInconsistent
## Meaning
The configuration between instances inside a cluster is inconsistent.
## Impact
Configuration inconsistencies can take many forms, and their impact is hard to predict.
Nevertheless, in most cases alerts might be lost or routed to the wrong integration.
## Diagnosis
Run a `diff` tool across all deployed `alertmanager.yml` files to find the differences.
You could run a job within your CI to avoid this issue in the future.
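As a rough sketch of such a CI check, the function below hashes each rendered configuration file and succeeds only when all digests match. The function name `configs_consistent` and the file paths in the usage note are hypothetical, and the sketch assumes `sha256sum` from GNU coreutils is available on the CI runner.

```sh
# Hypothetical helper: succeed (exit 0) only if all given files are identical.
configs_consistent() {
  # Hash every file, keep only the digest column, and count distinct digests.
  distinct=$(sha256sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
  [ "$distinct" -eq 1 ]
}
```

A CI step could then run something like `configs_consistent am-0/alertmanager.yml am-1/alertmanager.yml || echo "configs differ"` and fall back to `diff` only when the check fails.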
## Mitigation
Delete the incorrect secret and deploy the correct one.

@@ -7,7 +7,9 @@ weight: 20
## Meaning
The alert `AlertmanagerFailedReload` is triggered when the Alertmanager instance
for the cluster monitoring stack has consistently failed to reload its
configuration for a certain period.
## Impact

@@ -0,0 +1,7 @@
---
title: etcd
bookCollapseSection: true
bookFlatSection: true
weight: 10
---

@@ -0,0 +1,81 @@
# etcdBackendQuotaLowSpace
## Meaning
This alert fires when the total existing DB size exceeds 95% of the maximum
DB quota. In Prometheus, the consumed space is represented by the metric
`etcd_mvcc_db_total_size_in_bytes`, and the DB quota size is defined by
`etcd_server_quota_backend_bytes`.
## Impact
In case the DB size exceeds the DB quota, no writes can be performed anymore on
the etcd cluster. This further prevents any updates in the cluster, such as the
creation of pods.
## Diagnosis
The following two approaches can be used for the diagnosis.
### CLI Checks
To run `etcdctl` commands, we need to `exec` into a shell in the `etcdctl`
container of any etcd pod.
```console
$ NAMESPACE="kube-etcd"
$ kubectl exec -it -c etcdctl -n $NAMESPACE $(kubectl get po -l app=etcd -oname -n $NAMESPACE | awk -F"/" 'NR==1{ print $2 }') -- sh
```
Validate that the `etcdctl` command is available:
```console
$ etcdctl version
```
`etcdctl` can be used to fetch the DB size of the etcd endpoints.
```console
$ etcdctl endpoint status -w table
```
### PromQL queries
Check the percentage consumption of etcd DB with the following query in the
metrics console:
```console
(etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100
```
Check the DB size in MB that can be reduced after defragmentation:
```console
(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
```
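To make the arithmetic behind these two queries concrete, here is a small sketch that performs the same calculations on raw byte counts, for example values copied from the metrics console. The function names are illustrative only and not part of etcd or Prometheus.

```sh
# Percentage of the backend quota consumed: (total size / quota) * 100.
db_usage_percent() {
  awk -v total="$1" -v quota="$2" 'BEGIN { printf "%.1f\n", total / quota * 100 }'
}

# Space in MiB that defragmentation could reclaim:
# (total size - size in use) / 1024 / 1024.
reclaimable_mib() {
  awk -v total="$1" -v in_use="$2" 'BEGIN { printf "%.1f\n", (total - in_use) / 1024 / 1024 }'
}
```

Both functions take byte values as arguments, in the same order as the metrics appear in the corresponding query.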
## Mitigation
### Capacity planning
If `etcd_mvcc_db_total_size_in_bytes` shows that you are growing close to
`etcd_server_quota_backend_bytes`, etcd has almost reached its maximum capacity
and it is time to start planning for a new cluster.
In the meantime, before the migration happens, you can use defrag to gain some time.
### Defrag
When the etcd DB size increases, we can defragment the existing etcd DB to
optimize DB consumption as described [here][etcdDefragmentation]. Run the
following command in all etcd pods.
```console
$ etcdctl defrag
```
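Running the command in every pod by hand can be scripted. The following is a minimal sketch, assuming the `app=etcd` label and the `etcdctl` container name used earlier in this runbook; the function name `defrag_all_members` is hypothetical. It processes one member at a time and stops at the first failure, because defragmentation blocks the member being defragmented while it runs.

```sh
# Hypothetical helper: defragment every etcd member in a namespace, sequentially.
defrag_all_members() {
  namespace="$1"
  # "-o name" yields entries such as "pod/etcd-0", which kubectl exec accepts.
  for pod in $(kubectl get pods -n "$namespace" -l app=etcd -o name); do
    echo "defragmenting $pod"
    # Abort on the first error so the remaining members are left untouched.
    kubectl exec -n "$namespace" -c etcdctl "$pod" -- etcdctl defrag || return 1
  done
}
```

It would be invoked as `defrag_all_members kube-etcd`, using the namespace from the diagnosis section.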
To validate, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed
above. More space should be available now.
[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/

@@ -0,0 +1,96 @@
# etcdGRPCRequestsSlow
## Meaning
This alert fires when the 99th percentile of etcd gRPC request latency is too high.
## Impact
When requests are too slow, they can lead to various scenarios like leader
election failure, slow reads and writes.
## Diagnosis
This could be the result of a slow disk (due to fragmented state) or CPU contention.
### Slow disk
One of the most common reasons for slow gRPC requests is disk. Checking disk
related metrics and dashboards should provide a clearer picture.
#### PromQL queries used to troubleshoot
Verify the value of how slow the etcd gRPC requests are by using the following
query in the metrics console:
```console
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
```
That result should give a rough timeline of when the issue started.
`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration, and `etcd_server_leader_changes_seen_total` reports the leader
changes. To rule out a slow disk and confirm that the disk is reasonably fast,
the 99th percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be
less than 10ms. Query in the metrics UI:
```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```
#### Console dashboards
In the OpenShift dashboard console, under the Observe section, select the etcd
dashboard. It contains both RPC rate and Disk Sync Duration dashboards,
which will assist with further diagnosis.
### Resource exhaustion
It can happen that etcd responds slower due to CPU resource exhaustion.
This was seen in some cases when one application was requesting too much CPU
which led to this alert firing for multiple methods.
Often, if this is the case, we also see
`etcd_disk_wal_fsync_duration_seconds_bucket` slower as well.
To confirm this is the cause of the slow requests, either:
1. In the OpenShift console, on the primary page under "Cluster utilization",
view the requested CPU vs. available.
2. Use the following PromQL query to see the top consumers of CPU:
```console
topk(25, sort_desc(
sum by (namespace) (
(
sum(avg_over_time(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod)
*
on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
)
*
on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
))
```
## Mitigation
### Fragmented state
In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.
```console
$ etcdctl defrag
```
To validate, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed
above. More space should be available now.
Further info on etcd best practices can be found in the [OpenShift docs
here][etcdPractices].
[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_

@@ -0,0 +1,55 @@
# etcdHighFsyncDurations
## Meaning
This alert fires when the 99th percentile of etcd disk fsync duration is too
high for 10 minutes.
## Impact
When this happens it can lead to various scenarios like leader election failure,
frequent leader elections, slow reads and writes.
## Diagnosis
This could be the result of fragmented state in etcd or simply of a slow disk.
### Slow disk
Checking disk related metrics and dashboards should provide a clearer
picture.
#### PromQL queries used to troubleshoot
`etcd_disk_wal_fsync_duration_seconds_bucket` reports the etcd disk fsync
duration, and `etcd_server_leader_changes_seen_total` reports the leader
changes. To rule out a slow disk and confirm that the disk is reasonably fast,
the 99th percentile of `etcd_disk_wal_fsync_duration_seconds_bucket` should be
less than 10ms. Query in the metrics UI:
```console
histogram_quantile(0.99, sum by (instance, le) (irate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])))
```
## Mitigation
### Fragmented state
In the case of a slow disk or when the etcd DB size increases, we can defragment
the existing etcd DB to optimize DB consumption as described
[here][etcdDefragmentation]. Run the following command in all etcd pods.
```console
$ etcdctl defrag
```
To validate, check the endpoint status of the etcd members to see the reduced
size of the etcd DB, using the same diagnostic approaches as listed
above. More space should be available now.
Further info on etcd best practices can be found in the [OpenShift docs
here][etcdPractices].
[etcdDefragmentation]: https://etcd.io/docs/v3.4.0/op-guide/maintenance/
[etcdPractices]: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_

@@ -0,0 +1,41 @@
# etcdHighNumberOfFailedGRPCRequests
## Meaning
This alert fires when at least 50% of etcd gRPC requests failed in the past 10
minutes.
## Impact
First establish which gRPC method is failing; this will be visible in the alert.
If it's not part of the alert, the following query will display the method and
etcd instance that have failing requests:
```sh
100 * sum without(grpc_type, grpc_code)
(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job="etcd"}[5m]))
/ sum without(grpc_type, grpc_code)
(rate(grpc_server_handled_total{job="etcd"}[5m])) > 5 and on()
(sum(cluster_infrastructure_provider{type!~"ipi|BareMetal"} == bool 1))
```
## Diagnosis
All the gRPC errors should also be logged in the logs of the respective etcd instance.
You can get the instance name from the alert that is firing or by running the
query detailed above. Those etcd instance logs should serve as further insight
into what is wrong.
To get logs of etcd containers either check the instance from the alert and
check logs directly or run the following:
```sh
NAMESPACE="kube-etcd"
kubectl logs -n $NAMESPACE -l app=etcd -c etcd
```
## Mitigation
Depending on the above diagnosis, the issue will most likely be described in the
error log line of either etcd or openshift-etcd-operator. The most likely causes
tend to be networking issues.

@@ -0,0 +1,65 @@
# etcdInsufficientMembers
## Meaning
This alert fires when there are fewer instances available than are needed by
etcd to be healthy.
## Impact
When etcd does not have a majority of instances available, the Kubernetes and
OpenShift APIs will reject read and write requests, and operations that preserve
the health of workloads cannot be performed.
## Diagnosis
This can occur when multiple control plane nodes are powered off or are unable
to connect to each other via the network. Check that all control plane nodes are
powered on and that network connections between each machine are functional.
Check any other critical, warning or info alerts firing that can assist with the
diagnosis.
Log in to the cluster and check whether any of the master nodes is in the
`NotReady` state.
```console
$ kubectl get nodes -l node-role.kubernetes.io/master=
```
### General etcd health
To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.
```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- sh
```
Validate that the `etcdctl` command is available:
```console
$ etcdctl version
```
Run the following command to get the health of etcd:
```console
$ etcdctl endpoint health -w table
```
## Mitigation
### Disaster and recovery
If an upgrade is in progress, the alert may resolve automatically after some
time when the master node comes up again. If MCO is not working on the master
node, check the cloud provider to verify whether the master node instances are
running. If you are running on AWS, an AWS instance retirement might require
a manual reboot of the master node.
As a last resort, if none of the above fixes the issue and the alert is still
firing, for etcd specific issues follow the steps described in the [disaster and
recovery docs][docs].
[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html

@@ -0,0 +1,68 @@
# etcdMembersDown
## Meaning
This alert fires when one or more etcd members go down and evaluates the
number of etcd members that are currently down. Often, this alert is observed
as part of a cluster upgrade when a master node is being upgraded and requires a
reboot.
## Impact
In etcd, a majority of (n/2)+1 members has to agree on membership changes or
key-value upgrade proposals. With this approach, a split-brain inconsistency can
be avoided. If only one member is down in a 3-member cluster, the cluster
can still make forward progress, because the quorum is 2 and 2
members are still alive. However, when more members are down, the cluster
becomes unrecoverable.
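The quorum arithmetic can be checked for any cluster size with a short sketch. The function names `quorum` and `fault_tolerance` are illustrative; the division is integer division, as in the (n/2)+1 formula above.

```sh
# Number of members that must agree: (n/2)+1, using integer division.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that may fail while the cluster keeps its quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For a 3-member cluster, `quorum 3` prints `2` and `fault_tolerance 3` prints `1`, matching the example in the text. A 4-member cluster also tolerates only one failure, which is why odd member counts are normally preferred.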
## Diagnosis
Log in to the cluster and check whether any of the master nodes is in the
`NotReady` state.
```console
$ kubectl get nodes -l node-role.kubernetes.io/master=
```
In case there is no upgrade going on, but there is a change in the
`machineconfig` for the master pool causing a rolling reboot of each master
node, this alert can be triggered as well. We can check if the
`machineconfiguration.openshift.io/state : Working` annotation is set for any of
the master nodes. This is the case when the [machine-config-operator
(MCO)](https://github.com/openshift/machine-config-operator) is working on it.
```console
$ kubectl get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
```
### General etcd health
To run `etcdctl` commands, we need to `exec` into the `etcdctl` container of any
etcd pod.
```console
$ kubectl exec -it -c etcdctl -n openshift-etcd $(kubectl get po -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- sh
```
Validate that the `etcdctl` command is available:
```console
$ etcdctl version
```
Run the following command to get the health of etcd:
```console
$ etcdctl endpoint health -w table
```
## Mitigation
If an upgrade is in progress, the alert may resolve automatically after some
time when the master node comes up again. If MCO is not working on the master
node, check the cloud provider to verify whether the master node instances are
running. If you are running on AWS, an AWS instance retirement might require
a manual reboot of the master node.

@@ -0,0 +1,42 @@
# etcdNoLeader
## Meaning
This alert is triggered when the etcd cluster does not have a leader for more
than 1 minute.
## Impact
When there is no leader, the Kubernetes API will not be able to work
as expected and the cluster cannot process any writes or reads; any write
requests are queued for processing until a new leader is elected. Operations
that preserve the health of the workloads cannot be performed.
## Diagnosis
### Control plane nodes issue
This can occur when multiple control plane nodes are powered off or are unable
to connect to each other via the network. Check that all control plane nodes are
powered on and that network connections between each machine are functional.
### Slow disk issue
Another potential cause could be a slow disk. Inspect the `Disk Sync
Duration` dashboard, as well as the `Total Leader Elections Per Day` dashboard,
to get more insight and help with diagnosis.
### Other
Check the logs of the etcd containers to see any further information and to
verify that etcd does not have a leader. Logs should contain something like
`etcdserver: no leader`.
## Mitigation
### Disaster and recovery
Follow the steps described in the [disaster and recovery docs][docs].
[docs]: https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/about-disaster-recovery.html

@@ -0,0 +1,35 @@
# KubeAPIDown
## Meaning
The `KubeAPIDown` alert is triggered when all Kubernetes API servers have not
been reachable by the monitoring system for more than 15 minutes.
## Impact
This is a critical alert. The Kubernetes API is not responding. The
cluster may be partially or fully non-functional.
## Diagnosis
Check the status of the API server targets in the Prometheus UI.
Then, confirm whether the API is also unresponsive for you:
```console
$ kubectl cluster-info
```
If you can still reach the API server, there may be a network issue between the
Prometheus instances and the API server pods. Check the status of the API server
pods.
```console
$ kubectl -n kube-system get pods
$ kubectl -n kube-system logs -l 'app=kube-apiserver'
```
## Mitigation
If you can still reach the API server intermittently, you may be able to treat
this like any other failing deployment. If not, you may have to refer
to the disaster recovery documentation.

@@ -0,0 +1,39 @@
# KubeNodeNotReady
## Meaning
The `KubeNodeNotReady` alert fires when a Kubernetes node is not in the `Ready`
state for a certain period. In this case, the node is not able to host any new
pods as described [here][KubeNode].
## Impact
The performance of the cluster deployments is affected, depending on the overall
workload and the type of the node.
## Diagnosis
The notification details should list the node that's not ready. For example:
```txt
- alertname = KubeNodeNotReady
...
- node = node1.example.com
...
```
Log in to the cluster. Check the status of that node:
```console
$ kubectl get node $NODE -o yaml
```
The output should describe why the node isn't ready (e.g.: timeouts reaching the
API or kubelet).
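The relevant part of that YAML is the `.status.conditions` list. As a sketch, the helper below prints only the type, status, reason and message of each condition using `kubectl`'s JSONPath output. The function name `node_conditions` is illustrative and not part of `kubectl`.

```sh
# Illustrative helper: print one tab-separated line per node condition.
node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

Calling `node_conditions node1.example.com` lists conditions such as `Ready`, and the message column usually states why a condition is not met.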
## Mitigation
Once the problem that prevented the node from being replaced is resolved,
the instance should be terminated.
[KubeNode]: https://kubernetes.io/docs/concepts/architecture/nodes/#condition

@@ -0,0 +1,38 @@
# KubeletDown
## Meaning
This alert is triggered when the monitoring system has not been able to reach
any of the cluster's Kubelets for more than 15 minutes.
## Impact
This alert represents a critical threat to the cluster's stability. Excluding
the possibility of a network issue preventing the monitoring system from
scraping Kubelet metrics, multiple nodes in the cluster are likely unable to
respond to configuration changes for pods and other resources, and some
debugging tools are likely not functional, e.g. `kubectl exec` and `kubectl logs`.
## Diagnosis
Check the status of nodes and for recent events on `Node` objects, or for recent
events in general:
```console
$ kubectl get nodes
$ kubectl describe node $NODE_NAME
$ kubectl get events --field-selector 'involvedObject.kind=Node'
$ kubectl get events
```
If you have SSH access to the nodes, access the logs for the Kubelet directly:
```console
$ journalctl -b -f -u kubelet.service
```
## Mitigation
The mitigation depends on what is causing the Kubelets to become
unresponsive. Check for widespread networking issues or node-level
configuration issues.

@@ -0,0 +1,33 @@
# NodeFileDescriptorLimit
## Meaning
This alert is triggered when a node's kernel is found to be running out of
available file descriptors -- a `warning` level alert at greater than 70% usage
and a `critical` level alert at greater than 90% usage.
## Impact
Applications on the node may no longer be able to open and operate on
files. This is likely to have severe consequences for anything scheduled on this
node.
## Diagnosis
You can open a shell on the node and use the standard Linux utilities to
diagnose the issue:
```console
$ NODE_NAME='<value of instance label from alert>'
$ oc debug "node/$NODE_NAME"
# sysctl -a | grep 'fs.file-'
fs.file-max = 1597016
fs.file-nr = 7104 0 1597016
# lsof -n
```
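The three fields of `fs.file-nr` are the number of allocated file handles, the number of allocated but unused handles, and the system-wide maximum. The sketch below turns such a line into a usage percentage for comparison with the 70% and 90% thresholds; the function name `fd_usage_percent` is hypothetical.

```sh
# Hypothetical helper: usage percentage from the content of /proc/sys/fs/file-nr,
# which has the form "<allocated> <unused> <maximum>".
fd_usage_percent() {
  echo "$1" | awk '{ printf "%.2f\n", $1 / $3 * 100 }'
}
```

On the node it could be called as `fd_usage_percent "$(cat /proc/sys/fs/file-nr)"`. With the sample values shown above (`7104 0 1597016`) the result is well below either threshold.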
## Mitigation
Reduce the number of files opened simultaneously by either adjusting application
configuration or by moving some applications to other nodes.

@@ -0,0 +1,27 @@
# NodeFilesystemAlmostOutOfFiles
## Meaning
This alert is similar to the [NodeFilesystemFilesFillingUp][1] alert, but rather
than being based on a prediction that a filesystem will run out of inodes in a
certain amount of time, it uses simple static thresholds. The alert will fire
at a `warning` level at 5% of available inodes left, and at a `critical` level
with 3% of available inodes left.
## Impact
A node's filesystem becoming full can have a far reaching impact, as it may
cause any or all of the applications scheduled to that node to experience
anything from performance degradation to full inoperability. Depending on the
node and filesystem involved, this could pose a critical threat to the stability
of the cluster.
## Diagnosis
Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
## Mitigation
Refer to the [NodeFilesystemFilesFillingUp][1] runbook.
[1]: ./NodeFilesystemFilesFillingUp.md

@@ -0,0 +1,26 @@
# NodeFilesystemAlmostOutOfSpace
## Meaning
This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but rather
than being based on a prediction that a filesystem will become full in a certain
amount of time, it uses simple static thresholds. The alert will fire at a
`warning` level at 5% space left, and at a `critical` level with 3% space left.
## Impact
A node's filesystem becoming full can have a far reaching impact, as it may
cause any or all of the applications scheduled to that node to experience
anything from performance degradation to full inoperability. Depending on the
node and filesystem involved, this could pose a critical threat to the stability
of the cluster.
## Diagnosis
Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.
## Mitigation
Refer to the [NodeFilesystemSpaceFillingUp][1] runbook.
[1]: ./NodeFilesystemSpaceFillingUp.md

@@ -0,0 +1,53 @@
# NodeFilesystemFilesFillingUp
## Meaning
This alert is similar to the [NodeFilesystemSpaceFillingUp][1] alert, but
predicts the filesystem will run out of inodes rather than bytes of storage
space. The alert fires at a `critical` level when the filesystem is predicted to
run out of available inodes within four hours.
## Impact
A node's filesystem becoming full can have a far reaching impact, as it may
cause any or all of the applications scheduled to that node to experience
anything from performance degradation to full inoperability. Depending on the
node and filesystem involved, this could pose a critical threat to the stability
of the cluster.
## Diagnosis
Note the `instance` and `mountpoint` labels from the alert. You can graph the
usage history of this filesystem with the following query in the OpenShift web
console:
```text
node_filesystem_files_free{
instance="<value of instance label from alert>",
mountpoint="<value of mountpoint label from alert>"
}
```
You can also open a debug session on the node and use the standard Linux
utilities to locate the source of the usage:
```console
$ MOUNT_POINT='<value of mountpoint label from alert>'
$ NODE_NAME='<value of instance label from alert>'
$ oc debug "node/$NODE_NAME"
$ df -hi "/host/$MOUNT_POINT"
```
Note that in many cases a filesystem running out of inodes will still have
available storage. Running out of inodes is often caused by many small
files being created by an application.
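To locate where those small files accumulate, one option is to count filesystem entries per top-level directory under the mount point. The following is a minimal sketch; the function name `inode_hogs` is hypothetical, hidden directories are skipped by the glob, and `-xdev` keeps `find` on the filesystem being inspected.

```sh
# Hypothetical helper: count entries below each immediate subdirectory of $1
# and list the directories with the most entries first.
inode_hogs() {
  for dir in "$1"/*/; do
    # Each entry (file, directory, symlink) consumes one inode.
    printf '%s %s\n' "$(find "$dir" -xdev | wc -l | tr -d ' ')" "$dir"
  done | sort -rn
}
```

Inside the debug session this could be run as `inode_hogs "/host/$MOUNT_POINT" | head`, then repeated on the largest directory to narrow the search.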
## Mitigation
The number of inodes allocated to a filesystem is usually based on the storage
size. You may be able to solve the problem, or buy time, by increasing the size
of the storage volume. Otherwise, determine the application that is creating
large numbers of files and adjust its configuration or provide it dedicated storage.
[1]: ./NodeFilesystemSpaceFillingUp.md

@@ -0,0 +1,62 @@
# NodeFilesystemSpaceFillingUp
## Meaning
This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts that the filesystem will run out of space within a
certain time. This is a warning-level alert if that time is less than 24h. It's
a critical alert if that time is less than 4h.
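The idea of the extrapolation can be illustrated with a simplified two-point version. This is only a sketch of the principle, not the expression the alert rule actually evaluates, and the function name `hours_until_full` is hypothetical. It takes the free space at an earlier point, the free space now, and the seconds elapsed between the two samples.

```sh
# Hypothetical helper: linear estimate of the hours left until free space hits zero.
hours_until_full() {
  awk -v before="$1" -v now="$2" -v elapsed="$3" 'BEGIN {
    rate = (before - now) / elapsed        # space consumed per second
    if (rate <= 0) { print "inf"; exit }   # usage is flat or shrinking
    printf "%.1f\n", now / rate / 3600
  }'
}
```

A filesystem that had 30 units free an hour ago and has 27 now is projected to fill up in 9 hours (`hours_until_full 30 27 3600`), which would fall under the 24h warning window but not the 4h critical one.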
## Impact
A filesystem running full is very bad for any process that needs to write to the
filesystem. But even before a filesystem runs full, performance usually
degrades.
## Diagnosis
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert. Use the usual OS tools to investigate what directories are the worst
and/or recent offenders. Is this some irregular condition, e.g. a process fails
to clean up behind itself or is this organic growth? If monitoring is enabled,
the following metric can be watched in PromQL.
```console
node_filesystem_free_bytes
```
Check the alert's `mountpoint` label.
## Mitigation
If the `mountpoint` label is `/`, `/sysroot` or `/var`, removing unused
images can solve the issue.
Debug the node by accessing the node filesystem:
```console
$ NODE_NAME='<instance label from alert>'
$ kubectl -n default debug node/$NODE_NAME
$ chroot /host
```
Remove dangling images:
```console
# TODO: Command needed
```
Remove unused images:
```console
# TODO: Command needed
```
Exit debug:
```console
$ exit
$ exit
```

@@ -0,0 +1,31 @@
# NodeRAIDDegraded
## Meaning
This alert is triggered when a node has a storage configuration with a RAID
array, and the array is reporting as being in a degraded state due to one or
more disk failures.
## Impact
The affected node could go offline at any moment if the RAID array fully fails
due to further issues with disks.
## Diagnosis
You can open a shell on the node and use the standard Linux utilities to
diagnose the issue, but you may need to install additional software in the debug
container:
```console
$ NODE_NAME='<value of instance label from alert>'
$ oc debug "node/$NODE_NAME"
$ cat /proc/mdstat
```
## Mitigation
See the Red Hat Enterprise Linux [documentation][1] for potential steps.
[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/managing-raid_managing-storage-devices

@@ -0,0 +1,30 @@
# PrometheusTargetSyncFailure
## Meaning
This alert is triggered when at least one of the Prometheus instances has
consistently failed to sync its configuration.
## Impact
Metrics and alerts may be missing or inaccurate.
## Diagnosis
Determine whether the alert is for the cluster or user workload Prometheus by
inspecting the alert's `namespace` label.
Check the logs for the appropriate Prometheus instance:
```console
$ NAMESPACE='<value of namespace label from alert>'
$ oc -n $NAMESPACE logs -l 'app=prometheus'
level=error ... msg="Creating target failed" ...
```
## Mitigation
If the logs indicate a syntax or other configuration error, correct the
corresponding `ServiceMonitor`, `PodMonitor`, or other configuration
resource. In almost all cases, the operator should prevent this from happening.