Expand SLOs profile to cover monitoring for more alerts

This commit: - Also sets appropriate severity to avoid false failures for the test cases especially given that theses are monitored during the chaos vs post chaos. Critical alerts are all monitored post chaos with few monitored during the chaos that represent overall health and performance of the service. - Renames Alerts to SLOs validation Metrics reference: f09a492b13/cmd/kube-burner/ocp-config/alerts.yml
2026-02-14 09:59:59 +00:00 · 2023-06-14 09:57:30 -04:00
parent 68dc17bc44
commit 0eb8d38596
3 changed files with 70 additions and 12 deletions
--- a/README.md
+++ b/README.md
@@ -94,8 +94,12 @@ Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chao
 Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes then into Elasticsearch to be able to store and evaluate the state of the runs long term. The indexed metrics can be visualized with the help of Grafana. It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus ( installed by default in OpenShift ) with the start and end timestamp of the run. Information on enabling and leveraging this feature can be found [here](docs/metrics.md).


-### Alerts
-In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with the Prometheus expressions to validate and alerts, exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics. Information on enabling and leveraging this feature can be found [here](docs/alerts.md).
+### SLOs validation during and post chaos
+- In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with the Prometheus expressions to validate and alerts, exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics. 
+- Kraken also provides ability to check if any critical alerts are firing in the cluster post chaos and pass/fail's. 
+
+Information on enabling and leveraging this feature can be found [here](docs/SLOs_validation.md)
+

 ### OCM / ACM integration

--- a/config/alerts
+++ b/config/alerts
@@ -1,11 +1,65 @@
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
-  description: 5 minutes avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
+# etcd
+
+- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[10m:]) > 0.01
+  description: 10 minutes avg. 99th etcd fsync latency on {{$labels.pod}} higher than 10ms. {{$value}}s
+  severity: warning
+
+- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[10m:]) > 1
+  description: 10 minutes avg. 99th etcd fsync latency on {{$labels.pod}} higher than 1s. {{$value}}s
  severity: error

- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))[5m:]) > 0.1
-  description: 5 minutes avg. etcd netowrk peer round trip on {{$labels.pod}} higher than 100ms {{$value}}
-  severity: info
+- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[2m]))[10m:]) > 0.03
+  description: 10 minutes avg. 99th etcd commit latency on {{$labels.pod}} higher than 30ms. {{$value}}s
+  severity: warning

- expr: increase(etcd_server_leader_changes_seen_total[2m]) > 0
+- expr: rate(etcd_server_leader_changes_seen_total[2m]) > 0
  description: etcd leader changes observed
+  severity: warning
+
+# API server
+- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"POST|PUT|DELETE|PATCH", subresource!~"log|exec|portforward|attach|proxy"}[2m])) by (le, resource, verb))[10m:]) > 1
+  description: 10 minutes avg. 99th mutating API call latency for {{$labels.verb}}/{{$labels.resource}} higher than 1 second. {{$value}}s
  severity: error
+
+- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"LIST|GET", subresource!~"log|exec|portforward|attach|proxy", scope="resource"}[2m])) by (le, resource, verb, scope))[5m:]) > 1
+  description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 1 second. {{$value}}s
+  severity: error
+
+- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"LIST|GET", subresource!~"log|exec|portforward|attach|proxy", scope="namespace"}[2m])) by (le, resource, verb, scope))[5m:]) > 5
+  description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 5 seconds. {{$value}}s
+  severity: error
+
+- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"LIST|GET", subresource!~"log|exec|portforward|attach|proxy", scope="cluster"}[2m])) by (le, resource, verb, scope))[5m:]) > 30
+  description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 30 seconds. {{$value}}s
+  severity: error
+
+# Control plane pods
+- expr: up{apiserver=~"kube-apiserver|openshift-apiserver"} == 0
+  description: "{{$labels.apiserver}} {{$labels.instance}} down"
+  severity: warning
+
+- expr: up{namespace=~"openshift-etcd"} == 0
+  description: "{{$labels.namespace}}/{{$labels.pod}} down"
+  severity: error
+
+- expr: up{namespace=~"openshift-.*(kube-controller-manager|scheduler|controller-manager|sdn|ovn-kubernetes|dns)"} == 0
+  description: "{{$labels.namespace}}/{{$labels.pod}} down"
+  severity: warning
+
+- expr: up{job=~"crio|kubelet"} == 0
+  description: "{{$labels.node}}/{{$labels.job}} down"
+  severity: warning
+
+- expr: up{job="ovnkube-node"} == 0
+  description: "{{$labels.instance}}/{{$labels.pod}} {{$labels.job}} down"
+  severity: warning
+
+# Service sync latency
+- expr: histogram_quantile(0.99, sum(rate(kubeproxy_network_programming_duration_seconds_bucket[2m])) by (le)) > 10
+  description: 99th Kubeproxy network programming latency higher than 10 seconds. {{$value}}s 
+  severity: warning
+
+# Prometheus alerts
+- expr: ALERTS{severity="critical", alertstate="firing"} > 0
+  description: Critical prometheus alert. {{$labels.alertname}}
+  severity: warning
--- a/docs/SLOs_validation.md
+++ b/docs/SLOs_validation.md
@@ -1,16 +1,16 @@
-## Alerts
+## SLOs validation

 Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports:

-###  Checking for critical alerts 
-If enabled, the check runs at the end of each scenario and Kraken exits in case critical alerts are firing to allow user to debug. You can enable it in the config:
+###  Checking for critical alerts post chaos 
+If enabled, the check runs at the end of each scenario ( post chaos ) and Kraken exits in case critical alerts are firing to allow user to debug. You can enable it in the config:

 ```
 performance_monitoring:
    check_critical_alerts: False                          # When enabled will check prometheus for critical alerts firing post chaos
 ```

-### Alerting based on the queries defined by the user
+### Validation and alerting based on the queries defined by the user during chaos
 Takes PromQL queries as input and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:

 ```