krkn/docs/metrics.md at c200f0774fe0122b4bcc1b94286aeddd8b87999f

mirror of https://github.com/krkn-chaos/krkn.git synced 2026-02-14 18:10:00 +00:00

Tullio Sebastiani f2d7f88cb8 Krkn lib prometheus client + kube_burner references removed

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

2024-01-09 10:43:32 -05:00


Scraping and storing metrics for the run

In some cases, such as CI and automated test runs, the cluster state and metrics from a chaos test run must be stored long term so they can be reviewed after the cluster is terminated. To support this, Kraken can capture metrics for the duration of the scenarios defined in the config.

The metrics to capture are defined in a metrics profile, which Kraken uses to query Prometheus between the start and end timestamps of the run. Each run has a unique identifier (uuid), which is generated automatically if not set in the config. Enable this feature by setting the following in the config:

performance_monitoring:
    capture_metrics: True
    metrics_profile_path: config/metrics-aggregated.yaml
    prometheus_url:                                       # Prometheus URL/route. Obtained automatically on OpenShift; set it when the distribution is Kubernetes.
    prometheus_bearer_token:                              # Bearer token used to authenticate with Prometheus. Obtained automatically on OpenShift; set it when the distribution is Kubernetes.
    uuid:                                                 # UUID for the run. Generated automatically if not set.
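
To illustrate how the settings above could fit together, here is a minimal sketch that turns one profile query plus the run's start and end timestamps into a request against the standard Prometheus `query_range` HTTP endpoint. It is not Kraken's actual implementation; the helper names (`build_range_query_url`, `auth_headers`, `resolve_run_uuid`), the `30s` step, and the example URL are assumptions made for illustration only.

```python
import uuid
from urllib.parse import urlencode


def build_range_query_url(prometheus_url, query, start_ts, end_ts, step="30s"):
    """Build a Prometheus /api/v1/query_range URL for a single query.

    start_ts and end_ts are Unix timestamps marking the start and end of the run.
    The default step is an arbitrary choice for this sketch.
    """
    params = urlencode(
        {"query": query, "start": start_ts, "end": end_ts, "step": step}
    )
    return f"{prometheus_url.rstrip('/')}/api/v1/query_range?{params}"


def auth_headers(bearer_token):
    """Return the HTTP headers for bearer-token authentication, if a token is set."""
    return {"Authorization": f"Bearer {bearer_token}"} if bearer_token else {}


def resolve_run_uuid(configured_uuid=None):
    """Use the uuid from the config, or generate one when the field is empty."""
    return configured_uuid or str(uuid.uuid4())


# Hypothetical values standing in for the config and the run window.
url = build_range_query_url(
    "https://prometheus.example.com/", "up", 1700000000, 1700000600
)
headers = auth_headers("example-token")
run_uuid = resolve_run_uuid()
```

The resulting URL and headers can be passed to any HTTP client. Keeping URL construction separate from the request makes the query window easy to inspect before anything is sent to the cluster.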

Metrics profile

Two metrics profiles, metrics.yaml and metrics-aggregated.yaml, are shipped by default and can be tweaked to capture additional metrics during the run. For example, the following are the API server metrics:

metrics:
# API server
  - query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb!~"WATCH", subresource!="log"}[2m])) by (verb,resource,subresource,instance,le)) > 0
    metricName: API99thLatency

  - query: sum(irate(apiserver_request_total{apiserver="kube-apiserver",verb!="WATCH",subresource!="log"}[2m])) by (verb,instance,resource,code) > 0
    metricName: APIRequestRate

  - query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
    metricName: APIInflightRequests
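
When tweaking a profile, it can help to check that every entry carries both keys used in the example above, `query` and `metricName`. The following is a minimal sketch of such a check, assuming the profile has already been parsed from YAML into a Python dictionary. The `validate_profile` helper is hypothetical and not part of Kraken.

```python
def validate_profile(profile):
    """Return a list of human-readable problems found in a parsed metrics profile."""
    entries = profile.get("metrics") if isinstance(profile, dict) else None
    if not isinstance(entries, list) or not entries:
        return ["profile has no non-empty 'metrics' list"]

    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index} is not a mapping")
            continue
        # Each entry in the shipped examples pairs a PromQL query with a name.
        for key in ("query", "metricName"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {index} is missing '{key}'")
    return problems


# A profile equivalent to one entry from the example above.
good_profile = {
    "metrics": [
        {
            "query": "sum(apiserver_current_inflight_requests{}) by (request_kind) > 0",
            "metricName": "APIInflightRequests",
        }
    ]
}
# An entry with the name left out, to show what a reported problem looks like.
bad_profile = {"metrics": [{"query": "up"}]}

good_result = validate_profile(good_profile)
bad_result = validate_profile(bad_profile)
```

Here `good_result` is an empty list, while `bad_result` contains one message pointing at the entry that lacks `metricName`.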
