mirror of
https://github.com/krkn-chaos/krkn.git
synced 2026-02-18 20:09:55 +00:00
Compare commits
9 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
4f7c58106d | ||
|
|
a7e5ae6c80 | ||
|
|
aa030a21d3 | ||
|
|
631f12bdff | ||
|
|
2525982c55 | ||
|
|
9760d7d97d | ||
|
|
720488c159 | ||
|
|
487a9f464c | ||
|
|
d9e137e85a |
15
.github/workflows/docker-image.yml
vendored
15
.github/workflows/docker-image.yml
vendored
@@ -12,14 +12,25 @@ jobs:
|
||||
- name: Check out code
|
||||
uses: actions/checkout@v3
|
||||
- name: Build the Docker images
|
||||
run: docker build --no-cache -t quay.io/redhat-chaos/krkn containers/
|
||||
run: |
|
||||
docker build --no-cache -t quay.io/krkn-chaos/krkn containers/
|
||||
docker tag quay.io/krkn-chaos/krkn quay.io/redhat-chaos/krkn
|
||||
- name: Login in quay
|
||||
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
|
||||
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
|
||||
env:
|
||||
QUAY_USER: ${{ secrets.QUAY_USERNAME }}
|
||||
QUAY_TOKEN: ${{ secrets.QUAY_PASSWORD }}
|
||||
- name: Push the KrknChaos Docker images
|
||||
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
|
||||
run: docker push quay.io/krkn-chaos/krkn
|
||||
- name: Login in to redhat-chaos quay
|
||||
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
|
||||
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
|
||||
env:
|
||||
QUAY_USER: ${{ secrets.QUAY_USER_1 }}
|
||||
QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}
|
||||
- name: Push the Docker images
|
||||
- name: Push the RedHat Chaos Docker images
|
||||
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
|
||||
run: docker push quay.io/redhat-chaos/krkn
|
||||
- name: Rebuild krkn-hub
|
||||
|
||||
51
README.md
51
README.md
@@ -1,11 +1,11 @@
|
||||
# KrknChaos aka Kraken
|
||||
# Krkn aka Kraken
|
||||
[](https://quay.io/repository/redhat-chaos/krkn?tab=tags&tag=latest)
|
||||

|
||||
|
||||

|
||||
|
||||
Chaos and resiliency testing tool for Kubernetes and OpenShift.
|
||||
Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check if it is resilient to turbulent conditions.
|
||||
Chaos and resiliency testing tool for Kubernetes.
|
||||
Kraken injects deliberate failures into Kubernetes clusters to check if it is resilient to turbulent conditions.
|
||||
|
||||
|
||||
### Workflow
|
||||
@@ -18,13 +18,13 @@ Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check i
|
||||
### Chaos Testing Guide
|
||||
[Guide](docs/index.md) encapsulates:
|
||||
- Test methodology that needs to be embraced.
|
||||
- Best practices that an OpenShift cluster, platform and applications running on top of it should take into account for best user experience, performance, resilience and reliability.
|
||||
- Best practices that an Kubernetes cluster, platform and applications running on top of it should take into account for best user experience, performance, resilience and reliability.
|
||||
- Tooling.
|
||||
- Scenarios supported.
|
||||
- Test environment recommendations as to how and where to run chaos tests.
|
||||
- Chaos testing in practice.
|
||||
|
||||
The guide is hosted at https://redhat-chaos.github.io/krknChoas.
|
||||
The guide is hosted at https://krkn-chaos.github.io/krkn.
|
||||
|
||||
|
||||
### How to Get Started
|
||||
@@ -57,29 +57,29 @@ This will manage the Cerberus and Elasticsearch containers on the host on which
|
||||
Instructions on how to setup the config and the options supported can be found at [Config](docs/config.md).
|
||||
|
||||
|
||||
### Kubernetes/OpenShift chaos scenarios supported
|
||||
### Kubernetes chaos scenarios supported
|
||||
|
||||
Scenario type | Kubernetes | OpenShift
|
||||
--------------------------- | ------------- |--------------------|
|
||||
[Pod Scenarios](docs/pod_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Pod Network Scenarios](docs/pod_network_scenarios.md) | :x: | :heavy_check_mark: |
|
||||
[Container Scenarios](docs/container_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Node Scenarios](docs/node_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Time Scenarios](docs/time_scenarios.md) | :x: | :heavy_check_mark: |
|
||||
[Hog Scenarios: CPU, Memory](docs/arcaflow_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Cluster Shut Down Scenarios](docs/cluster_shut_down_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Service Disruption Scenarios](docs/service_disruption_scenarios.md.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Zone Outage Scenarios](docs/zone_outage.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Application_outages](docs/application_outages.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: | :heavy_check_mark: |
|
||||
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: | :question: |
|
||||
Scenario type | Kubernetes
|
||||
--------------------------- | ------------- |
|
||||
[Pod Scenarios](docs/pod_scenarios.md) | :heavy_check_mark: |
|
||||
[Pod Network Scenarios](docs/pod_network_scenarios.md) | :x: |
|
||||
[Container Scenarios](docs/container_scenarios.md) | :heavy_check_mark: |
|
||||
[Node Scenarios](docs/node_scenarios.md) | :heavy_check_mark: |
|
||||
[Time Scenarios](docs/time_scenarios.md) | :x: |
|
||||
[Hog Scenarios: CPU, Memory](docs/arcaflow_scenarios.md) | :heavy_check_mark: |
|
||||
[Cluster Shut Down Scenarios](docs/cluster_shut_down_scenarios.md) | :heavy_check_mark: |
|
||||
[Service Disruption Scenarios](docs/service_disruption_scenarios.md.md) | :heavy_check_mark: |
|
||||
[Zone Outage Scenarios](docs/zone_outage.md) | :heavy_check_mark: |
|
||||
[Application_outages](docs/application_outages.md) | :heavy_check_mark: |
|
||||
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: |
|
||||
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: |
|
||||
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: |
|
||||
|
||||
|
||||
### Kraken scenario pass/fail criteria and report
|
||||
It is important to make sure to check if the targeted component recovered from the chaos injection and also if the Kubernetes/OpenShift cluster is healthy as failures in one component can have an adverse impact on other components. Kraken does this by:
|
||||
- Having built in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
|
||||
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
|
||||
- Leveraging [Cerberus](https://github.com/redhat-chaos/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
|
||||
- Leveraging built-in alert collection feature to fail the runs in case of critical alerts.
|
||||
|
||||
### Signaling
|
||||
@@ -94,10 +94,6 @@ More detailed information on enabling and leveraging this feature can be found [
|
||||
Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to find out the bottlenecks as it is important to make sure the cluster is healthy in terms if both recovery as well as performance during/after the failure has been injected. Instructions on enabling it can be found [here](docs/performance_dashboards.md).
|
||||
|
||||
|
||||
### Scraping and storing metrics long term
|
||||
Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes then into Elasticsearch to be able to store and evaluate the state of the runs long term. The indexed metrics can be visualized with the help of Grafana. It uses [Kube-burner](https://github.com/kube-burner/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus ( installed by default in OpenShift ) with the start and end timestamp of the run. Information on enabling and leveraging this feature can be found [here](docs/metrics.md).
|
||||
|
||||
|
||||
### SLOs validation during and post chaos
|
||||
- In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with the Prometheus expressions to validate and alerts, exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics.
|
||||
- Kraken also provides ability to check if any critical alerts are firing in the cluster post chaos and pass/fail's.
|
||||
@@ -116,7 +112,8 @@ Kraken supports injecting faults into [Open Cluster Management (OCM)](https://op
|
||||
- Blog post emphasizing the importance of making Chaos part of Performance and Scale runs to mimic the production environments: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests
|
||||
- Blog post on findings from Chaos test runs: https://cloud.redhat.com/blog/openshift/kubernetes-chaos-stories
|
||||
- Discussion with CNCF TAG App Delivery on Krkn workflow, features and addition to CNCF sandbox: [Github](https://github.com/cncf/sandbox/issues/44), [Tracker](https://github.com/cncf/tag-app-delivery/issues/465), [recording](https://www.youtube.com/watch?v=nXQkBFK_MWc&t=722s)
|
||||
|
||||
- Blog post on supercharging chaos testing using AI integration in Krkn: https://www.redhat.com/en/blog/supercharging-chaos-testing-using-ai
|
||||
- Blog post announcing Krkn joining CNCF Sandbox: https://www.redhat.com/en/blog/krknchaos-joining-cncf-sandbox
|
||||
|
||||
### Roadmap
|
||||
Enhancements being planned can be found in the [roadmap](ROADMAP.md).
|
||||
|
||||
@@ -51,8 +51,6 @@ cerberus:
|
||||
performance_monitoring:
|
||||
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
|
||||
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
|
||||
capture_metrics: False
|
||||
metrics_profile_path: config/metrics-aggregated.yaml
|
||||
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
|
||||
uuid: # uuid for the run is generated by default if not set
|
||||
|
||||
@@ -20,8 +20,6 @@ cerberus:
|
||||
performance_monitoring:
|
||||
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
|
||||
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
|
||||
capture_metrics: False
|
||||
metrics_profile_path: config/metrics-aggregated.yaml
|
||||
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
|
||||
uuid: # uuid for the run is generated by default if not set
|
||||
|
||||
@@ -19,8 +19,6 @@ cerberus:
|
||||
performance_monitoring:
|
||||
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
|
||||
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
|
||||
capture_metrics: False
|
||||
metrics_profile_path: config/metrics-aggregated.yaml
|
||||
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
|
||||
uuid: # uuid for the run is generated by default if not set
|
||||
|
||||
@@ -4,8 +4,6 @@ FROM mcr.microsoft.com/azure-cli:latest as azure-cli
|
||||
|
||||
FROM registry.access.redhat.com/ubi8/ubi:latest
|
||||
|
||||
LABEL org.opencontainers.image.authors="Red Hat OpenShift Chaos Engineering"
|
||||
|
||||
ENV KUBECONFIG /root/.kube/config
|
||||
|
||||
# Copy azure client binary from azure-cli image
|
||||
@@ -14,7 +12,7 @@ COPY --from=azure-cli /usr/local/bin/az /usr/bin/az
|
||||
# Install dependencies
|
||||
RUN yum install -y git python39 python3-pip jq gettext wget && \
|
||||
python3.9 -m pip install -U pip && \
|
||||
git clone https://github.com/redhat-chaos/krkn.git --branch v1.5.3 /root/kraken && \
|
||||
git clone https://github.com/krkn-chaos/krkn.git --branch v1.5.4 /root/kraken && \
|
||||
mkdir -p /root/.kube && cd /root/kraken && \
|
||||
pip3.9 install -r requirements.txt && \
|
||||
pip3.9 install virtualenv && \
|
||||
|
||||
@@ -14,7 +14,7 @@ COPY --from=azure-cli /usr/local/bin/az /usr/bin/az
|
||||
# Install dependencies
|
||||
RUN yum install -y git python39 python3-pip jq gettext wget && \
|
||||
python3.9 -m pip install -U pip && \
|
||||
git clone https://github.com/redhat-chaos/krkn.git --branch v1.5.3 /root/kraken && \
|
||||
git clone https://github.com/redhat-chaos/krkn.git --branch v1.5.4 /root/kraken && \
|
||||
mkdir -p /root/.kube && cd /root/kraken && \
|
||||
pip3.9 install -r requirements.txt && \
|
||||
pip3.9 install virtualenv && \
|
||||
|
||||
@@ -1,31 +0,0 @@
|
||||
## Scraping and storing metrics for the run
|
||||
|
||||
There are cases where the state of the cluster and metrics on the cluster during the chaos test run need to be stored long term to review after the cluster is terminated, for example CI and automation test runs. To help with this, Kraken supports capturing metrics for the duration of the scenarios defined in the config.
|
||||
|
||||
The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus with the start and end timestamp of the run. Each run has a unique identifier ( uuid ). The uuid is generated automatically if not set in the config. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:
|
||||
|
||||
```
|
||||
performance_monitoring:
|
||||
capture_metrics: True
|
||||
metrics_profile_path: config/metrics-aggregated.yaml
|
||||
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
|
||||
uuid: # uuid for the run is generated by default if not set.
|
||||
```
|
||||
|
||||
### Metrics profile
|
||||
A couple of [metric profiles](https://github.com/redhat-chaos/krkn/tree/main/config), [metrics.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/metrics.yaml), and [metrics-aggregated.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/metrics-aggregated.yaml) are shipped by default and can be tweaked to add more metrics to capture during the run. The following are the API server metrics for example:
|
||||
|
||||
```
|
||||
metrics:
|
||||
# API server
|
||||
- query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb!~"WATCH", subresource!="log"}[2m])) by (verb,resource,subresource,instance,le)) > 0
|
||||
metricName: API99thLatency
|
||||
|
||||
- query: sum(irate(apiserver_request_total{apiserver="kube-apiserver",verb!="WATCH",subresource!="log"}[2m])) by (verb,instance,resource,code) > 0
|
||||
metricName: APIRequestRate
|
||||
|
||||
- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
|
||||
metricName: APIInflightRequests
|
||||
```
|
||||
|
||||
@@ -2,14 +2,18 @@ import datetime
|
||||
import time
|
||||
import logging
|
||||
import re
|
||||
|
||||
import yaml
|
||||
import random
|
||||
|
||||
from krkn_lib import utils
|
||||
from kubernetes.client import ApiException
|
||||
|
||||
from ..cerberus import setup as cerberus
|
||||
from ..invoke import command as runcommand
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
|
||||
from krkn_lib.models.telemetry import ScenarioTelemetry
|
||||
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
|
||||
from krkn_lib.utils.functions import get_yaml_item_value, log_exception, get_random_string
|
||||
|
||||
|
||||
# krkn_lib
|
||||
@@ -35,13 +39,6 @@ def pod_exec(pod_name, command, namespace, container_name, kubecli:KrknKubernete
|
||||
return response
|
||||
|
||||
|
||||
def node_debug(node_name, command):
|
||||
response = runcommand.invoke(
|
||||
"oc debug node/" + node_name + " -- chroot /host " + command
|
||||
)
|
||||
return response
|
||||
|
||||
|
||||
# krkn_lib
|
||||
def get_container_name(pod_name, namespace, kubecli:KrknKubernetes, container_name=""):
|
||||
|
||||
@@ -65,15 +62,46 @@ def get_container_name(pod_name, namespace, kubecli:KrknKubernetes, container_na
|
||||
return container_name
|
||||
|
||||
|
||||
|
||||
def skew_node(node_name: str, action: str, kubecli: KrknKubernetes):
|
||||
pod_namespace = "default"
|
||||
status_pod_name = f"time-skew-pod-{get_random_string(5)}"
|
||||
skew_pod_name = f"time-skew-pod-{get_random_string(5)}"
|
||||
ntp_enabled = True
|
||||
logging.info(f'Creating pod to skew {"time" if action == "skew_time" else "date"} on node {node_name}')
|
||||
status_command = ["timedatectl"]
|
||||
param = "2001-01-01"
|
||||
skew_command = ["timedatectl", "set-time"]
|
||||
if action == "skew_time":
|
||||
skew_command.append("01:01:01")
|
||||
else:
|
||||
skew_command.append("2001-01-01")
|
||||
|
||||
try:
|
||||
status_response = kubecli.exec_command_on_node(node_name, status_command, status_pod_name, pod_namespace)
|
||||
if "Network time on: no" in status_response:
|
||||
ntp_enabled = False
|
||||
|
||||
logging.warning(f'ntp unactive on node {node_name} skewing {"time" if action == "skew_time" else "date"} to {param}')
|
||||
pod_exec(skew_pod_name, skew_command, pod_namespace, None, kubecli)
|
||||
else:
|
||||
logging.info(f'ntp active in cluster node, {"time" if action == "skew_time" else "date"} skewing will have no effect, skipping')
|
||||
except ApiException:
|
||||
pass
|
||||
except Exception as e:
|
||||
logging.error(f"failed to execute skew command in pod: {e}")
|
||||
finally:
|
||||
kubecli.delete_pod(status_pod_name, pod_namespace)
|
||||
if not ntp_enabled :
|
||||
kubecli.delete_pod(skew_pod_name, pod_namespace)
|
||||
|
||||
|
||||
|
||||
# krkn_lib
|
||||
def skew_time(scenario, kubecli:KrknKubernetes):
|
||||
skew_command = "date --date "
|
||||
if scenario["action"] == "skew_date":
|
||||
skewed_date = "00-01-01"
|
||||
skew_command += skewed_date
|
||||
elif scenario["action"] == "skew_time":
|
||||
skewed_time = "01:01:01"
|
||||
skew_command += skewed_time
|
||||
if scenario["action"] not in ["skew_date","skew_time"]:
|
||||
raise RuntimeError(f'{scenario["action"]} is not a valid time skew action')
|
||||
|
||||
if "node" in scenario["object_type"]:
|
||||
node_names = []
|
||||
if "object_name" in scenario.keys() and scenario["object_name"]:
|
||||
@@ -83,13 +111,19 @@ def skew_time(scenario, kubecli:KrknKubernetes):
|
||||
scenario["label_selector"]
|
||||
):
|
||||
node_names = kubecli.list_nodes(scenario["label_selector"])
|
||||
|
||||
for node in node_names:
|
||||
node_debug(node, skew_command)
|
||||
skew_node(node, scenario["action"], kubecli)
|
||||
logging.info("Reset date/time on node " + str(node))
|
||||
return "node", node_names
|
||||
|
||||
elif "pod" in scenario["object_type"]:
|
||||
skew_command = "date --date "
|
||||
if scenario["action"] == "skew_date":
|
||||
skewed_date = "00-01-01"
|
||||
skew_command += skewed_date
|
||||
elif scenario["action"] == "skew_time":
|
||||
skewed_time = "01:01:01"
|
||||
skew_command += skewed_time
|
||||
container_name = get_yaml_item_value(scenario, "container_name", "")
|
||||
pod_names = []
|
||||
if "object_name" in scenario.keys() and scenario["object_name"]:
|
||||
@@ -241,7 +275,8 @@ def check_date_time(object_type, names, kubecli:KrknKubernetes):
|
||||
if object_type == "node":
|
||||
for node_name in names:
|
||||
first_date_time = datetime.datetime.utcnow()
|
||||
node_datetime_string = node_debug(node_name, skew_command)
|
||||
check_pod_name = f"time-skew-pod-{get_random_string(5)}"
|
||||
node_datetime_string = kubecli.exec_command_on_node(node_name, [skew_command], check_pod_name)
|
||||
node_datetime = string_to_date(node_datetime_string)
|
||||
counter = 0
|
||||
while not (
|
||||
@@ -252,7 +287,8 @@ def check_date_time(object_type, names, kubecli:KrknKubernetes):
|
||||
"Date/time on node %s still not reset, "
|
||||
"waiting 10 seconds and retrying" % node_name
|
||||
)
|
||||
node_datetime_string = node_debug(node_name, skew_command)
|
||||
|
||||
node_datetime_string = kubecli.exec_cmd_in_pod([skew_command], check_pod_name, "default")
|
||||
node_datetime = string_to_date(node_datetime_string)
|
||||
counter += 1
|
||||
if counter > max_retries:
|
||||
@@ -266,6 +302,8 @@ def check_date_time(object_type, names, kubecli:KrknKubernetes):
|
||||
logging.info(
|
||||
"Date in node " + str(node_name) + " reset properly"
|
||||
)
|
||||
kubecli.delete_pod(check_pod_name)
|
||||
|
||||
elif object_type == "pod":
|
||||
for pod_name in names:
|
||||
first_date_time = datetime.datetime.utcnow()
|
||||
|
||||
Binary file not shown.
|
Before Width: | Height: | Size: 248 KiB After Width: | Height: | Size: 283 KiB |
@@ -18,8 +18,8 @@ google-api-python-client
|
||||
ibm_cloud_sdk_core
|
||||
ibm_vpc
|
||||
itsdangerous==2.0.1
|
||||
jinja2==3.0.3
|
||||
krkn-lib >= 1.4.5
|
||||
jinja2==3.1.3
|
||||
krkn-lib >= 1.4.6
|
||||
kubernetes
|
||||
lxml >= 4.3.0
|
||||
oauth2client>=4.1.3
|
||||
|
||||
@@ -170,7 +170,10 @@ def main(cfg):
|
||||
# KrknTelemetry init
|
||||
telemetry_k8s = KrknTelemetryKubernetes(safe_logger, kubecli)
|
||||
telemetry_ocp = KrknTelemetryOpenshift(safe_logger, ocpcli)
|
||||
prometheus = KrknPrometheus(prometheus_url, prometheus_bearer_token)
|
||||
|
||||
|
||||
if enable_alerts:
|
||||
prometheus = KrknPrometheus(prometheus_url, prometheus_bearer_token)
|
||||
|
||||
logging.info("Server URL: %s" % kubecli.get_host())
|
||||
|
||||
@@ -200,6 +203,7 @@ def main(cfg):
|
||||
|
||||
# Capture the start time
|
||||
start_time = int(time.time())
|
||||
|
||||
chaos_telemetry = ChaosRunTelemetry()
|
||||
chaos_telemetry.run_uuid = run_uuid
|
||||
# Loop to run the chaos starts here
|
||||
@@ -280,16 +284,9 @@ def main(cfg):
|
||||
# in the config
|
||||
# krkn_lib
|
||||
elif scenario_type == "time_scenarios":
|
||||
if distribution == "openshift":
|
||||
logging.info("Running time skew scenarios")
|
||||
failed_post_scenarios, scenario_telemetries = time_actions.run(scenarios_list, config, wait_duration, kubecli, telemetry_k8s)
|
||||
chaos_telemetry.scenarios.extend(scenario_telemetries)
|
||||
else:
|
||||
logging.error(
|
||||
"Litmus scenarios are currently "
|
||||
"supported only on openshift"
|
||||
)
|
||||
sys.exit(1)
|
||||
# Inject cluster shutdown scenarios
|
||||
# krkn_lib
|
||||
elif scenario_type == "cluster_shut_down_scenarios":
|
||||
@@ -337,12 +334,18 @@ def main(cfg):
|
||||
failed_post_scenarios, scenario_telemetries = network_chaos.run(scenarios_list, config, wait_duration, kubecli, telemetry_k8s)
|
||||
|
||||
# Check for critical alerts when enabled
|
||||
if check_critical_alerts:
|
||||
if enable_alerts and check_critical_alerts :
|
||||
logging.info("Checking for critical alerts firing post choas")
|
||||
|
||||
##PROM
|
||||
query = r"""ALERTS{severity="critical"}"""
|
||||
critical_alerts = prometheus.process_prom_query_in_range(query, datetime.datetime.fromtimestamp(start_time))
|
||||
end_time = datetime.datetime.now()
|
||||
critical_alerts = prometheus.process_prom_query_in_range(
|
||||
query,
|
||||
start_time = datetime.datetime.fromtimestamp(start_time),
|
||||
end_time = end_time
|
||||
|
||||
)
|
||||
critical_alerts_count = len(critical_alerts)
|
||||
if critical_alerts_count > 0:
|
||||
logging.error("Critical alerts are firing: %s", critical_alerts)
|
||||
|
||||
Reference in New Issue
Block a user