removed deprecated ES fields + removed host validator (#774 )

DCO Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
adding scenario type (#758 )
2026-02-17 19:40:01 +00:00 · 2025-03-19 13:10:44 -04:00 · 2025-03-19 17:38:45 +01:00 · 2025-03-19 17:08:44 +01:00 · 2025-03-19 15:02:12 +00:00 · 2025-03-19 14:28:35 +00:00
33 changed files with 964 additions and 918 deletions
--- a/CI/config/common_test_config.yaml
+++ b/CI/config/common_test_config.yaml
@@ -62,3 +62,11 @@ elastic:
    metrics_index: "krkn-metrics"
    alerts_index: "krkn-alerts"
    telemetry_index: "krkn-telemetry"
+
+health_checks:                                              # Utilizing health check endpoints to observe application behavior during chaos injection.
+    interval:                                               # Interval in seconds to perform health checks, default value is 2 seconds
+    config:                                                 # Provide list of health check configurations for applications
+        - url:                                              # Provide application endpoint
+          bearer_token:                                     # Bearer token for authentication if any
+          auth:                                             # Provide authentication credentials (username , password) in tuple format if any, ex:("admin","secretpassword")
+          exit_on_failure:                                  # If value is True exits when health check failed for application, values can be True/False
--- a/README.md
+++ b/README.md
@@ -72,6 +72,7 @@ It is important to make sure to check if the targeted component recovered from t
 - Having built in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
 - Leveraging [Cerberus](https://github.com/krkn-chaos/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/krkn-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
 - Leveraging built-in alert collection feature to fail the runs in case of critical alerts.
+- Utilizing health check endpoints to observe application behavior during chaos injection [Health checks](https://github.com/krkn-chaos/krkn/docs/health_checks.md)

 ### Signaling
 In CI runs or any external job it is useful to stop Kraken once a certain test or state gets reached. We created a way to signal to kraken to pause the chaos or stop it completely using a signal posted to a port of your choice.
--- a/config/config.yaml
+++ b/config/config.yaml
@@ -1,5 +1,4 @@
 kraken:
-    distribution: kubernetes                                # Distribution can be kubernetes or openshift
    kubeconfig_path: ~/.kube/config                     # Path to kubeconfig
    exit_on_failure: False                                 # Exit when a post action scenario fails
    publish_kraken_status: True                            # Can be accessed at http://0.0.0.0:8081
@@ -25,12 +24,10 @@ kraken:
            - scenarios/openshift/prom_kill.yml
            - scenarios/openshift/openshift-apiserver.yml
            - scenarios/openshift/openshift-kube-apiserver.yml
-        - vmware_node_scenarios:
-            - scenarios/openshift/vmware_node_scenarios.yml
-        - ibmcloud_node_scenarios:
-            - scenarios/openshift/ibmcloud_node_scenarios.yml
        - node_scenarios:                                  # List of chaos node scenarios to load
            - scenarios/openshift/aws_node_scenarios.yml
+            - scenarios/openshift/vmware_node_scenarios.yml
+            - scenarios/openshift/ibmcloud_node_scenarios.yml
        - time_scenarios:                                  # List of chaos time scenarios to load
            - scenarios/openshift/time_scenarios_example.yml
        - cluster_shut_down_scenarios:
@@ -63,12 +60,10 @@ performance_monitoring:
    enable_alerts: False                                  # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
    enable_metrics: False
    alert_profile: config/alerts.yaml                          # Path or URL to alert profile with the prometheus queries
-    metrics_profile: config/metrics.yaml
+    metrics_profile: config/metrics-report.yaml
    check_critical_alerts: False                          # When enabled will check prometheus for critical alerts firing post chaos
 elastic:
    enable_elastic: False
-    collect_metrics: False
-    collect_alerts: False
    verify_certs: False
    elastic_url: ""                                         # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
    elastic_port: 32766
@@ -112,7 +107,10 @@ telemetry:
    oc_cli_path: /usr/bin/oc                                # optional, if not specified will be search in $PATH
    events_backup: True                                     # enables/disables cluster events collection

-
-
-
-
+health_checks:                                              # Utilizing health check endpoints to observe application behavior during chaos injection.
+    interval:                                               # Interval in seconds to perform health checks, default value is 2 seconds
+    config:                                                 # Provide list of health check configurations for applications
+        - url:                                              # Provide application endpoint
+          bearer_token:                                     # Bearer token for authentication if any
+          auth:                                             # Provide authentication credentials (username , password) in tuple format if any, ex:("admin","secretpassword")
+          exit_on_failure:                                  # If value is True exits when health check failed for application, values can be True/False
--- a/config/metrics-report.yaml
+++ b/config/metrics-report.yaml
@@ -0,0 +1,248 @@
+metrics:
+
+# API server
+- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
+  metricName: APIInflightRequests
+  instant: true
+
+# Kubelet & CRI-O
+
+# Average and max of the CPU usage from all worker's kubelet
+- query: avg(avg_over_time(irate(process_cpu_seconds_total{service="kubelet",job="kubelet"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: cpu-kubelet
+  instant: true
+
+- query: max(max_over_time(irate(process_cpu_seconds_total{service="kubelet",job="kubelet"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: max-cpu-kubelet
+  instant: true
+
+# Average of the memory usage from all worker's kubelet
+- query: avg(avg_over_time(process_resident_memory_bytes{service="kubelet",job="kubelet"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: memory-kubelet
+  instant: true
+
+# Max of the memory usage from all worker's kubelet
+- query: max(max_over_time(process_resident_memory_bytes{service="kubelet",job="kubelet"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: max-memory-kubelet
+  instant: true
+
+- query: max_over_time(sum(process_resident_memory_bytes{service="kubelet",job="kubelet"} and on (node) kube_node_role{role="worker"})[.elapsed:])
+  metricName: max-memory-sum-kubelet
+  instant: true
+
+# Average and max of the CPU usage from all worker's CRI-O
+- query: avg(avg_over_time(irate(process_cpu_seconds_total{service="kubelet",job="crio"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: cpu-crio
+  instant: true
+
+- query: max(max_over_time(irate(process_cpu_seconds_total{service="kubelet",job="crio"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: max-cpu-crio
+  instant: true
+
+# Average of the memory usage from all worker's CRI-O
+- query: avg(avg_over_time(process_resident_memory_bytes{service="kubelet",job="crio"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: memory-crio
+  instant: true
+
+# Max of the memory usage from all worker's CRI-O
+- query: max(max_over_time(process_resident_memory_bytes{service="kubelet",job="crio"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
+  metricName: max-memory-crio
+  instant: true
+
+# Etcd
+
+- query: avg(avg_over_time(histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[2m]))[.elapsed:]))
+  metricName: 99thEtcdDiskBackendCommit
+  instant: true
+
+- query: avg(avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[.elapsed:]))
+  metricName: 99thEtcdDiskWalFsync
+  instant: true
+
+- query: avg(avg_over_time(histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[2m]))[.elapsed:]))
+  metricName: 99thEtcdRoundTripTime
+  instant: true
+
+# Control-plane
+
+- query: avg(avg_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-controller-manager"}[2m])) by (pod))[.elapsed:]))
+  metricName: cpu-kube-controller-manager
+  instant: true
+
+- query: max(max_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-controller-manager"}[2m])) by (pod))[.elapsed:]))
+  metricName: max-cpu-kube-controller-manager
+  instant: true
+
+- query: avg(avg_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-kube-controller-manager"}) by (pod))[.elapsed:]))
+  metricName: memory-kube-controller-manager
+  instant: true
+
+- query: max(max_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-kube-controller-manager"}) by (pod))[.elapsed:]))
+  metricName: max-memory-kube-controller-manager
+  instant: true
+
+- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-apiserver"}[2m])) by (pod))[.elapsed:]))
+  metricName: cpu-kube-apiserver
+  instant: true
+
+- query: avg(avg_over_time(topk(3, sum(container_memory_rss{name!="", namespace="openshift-kube-apiserver"}) by (pod))[.elapsed:]))
+  metricName: memory-kube-apiserver
+  instant: true
+
+- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-apiserver"}[2m])) by (pod))[.elapsed:]))
+  metricName: cpu-openshift-apiserver
+  instant: true
+
+- query: avg(avg_over_time(topk(3, sum(container_memory_rss{name!="", namespace="openshift-apiserver"}) by (pod))[.elapsed:]))
+  metricName: memory-openshift-apiserver
+  instant: true
+
+- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-etcd"}[2m])) by (pod))[.elapsed:]))
+  metricName: cpu-etcd
+  instant: true
+
+- query: avg(avg_over_time(topk(3,sum(container_memory_rss{name!="", namespace="openshift-etcd"}) by (pod))[.elapsed:]))
+  metricName: memory-etcd
+  instant: true
+
+- query: avg(avg_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-controller-manager"}[2m])) by (pod))[.elapsed:]))
+  metricName: cpu-openshift-controller-manager
+  instant: true
+
+- query: avg(avg_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-controller-manager"}) by (pod))[.elapsed:]))
+  metricName: memory-openshift-controller-manager
+  instant: true
+
+ # multus
+
+- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-multus", pod=~"(multus).+", container!="POD"}[2m])[.elapsed:])) by (container)
+  metricName: cpu-multus
+  instant: true
+
+- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-multus", pod=~"(multus).+", container!="POD"}[.elapsed:])) by (container)
+  metricName: memory-multus
+  instant: true
+
+# OVNKubernetes - standard & IC
+
+- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ovn-kubernetes", pod=~"(ovnkube-master|ovnkube-control-plane).+", container!="POD"}[2m])[.elapsed:])) by (container)
+  metricName: cpu-ovn-control-plane
+  instant: true
+
+- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-ovn-kubernetes", pod=~"(ovnkube-master|ovnkube-control-plane).+", container!="POD"}[.elapsed:])) by (container)
+  metricName: memory-ovn-control-plane
+  instant: true
+
+- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ovn-kubernetes", pod=~"ovnkube-node.+", container!="POD"}[2m])[.elapsed:])) by (container)
+  metricName: cpu-ovnkube-node
+  instant: true
+
+- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-ovn-kubernetes", pod=~"ovnkube-node.+", container!="POD"}[.elapsed:])) by (container)
+  metricName: memory-ovnkube-node
+  instant: true
+
+# Nodes
+
+- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
+  metricName: cpu-masters
+  instant: true
+
+- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
+  metricName: memory-masters
+  instant: true
+
+- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
+  metricName: max-memory-masters
+  instant: true
+
+- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
+  metricName: cpu-workers
+  instant: true
+
+- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
+  metricName: max-cpu-workers
+  instant: true
+
+- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
+  metricName: memory-workers
+  instant: true
+
+- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
+  metricName: max-memory-workers
+  instant: true
+
+- query: sum( (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)") )
+  metricName: memory-sum-workers
+  instant: true
+
+
+- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
+  metricName: cpu-infra
+  instant: true
+
+- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
+  metricName: max-cpu-infra
+  instant: true
+
+- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
+  metricName: memory-infra
+  instant: true
+
+- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
+  metricName: max-memory-infra
+  instant: true
+
+- query: max_over_time(sum((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))[.elapsed:])
+  metricName: max-memory-sum-infra
+  instant: true
+
+# Monitoring and ingress
+
+- query: avg(avg_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}[2m])) by (pod)[.elapsed:]))
+  metricName: cpu-prometheus
+  instant: true
+
+- query: max(max_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}[2m])) by (pod)[.elapsed:]))
+  metricName: max-cpu-prometheus
+  instant: true
+
+- query: avg(avg_over_time(sum(container_memory_rss{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}) by (pod)[.elapsed:]))
+  metricName: memory-prometheus
+  instant: true
+
+- query: max(max_over_time(sum(container_memory_rss{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}) by (pod)[.elapsed:]))
+  metricName: max-memory-prometheus
+  instant: true
+
+- query: avg(avg_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ingress", pod=~"router-default.+"}[2m])) by (pod)[.elapsed:]))
+  metricName: cpu-router
+  instant: true
+
+- query: avg(avg_over_time(sum(container_memory_rss{name!="", namespace="openshift-ingress", pod=~"router-default.+"}) by (pod)[.elapsed:]))
+  metricName: memory-router
+  instant: true
+
+# Cluster
+
+- query: avg_over_time(cluster:memory_usage:ratio[.elapsed:])
+  metricName: memory-cluster-usage-ratio
+  instant: true
+
+- query: avg_over_time(cluster:node_cpu:ratio[.elapsed:])
+  metricName: cpu-cluster-usage-ratio
+  instant: true
+
+# Retain the raw CPU seconds totals for comparison
+- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="worker",role!="infra"}, "instance", "$1", "node", "(.+)")) by (mode)
+  metricName: nodeCPUSeconds-Workers
+  instant: true
+
+
+- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (mode)
+  metricName: nodeCPUSeconds-Masters
+  instant: true
+
+
+- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (mode)
+  metricName: nodeCPUSeconds-Infra
+  instant: true
--- a/config/metrics.yaml
+++ b/config/metrics.yaml
@@ -1,13 +1,7 @@
 metrics:
 # API server
-  - query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb!~"WATCH", subresource!="log"}[2m])) by (verb,resource,subresource,instance,le)) > 0
-    metricName: API99thLatency
-
-  - query: sum(irate(apiserver_request_total{apiserver="kube-apiserver",verb!="WATCH",subresource!="log"}[2m])) by (verb,instance,resource,code) > 0
-    metricName: APIRequestRate
-
-  - query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
-    metricName: APIInflightRequests
+  - query: irate(apiserver_request_total{verb="POST", resource="pods", subresource="binding",code="201"}[2m]) > 0
+    metricName: schedulingThroughput

 # Containers & pod metrics
  - query: sum(irate(container_cpu_usage_seconds_total{name!="",namespace=~"openshift-(etcd|oauth-apiserver|.*apiserver|ovn-kubernetes|sdn|ingress|authentication|.*controller-manager|.*scheduler|monitoring|logging|image-registry)"}[2m]) * 100) by (pod, namespace, node)
--- a/containers/krknctl-input.json
+++ b/containers/krknctl-input.json
@@ -125,7 +125,6 @@
    "variable": "ES_SERVER",
    "type": "string",
    "default": "http://0.0.0.0",
-    "validator": "^(http|https):\/\/.*",
    "required": "false"
  },
  {
@@ -166,28 +165,6 @@
    "default": "False",
    "required": "false"
  },
-  {
-    "name": "es-collect-metrics",
-    "short_description": "Enables metrics collection on elastic search",
-    "description": "Enables metrics collection on elastic search",
-    "variable": "ES_COLLECT_METRICS",
-    "type": "enum",
-    "allowed_values": "True,False",
-    "separator": ",",
-    "default": "False",
-    "required": "false"
-  },
-  {
-    "name": "es-collect-alerts",
-    "short_description": "Enables alerts collection on elastic search",
-    "description": "Enables alerts collection on elastic search",
-    "variable": "ES_COLLECT_ALERTS",
-    "type": "enum",
-    "allowed_values": "True,False",
-    "separator": ",",
-    "default": "False",
-    "required": "false"
-  },
  {
    "name": "es-metrics-index",
    "short_description": "Elasticsearch metrics index",
@@ -393,4 +370,3 @@
    "required": "false"
  }
 ]
-
--- a/docs/config.md
+++ b/docs/config.md
@@ -12,10 +12,6 @@ Config components:
 # Kraken 
 This section defines scenarios and specific data to the chaos run 

-## Distribution
-Either **openshift** or **kubernetes** depending on the type of cluster you want to run chaos on. 
-The prometheus url/route and bearer token are automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
-
 ## Exit on failure
 **exit_on_failure**:  Exit when a post action check or cerberus run fails

--- a/docs/health_checks.md
+++ b/docs/health_checks.md
@@ -0,0 +1,59 @@
+### Health Checks
+
+Health checks provide real-time visibility into the impact of chaos scenarios on application availability and performance. Health check configuration supports application endpoints accessible via http / https along with authentication mechanism such as bearer token and authentication credentials.
+Health checks are configured in the ```config.yaml```
+
+The system periodically checks the provided URLs based on the defined interval and records the results in Telemetry. The telemetry data includes:
+
+- Success response ```200``` when the application is running normally.
+- Failure response other than 200 if the application experiences downtime or errors.
+
+This helps users quickly identify application health issues and take necessary actions.
+
+#### Sample health check config
+```
+health_checks:
+  interval: <time_in_seconds>                       # Defines the frequency of health checks, default value is 2 seconds
+  config:                                           # List of application endpoints to check
+    - url: "https://example.com/health"
+      bearer_token: "hfjauljl..."                   # Bearer token for authentication if any
+      auth:                                         
+      exit_on_failure: True                         # If value is True exits when health check failed for application, values can be True/False
+    - url: "https://another-service.com/status"
+      bearer_token:
+      auth: ("admin","secretpassword")              # Provide authentication credentials (username , password) in tuple format if any, ex:("admin","secretpassword")
+      exit_on_failure: False
+    - url: http://general-service.com
+      bearer_token:
+      auth:
+      exit_on_failure:  
+```
+#### Sample health check telemetry
+```
+"health_checks": [
+            {
+                "url": "https://example.com/health",
+                "status": False,
+                "status_code": "503",
+                "start_timestamp": "2025-02-25 11:51:33",
+                "end_timestamp": "2025-02-25 11:51:40",
+                "duration": "0:00:07"
+            },
+            {
+                "url": "https://another-service.com/status",
+                "status": True,
+                "status_code": 200,
+                "start_timestamp": "2025-02-25 22:18:19",
+                "end_timestamp": "22025-02-25 22:22:46",
+                "duration": "0:04:27"
+            },
+            {
+                "url": "http://general-service.com",
+                "status": True,
+                "status_code": 200,
+                "start_timestamp": "2025-02-25 22:18:19",
+                "end_timestamp": "22025-02-25 22:22:46",
+                "duration": "0:04:27"
+            }
+        ],
+```
--- a/docs/index.md
+++ b/docs/index.md
@@ -88,7 +88,8 @@ We want to look at this in terms of CPU, Memory, Disk, Throughput, Network etc.
 - Appropriate caching and Content Delivery Network should be enabled to be performant and usable when there is a latency on the client side.
  - Not every user or machine has access to unlimited bandwidth, there might be a delay on the user side ( client ) to access the API’s due to limited bandwidth, throttling or latency depending on the geographic location. It is important to inject latency between the client and API calls to understand the behavior and optimize things including caching wherever possible, using CDN’s or opting for different protocols like HTTP/2 or HTTP/3 vs HTTP.

-
+- Ensure Disruption Budgets are enabled for your critical applications
+  - Protect your application during disruptions by setting a [pod disruption budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) to avoid downtime. For instance, etcd, zookeeper or similar applications need at least 2 replicas to maintain quorum. This can be ensured by setting PDB maxUnavailable to 1. 


 ### Tooling
--- a/docs/node_scenarios.md
+++ b/docs/node_scenarios.md
@@ -2,7 +2,7 @@

 The following node chaos scenarios are supported:

-1. **node_start_scenario**: Scenario to stop the node instance.
+1. **node_start_scenario**: Scenario to start the node instance.
 2. **node_stop_scenario**: Scenario to stop the node instance.
 3. **node_stop_start_scenario**: Scenario to stop the node instance for specified duration and then start the node instance. Not supported on VMware.
 4. **node_termination_scenario**: Scenario to terminate the node instance.
--- a/docs/service_disruption_scenarios.md
+++ b/docs/service_disruption_scenarios.md
@@ -1,6 +1,9 @@
 ###  Service Disruption Scenarios (Previously Delete Namespace Scenario)

-Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
+Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string. The goal of this scenario is to ensure Pod Disruption Budgets with appropriate configurations are set to have minimum number of replicas are running at a given time and avoid downtime.
+
+
+**NOTE**: Protect your application during disruptions by setting a [pod disruption budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) to avoid downtime. For instance, etcd, zookeeper or similar applications need at least 2 replicas to maintain quorum. This can be ensured by setting PDB maxUnavailable to 1. 

 Configuration Options:

--- a/krkn/prometheus/client.py
+++ b/krkn/prometheus/client.py
@@ -2,10 +2,11 @@ from __future__ import annotations

 import datetime
 import os.path
+import math
 from typing import Optional, List, Dict, Any

-import urllib3
 import logging
+import urllib3
 import sys

 import yaml
@@ -25,8 +26,7 @@ def alerts(
    start_time,
    end_time,
    alert_profile,
-    elastic_collect_alerts,
-    elastic_alerts_index,
+    elastic_alerts_index
 ):

    if alert_profile is None or os.path.exists(alert_profile) is False:
@@ -46,6 +46,7 @@ def alerts(
        for alert in profile_yaml:
            if list(alert.keys()).sort() != ["expr", "description", "severity"].sort():
                logging.error(f"wrong alert {alert}, skipping")
+                continue

            processed_alert = prom_cli.process_alert(
                alert,
@@ -56,7 +57,6 @@ def alerts(
                processed_alert[0]
                and processed_alert[1]
                and elastic
-                and elastic_collect_alerts
            ):
                elastic_alert = ElasticAlert(
                    run_uuid=run_uuid,
@@ -156,15 +156,15 @@ def metrics(
    start_time,
    end_time,
    metrics_profile,
-    elastic_collect_metrics,
-    elastic_metrics_index,
+    elastic_metrics_index
 ) -> list[dict[str, list[(int, float)] | str]]:
-    metrics_list: list[dict[str, list[(int, float)] | str]] = []
+   
    if metrics_profile is None or os.path.exists(metrics_profile) is False:
        logging.error(f"{metrics_profile} alert profile does not exist")
        sys.exit(1)
    with open(metrics_profile) as profile:
        profile_yaml = yaml.safe_load(profile)
+
        if not profile_yaml["metrics"] or not isinstance(profile_yaml["metrics"], list):
            logging.error(
                f"{metrics_profile} wrong file format, alert profile must be "
@@ -172,30 +172,58 @@ def metrics(
                f"expr, description, severity"
            )
            sys.exit(1)
-
+        elapsed_ceil = math.ceil((end_time - start_time)/ 60 )
+        elapsed_time = str(elapsed_ceil) + "m"
+        metrics_list: list[dict[str, int | float | str]] = []
        for metric_query in profile_yaml["metrics"]:
-            if (
+            query = metric_query['query']
+            
+            # calculate elapsed time
+            if ".elapsed" in metric_query["query"]:
+                query = metric_query['query'].replace(".elapsed", elapsed_time)
+            if "instant" in list(metric_query.keys()) and metric_query['instant']:
+                metrics_result = prom_cli.process_query(
+                   query
+                )
+            elif (
                list(metric_query.keys()).sort()
-                != ["query", "metricName", "instant"].sort()
+                == ["query", "metricName"].sort()
            ):
-                logging.error(f"wrong alert {metric_query}, skipping")
-            metrics_result = prom_cli.process_prom_query_in_range(
-                metric_query["query"],
-                start_time=datetime.datetime.fromtimestamp(start_time),
-                end_time=datetime.datetime.fromtimestamp(end_time),
-            )
-
-            metric = {"name": metric_query["metricName"], "values": []}
+                metrics_result = prom_cli.process_prom_query_in_range(
+                    query,
+                    start_time=datetime.datetime.fromtimestamp(start_time),
+                    end_time=datetime.datetime.fromtimestamp(end_time), granularity=30
+                )
+            else: 
+                logging.info('didnt match keys')
+                continue
+            
            for returned_metric in metrics_result:
-                if "values" in returned_metric:
+                metric = {"query": query, "metricName": metric_query['metricName']}
+                for k,v in returned_metric['metric'].items():
+                    metric[k] = v
+                
+                if "values" in returned_metric: 
                    for value in returned_metric["values"]:
                        try:
-                            metric["values"].append((value[0], float(value[1])))
+                            metric['timestamp'] = str(datetime.datetime.fromtimestamp(value[0]))
+                            metric["value"] = float(value[1])
+                            # want double array of the known details and the metrics specific to each call                    
+                            metrics_list.append(metric.copy())
                        except ValueError:
                            pass
-            metrics_list.append(metric)
+                elif "value" in returned_metric:
+                    try:
+                        value = returned_metric["value"]
+                        metric['timestamp'] = str(datetime.datetime.fromtimestamp(value[0]))
+                        metric["value"] = float(value[1])

-        if elastic_collect_metrics and elastic:
+                        # want double array of the known details and the metrics specific to each call
+                        metrics_list.append(metric.copy())
+                    except ValueError:
+                        pass
+
+        if elastic:
            result = elastic.upload_metrics_to_elasticsearch(
                run_uuid=run_uuid, index=elastic_metrics_index, raw_data=metrics_list
            )
--- a/krkn/scenario_plugins/abstract_scenario_plugin.py
+++ b/krkn/scenario_plugins/abstract_scenario_plugin.py
@@ -68,6 +68,7 @@ class AbstractScenarioPlugin(ABC):

            scenario_telemetry = ScenarioTelemetry()
            scenario_telemetry.scenario = scenario_config
+            scenario_telemetry.scenario_type = self.get_scenario_types()[0]
            scenario_telemetry.start_timestamp = time.time()
            parsed_scenario_config = telemetry.set_parameters_base64(
                scenario_telemetry, scenario_config
@@ -103,7 +104,7 @@ class AbstractScenarioPlugin(ABC):

            if events_backup: 
                utils.populate_cluster_events(
-                    scenario_telemetry,
+                    krkn_config,
                    parsed_scenario_config,
                    telemetry.get_lib_kubernetes(),
                    int(scenario_telemetry.start_timestamp),
--- a/krkn/scenario_plugins/application_outage/application_outage_scenario_plugin.py
+++ b/krkn/scenario_plugins/application_outage/application_outage_scenario_plugin.py
@@ -3,7 +3,7 @@ import time
 import yaml
 from krkn_lib.models.telemetry import ScenarioTelemetry
 from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
-from krkn_lib.utils import get_yaml_item_value
+from krkn_lib.utils import get_yaml_item_value, get_random_string
 from jinja2 import Template
 from krkn import cerberus
 from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
@@ -33,17 +33,22 @@ class ApplicationOutageScenarioPlugin(AbstractScenarioPlugin):
                duration = get_yaml_item_value(scenario_config, "duration", 60)

                start_time = int(time.time())
+                policy_name = f"krkn-deny-{get_random_string(5)}"

-                network_policy_template = """---
+                network_policy_template = (
+                    """---
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        metadata:
-          name: kraken-deny
+          name: """
+                    + policy_name
+                    + """
        spec:
          podSelector:
            matchLabels: {{ pod_selector }}
          policyTypes: {{ traffic_type }}
        """
+                )
                t = Template(network_policy_template)
                rendered_spec = t.render(
                    pod_selector=pod_selector, traffic_type=traffic_type
@@ -65,7 +70,7 @@ class ApplicationOutageScenarioPlugin(AbstractScenarioPlugin):
                # unblock the traffic by deleting the network policy
                logging.info("Deleting the network policy")
                lib_telemetry.get_lib_kubernetes().delete_net_policy(
-                    "kraken-deny", namespace
+                    policy_name, namespace
                )

                logging.info(
--- a/krkn/scenario_plugins/container/container_scenario_plugin.py
+++ b/krkn/scenario_plugins/container/container_scenario_plugin.py
@@ -22,9 +22,7 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
        lib_telemetry: KrknTelemetryOpenshift,
        scenario_telemetry: ScenarioTelemetry,
    ) -> int:
-        start_time = int(time.time())
        pool = PodsMonitorPool(lib_telemetry.get_lib_kubernetes())
-        wait_duration = krkn_config["tunings"]["wait_duration"]
        try:
            with open(scenario, "r") as f:
                cont_scenario_config = yaml.full_load(f)
@@ -45,16 +43,10 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
                    )
                    return 1
                scenario_telemetry.affected_pods = result
-                logging.info("Waiting for the specified duration: %s" % (wait_duration))
-                time.sleep(wait_duration)
-
-                # capture end time
-                end_time = int(time.time())

                # publish cerberus status
-                cerberus.publish_kraken_status(krkn_config, [], start_time, end_time)
        except (RuntimeError, Exception):
-            logging.error("ContainerScenarioPlugin exiting due to Exception %s" % e)
+            logging.error("ContainerScenarioPlugin exiting due to Exception %s")
            return 1
        else:
            return 0
--- a/krkn/scenario_plugins/native/native_scenario_plugin.py
+++ b/krkn/scenario_plugins/native/native_scenario_plugin.py
@@ -49,7 +49,6 @@ class NativeScenarioPlugin(AbstractScenarioPlugin):
        return [
            "pod_disruption_scenarios",
            "pod_network_scenarios",
-            "ibmcloud_node_scenarios",
        ]

    def start_monitoring(self, pool: PodsMonitorPool, scenarios: list[Any]):
--- a/krkn/scenario_plugins/native/node_scenarios/ibmcloud_plugin.py
+++ b/krkn/scenario_plugins/native/node_scenarios/ibmcloud_plugin.py
@@ -1,589 +0,0 @@
-#!/usr/bin/env python
-import time
-import typing
-from os import environ
-from dataclasses import dataclass, field
-from traceback import format_exc
-import logging
-from krkn.scenario_plugins.native.node_scenarios import (
-    kubernetes_functions as kube_helper,
-)
-from arcaflow_plugin_sdk import validation, plugin
-from kubernetes import client, watch
-from ibm_vpc import VpcV1
-from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
-import sys
-
-
-class IbmCloud:
-    def __init__(self):
-        """
-        Initialize the ibm cloud client by using the the env variables:
-            'IBMC_APIKEY' 'IBMC_URL'
-        """
-        apiKey = environ.get("IBMC_APIKEY")
-        service_url = environ.get("IBMC_URL")
-        if not apiKey:
-            raise Exception("Environmental variable 'IBMC_APIKEY' is not set")
-        if not service_url:
-            raise Exception("Environmental variable 'IBMC_URL' is not set")
-        try:
-            authenticator = IAMAuthenticator(apiKey)
-            self.service = VpcV1(authenticator=authenticator)
-
-            self.service.set_service_url(service_url)
-        except Exception as e:
-            logging.error("error authenticating" + str(e))
-
-
-    # Get the instance ID of the node
-    def get_instance_id(self, node_name):
-        node_list = self.list_instances()
-        for node in node_list:
-            if node_name == node["vpc_name"]:
-                return node["vpc_id"]
-        logging.error("Couldn't find node with name " + str(node_name) + ", you could try another region")
-        sys.exit(1)
-
-    def delete_instance(self, instance_id):
-        """
-        Deletes the Instance whose name is given by 'instance_id'
-        """
-        try:
-            self.service.delete_instance(instance_id)
-            logging.info("Deleted Instance -- '{}'".format(instance_id))
-        except Exception as e:
-            logging.info("Instance '{}' could not be deleted. ".format(instance_id))
-            return False
-
-    def reboot_instances(self, instance_id):
-        """
-        Reboots the Instance whose name is given by 'instance_id'. Returns True if successful, or
-        returns False if the Instance is not powered on
-        """
-
-        try:
-            self.service.create_instance_action(
-                instance_id,
-                type="reboot",
-            )
-            logging.info("Reset Instance -- '{}'".format(instance_id))
-            return True
-        except Exception as e:
-            logging.info("Instance '{}' could not be rebooted".format(instance_id))
-            return False
-
-    def stop_instances(self, instance_id):
-        """
-        Stops the Instance whose name is given by 'instance_id'. Returns True if successful, or
-        returns False if the Instance is already stopped
-        """
-
-        try:
-            self.service.create_instance_action(
-                instance_id,
-                type="stop",
-            )
-            logging.info("Stopped Instance -- '{}'".format(instance_id))
-            return True
-        except Exception as e:
-            logging.info("Instance '{}' could not be stopped".format(instance_id))
-            logging.info("error" + str(e))
-            return False
-
-    def start_instances(self, instance_id):
-        """
-        Stops the Instance whose name is given by 'instance_id'. Returns True if successful, or
-        returns False if the Instance is already running
-        """
-
-        try:
-            self.service.create_instance_action(
-                instance_id,
-                type="start",
-            )
-            logging.info("Started Instance -- '{}'".format(instance_id))
-            return True
-        except Exception as e:
-            logging.info("Instance '{}' could not start running".format(instance_id))
-            return False
-
-    def list_instances(self):
-        """
-        Returns a list of Instances present in the datacenter
-        """
-        instance_names = []
-        try:
-            instances_result = self.service.list_instances().get_result()
-            instances_list = instances_result["instances"]
-            for vpc in instances_list:
-                instance_names.append({"vpc_name": vpc["name"], "vpc_id": vpc["id"]})
-            starting_count = instances_result["total_count"]
-            while instances_result["total_count"] == instances_result["limit"]:
-                instances_result = self.service.list_instances(
-                    start=starting_count
-                ).get_result()
-                instances_list = instances_result["instances"]
-                starting_count += instances_result["total_count"]
-                for vpc in instances_list:
-                    instance_names.append({"vpc_name": vpc.name, "vpc_id": vpc.id})
-        except Exception as e:
-            logging.error("Error listing out instances: " + str(e))
-            sys.exit(1)
-        return instance_names
-
-    def find_id_in_list(self, name, vpc_list):
-        for vpc in vpc_list:
-            if vpc["vpc_name"] == name:
-                return vpc["vpc_id"]
-
-    def get_instance_status(self, instance_id):
-        """
-        Returns the status of the Instance whose name is given by 'instance_id'
-        """
-
-        try:
-            instance = self.service.get_instance(instance_id).get_result()
-            state = instance["status"]
-            return state
-        except Exception as e:
-            logging.error(
-                "Failed to get node instance status %s. Encountered following "
-                "exception: %s." % (instance_id, e)
-            )
-            return None
-
-    def wait_until_deleted(self, instance_id, timeout):
-        """
-        Waits until the instance is deleted or until the timeout. Returns True if
-        the instance is successfully deleted, else returns False
-        """
-
-        time_counter = 0
-        vpc = self.get_instance_status(instance_id)
-        while vpc is not None:
-            vpc = self.get_instance_status(instance_id)
-            logging.info(
-                "Instance %s is still being deleted, sleeping for 5 seconds"
-                % instance_id
-            )
-            time.sleep(5)
-            time_counter += 5
-            if time_counter >= timeout:
-                logging.info(
-                    "Instance %s is still not deleted in allotted time" % instance_id
-                )
-                return False
-        return True
-
-    def wait_until_running(self, instance_id, timeout):
-        """
-        Waits until the Instance switches to running state or until the timeout.
-        Returns True if the Instance switches to running, else returns False
-        """
-
-        time_counter = 0
-        status = self.get_instance_status(instance_id)
-        while status != "running":
-            status = self.get_instance_status(instance_id)
-            logging.info(
-                "Instance %s is still not running, sleeping for 5 seconds" % instance_id
-            )
-            time.sleep(5)
-            time_counter += 5
-            if time_counter >= timeout:
-                logging.info(
-                    "Instance %s is still not ready in allotted time" % instance_id
-                )
-                return False
-        return True
-
-    def wait_until_stopped(self, instance_id, timeout):
-        """
-        Waits until the Instance switches to stopped state or until the timeout.
-        Returns True if the Instance switches to stopped, else returns False
-        """
-
-        time_counter = 0
-        status = self.get_instance_status(instance_id)
-        while status != "stopped":
-            status = self.get_instance_status(instance_id)
-            logging.info(
-                "Instance %s is still not stopped, sleeping for 5 seconds" % instance_id
-            )
-            time.sleep(5)
-            time_counter += 5
-            if time_counter >= timeout:
-                logging.info(
-                    "Instance %s is still not stopped in allotted time" % instance_id
-                )
-                return False
-        return True
-
-    def wait_until_rebooted(self, instance_id, timeout):
-        """
-        Waits until the Instance switches to restarting state and then running state or until the timeout.
-        Returns True if the Instance switches back to running, else returns False
-        """
-
-        time_counter = 0
-        status = self.get_instance_status(instance_id)
-        while status == "starting":
-            status = self.get_instance_status(instance_id)
-            logging.info(
-                "Instance %s is still restarting, sleeping for 5 seconds" % instance_id
-            )
-            time.sleep(5)
-            time_counter += 5
-            if time_counter >= timeout:
-                logging.info(
-                    "Instance %s is still restarting after allotted time" % instance_id
-                )
-                return False
-        self.wait_until_running(instance_id, timeout)
-        return True
-
-
-@dataclass
-class Node:
-    name: str
-
-
-@dataclass
-class NodeScenarioSuccessOutput:
-
-    nodes: typing.Dict[int, Node] = field(
-        metadata={
-            "name": "Nodes started/stopped/terminated/rebooted",
-            "description": """Map between timestamps and the pods started/stopped/terminated/rebooted.
-                        The timestamp is provided in nanoseconds""",
-        }
-    )
-    action: kube_helper.Actions = field(
-        metadata={
-            "name": "The action performed on the node",
-            "description": """The action performed or attempted to be performed on the node. Possible values
-                        are : Start, Stop, Terminate, Reboot""",
-        }
-    )
-
-
-@dataclass
-class NodeScenarioErrorOutput:
-
-    error: str
-    action: kube_helper.Actions = field(
-        metadata={
-            "name": "The action performed on the node",
-            "description": """The action attempted to be performed on the node. Possible values are : Start
-                        Stop, Terminate, Reboot""",
-        }
-    )
-
-
-@dataclass
-class NodeScenarioConfig:
-
-    name: typing.Annotated[
-        typing.Optional[str],
-        validation.required_if_not("label_selector"),
-        validation.required_if("skip_openshift_checks"),
-    ] = field(
-        default=None,
-        metadata={
-            "name": "Name",
-            "description": "Name(s) for target nodes. Required if label_selector is not set.",
-        },
-    )
-
-    runs: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
-        default=1,
-        metadata={
-            "name": "Number of runs per node",
-            "description": "Number of times to inject each scenario under actions (will perform on same node each time)",
-        },
-    )
-
-    label_selector: typing.Annotated[
-        typing.Optional[str], validation.min(1), validation.required_if_not("name")
-    ] = field(
-        default=None,
-        metadata={
-            "name": "Label selector",
-            "description": "Kubernetes label selector for the target nodes. Required if name is not set.\n"
-            "See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details.",
-        },
-    )
-
-    timeout: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
-        default=180,
-        metadata={
-            "name": "Timeout",
-            "description": "Timeout to wait for the target pod(s) to be removed in seconds.",
-        },
-    )
-
-    instance_count: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
-        default=1,
-        metadata={
-            "name": "Instance Count",
-            "description": "Number of nodes to perform action/select that match the label selector.",
-        },
-    )
-
-    skip_openshift_checks: typing.Optional[bool] = field(
-        default=False,
-        metadata={
-            "name": "Skip Openshift Checks",
-            "description": "Skip checking the status of the openshift nodes.",
-        },
-    )
-
-    kubeconfig_path: typing.Optional[str] = field(
-        default=None,
-        metadata={
-            "name": "Kubeconfig path",
-            "description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
-            "See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for "
-            "details.",
-        },
-    )
-
-
-@plugin.step(
-    id="ibmcloud-node-start",
-    name="Start the node",
-    description="Start the node(s) by starting the Ibmcloud Instance on which the node is configured",
-    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
-)
-def node_start(
-    cfg: NodeScenarioConfig,
-) -> typing.Tuple[
-    str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
-]:
-    with kube_helper.setup_kubernetes(None) as cli:
-        ibmcloud = IbmCloud()
-        core_v1 = client.CoreV1Api(cli)
-        watch_resource = watch.Watch()
-        node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.START, core_v1)
-        node_name_id_list = ibmcloud.list_instances()
-        nodes_started = {}
-        for name in node_list:
-            try:
-                for _ in range(cfg.runs):
-                    logging.info("Starting node_start_scenario injection")
-                    logging.info("Starting the node %s " % (name))
-                    instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
-                    if instance_id:
-                        vm_started = ibmcloud.start_instances(instance_id)
-                        if vm_started:
-                            ibmcloud.wait_until_running(instance_id, cfg.timeout)
-                            if not cfg.skip_openshift_checks:
-                                kube_helper.wait_for_ready_status(
-                                    name, cfg.timeout, watch_resource, core_v1
-                                )
-                            nodes_started[int(time.time_ns())] = Node(name=name)
-                        logging.info(
-                            "Node with instance ID: %s is in running state" % name
-                        )
-                        logging.info(
-                            "node_start_scenario has been successfully injected!"
-                        )
-                    else:
-                        logging.error(
-                            "Failed to find node that matched instances on ibm cloud in region"
-                        )
-                        return "error", NodeScenarioErrorOutput(
-                            "No matching vpc with node name " + name,
-                            kube_helper.Actions.START,
-                        )
-            except Exception as e:
-                logging.error("Failed to start node instance. Test Failed")
-                logging.error("node_start_scenario injection failed!")
-                return "error", NodeScenarioErrorOutput(
-                    format_exc(), kube_helper.Actions.START
-                )
-
-    return "success", NodeScenarioSuccessOutput(
-        nodes_started, kube_helper.Actions.START
-    )
-
-
-@plugin.step(
-    id="ibmcloud-node-stop",
-    name="Stop the node",
-    description="Stop the node(s) by starting the Ibmcloud Instance on which the node is configured",
-    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
-)
-def node_stop(
-    cfg: NodeScenarioConfig,
-) -> typing.Tuple[
-    str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
-]:
-    with kube_helper.setup_kubernetes(None) as cli:
-        ibmcloud = IbmCloud()
-        core_v1 = client.CoreV1Api(cli)
-        watch_resource = watch.Watch()
-        logging.info("set up done")
-        node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.STOP, core_v1)
-        logging.info("set node list" + str(node_list))
-        node_name_id_list = ibmcloud.list_instances()
-        logging.info("node names" + str(node_name_id_list))
-        nodes_stopped = {}
-        for name in node_list:
-            try:
-                for _ in range(cfg.runs):
-                    logging.info("Starting node_stop_scenario injection")
-                    logging.info("Stopping the node %s " % (name))
-                    instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
-                    if instance_id:
-                        vm_stopped = ibmcloud.stop_instances(instance_id)
-                        if vm_stopped:
-                            ibmcloud.wait_until_stopped(instance_id, cfg.timeout)
-                            if not cfg.skip_openshift_checks:
-                                kube_helper.wait_for_ready_status(
-                                    name, cfg.timeout, watch_resource, core_v1
-                                )
-                            nodes_stopped[int(time.time_ns())] = Node(name=name)
-                        logging.info(
-                            "Node with instance ID: %s is in stopped state" % name
-                        )
-                        logging.info(
-                            "node_stop_scenario has been successfully injected!"
-                        )
-                    else:
-                        logging.error(
-                            "Failed to find node that matched instances on ibm cloud in region"
-                        )
-                        return "error", NodeScenarioErrorOutput(
-                            "No matching vpc with node name " + name,
-                            kube_helper.Actions.STOP,
-                        )
-            except Exception as e:
-                logging.error("Failed to stop node instance. Test Failed")
-                logging.error("node_stop_scenario injection failed!")
-                return "error", NodeScenarioErrorOutput(
-                    format_exc(), kube_helper.Actions.STOP
-                )
-
-        return "success", NodeScenarioSuccessOutput(
-            nodes_stopped, kube_helper.Actions.STOP
-        )
-
-
-@plugin.step(
-    id="ibmcloud-node-reboot",
-    name="Reboot Ibmcloud Instance",
-    description="Reboot the node(s) by starting the Ibmcloud Instance on which the node is configured",
-    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
-)
-def node_reboot(
-    cfg: NodeScenarioConfig,
-) -> typing.Tuple[
-    str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
-]:
-    with kube_helper.setup_kubernetes(None) as cli:
-        ibmcloud = IbmCloud()
-        core_v1 = client.CoreV1Api(cli)
-        watch_resource = watch.Watch()
-        node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.REBOOT, core_v1)
-        node_name_id_list = ibmcloud.list_instances()
-        nodes_rebooted = {}
-        for name in node_list:
-            try:
-                for _ in range(cfg.runs):
-                    logging.info("Starting node_reboot_scenario injection")
-                    logging.info("Rebooting the node %s " % (name))
-                    instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
-                    if instance_id:
-                        ibmcloud.reboot_instances(instance_id)
-                        ibmcloud.wait_until_rebooted(instance_id, cfg.timeout)
-                        if not cfg.skip_openshift_checks:
-                            kube_helper.wait_for_unknown_status(
-                                name, cfg.timeout, watch_resource, core_v1
-                            )
-                            kube_helper.wait_for_ready_status(
-                                name, cfg.timeout, watch_resource, core_v1
-                            )
-                        nodes_rebooted[int(time.time_ns())] = Node(name=name)
-                        logging.info(
-                            "Node with instance ID: %s has rebooted successfully" % name
-                        )
-                        logging.info(
-                            "node_reboot_scenario has been successfully injected!"
-                        )
-                    else:
-                        logging.error(
-                            "Failed to find node that matched instances on ibm cloud in region"
-                        )
-                        return "error", NodeScenarioErrorOutput(
-                            "No matching vpc with node name " + name,
-                            kube_helper.Actions.REBOOT,
-                        )
-            except Exception as e:
-                logging.error("Failed to reboot node instance. Test Failed")
-                logging.error("node_reboot_scenario injection failed!")
-                return "error", NodeScenarioErrorOutput(
-                    format_exc(), kube_helper.Actions.REBOOT
-                )
-
-    return "success", NodeScenarioSuccessOutput(
-        nodes_rebooted, kube_helper.Actions.REBOOT
-    )
-
-
-@plugin.step(
-    id="ibmcloud-node-terminate",
-    name="Reboot Ibmcloud Instance",
-    description="Wait for node to be deleted",
-    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
-)
-def node_terminate(
-    cfg: NodeScenarioConfig,
-) -> typing.Tuple[
-    str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
-]:
-    with kube_helper.setup_kubernetes(None) as cli:
-        ibmcloud = IbmCloud()
-        core_v1 = client.CoreV1Api(cli)
-        node_list = kube_helper.get_node_list(
-            cfg, kube_helper.Actions.TERMINATE, core_v1
-        )
-        node_name_id_list = ibmcloud.list_instances()
-        nodes_terminated = {}
-        for name in node_list:
-            try:
-                for _ in range(cfg.runs):
-                    logging.info(
-                        "Starting node_termination_scenario injection by first stopping the node"
-                    )
-                    instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
-                    logging.info("Deleting the node with instance ID: %s " % (name))
-                    if instance_id:
-                        ibmcloud.delete_instance(instance_id)
-                        ibmcloud.wait_until_released(name, cfg.timeout)
-                        nodes_terminated[int(time.time_ns())] = Node(name=name)
-                        logging.info(
-                            "Node with instance ID: %s has been released" % name
-                        )
-                        logging.info(
-                            "node_terminate_scenario has been successfully injected!"
-                        )
-                    else:
-                        logging.error(
-                            "Failed to find instances that matched the node specifications on ibm cloud in the set region"
-                        )
-                        return "error", NodeScenarioErrorOutput(
-                            "No matching vpc with node name " + name,
-                            kube_helper.Actions.TERMINATE,
-                        )
-            except Exception as e:
-                logging.error("Failed to terminate node instance. Test Failed")
-                logging.error("node_terminate_scenario injection failed!")
-                return "error", NodeScenarioErrorOutput(
-                    format_exc(), kube_helper.Actions.TERMINATE
-                )
-
-    return "success", NodeScenarioSuccessOutput(
-        nodes_terminated, kube_helper.Actions.TERMINATE
-    )
--- a/krkn/scenario_plugins/native/node_scenarios/kubernetes_functions.py
+++ b/krkn/scenario_plugins/native/node_scenarios/kubernetes_functions.py
@@ -1,179 +0,0 @@
-from kubernetes import config, client
-from kubernetes.client.rest import ApiException
-import logging
-import random
-from enum import Enum
-
-
-class Actions(Enum):
-    """
-    This enumeration indicates different kinds of node operations
-    """
-
-    START = "Start"
-    STOP = "Stop"
-    TERMINATE = "Terminate"
-    REBOOT = "Reboot"
-
-
-def setup_kubernetes(kubeconfig_path):
-    """
-    Sets up the Kubernetes client
-    """
-
-    if kubeconfig_path is None:
-        kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
-    kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
-
-    if kubeconfig.config is None:
-        raise Exception(
-            "Invalid kube-config file: %s. " "No configuration found." % kubeconfig_path
-        )
-    loader = config.kube_config.KubeConfigLoader(
-        config_dict=kubeconfig.config,
-    )
-    client_config = client.Configuration()
-    loader.load_and_set(client_config)
-    return client.ApiClient(configuration=client_config)
-
-
-def list_killable_nodes(core_v1, label_selector=None):
-    """
-    Returns a list of nodes that can be stopped/reset/released
-    """
-
-    nodes = []
-    try:
-        if label_selector:
-            ret = core_v1.list_node(pretty=True, label_selector=label_selector)
-        else:
-            ret = core_v1.list_node(pretty=True)
-    except ApiException as e:
-        logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
-        raise e
-    for node in ret.items:
-        for cond in node.status.conditions:
-            if str(cond.type) == "Ready" and str(cond.status) == "True":
-                nodes.append(node.metadata.name)
-    return nodes
-
-
-def list_startable_nodes(core_v1, label_selector=None):
-    """
-    Returns a list of nodes that can be started
-    """
-
-    nodes = []
-    try:
-        if label_selector:
-            ret = core_v1.list_node(pretty=True, label_selector=label_selector)
-        else:
-            ret = core_v1.list_node(pretty=True)
-    except ApiException as e:
-        logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
-        raise e
-    for node in ret.items:
-        for cond in node.status.conditions:
-            if str(cond.type) == "Ready" and str(cond.status) != "True":
-                nodes.append(node.metadata.name)
-    return nodes
-
-
-def get_node_list(cfg, action, core_v1):
-    """
-    Returns a list of nodes to be used in the node scenarios. The list returned is constructed as follows:
-        - If the key 'name' is present in the node scenario config, the value is extracted and split into
-          a list
-        - Each node in the list is fed to the get_node function which checks if the node is killable or
-          fetches the node using the label selector
-    """
-
-    def get_node(node_name, label_selector, instance_kill_count, action, core_v1):
-        list_nodes_func = (
-            list_startable_nodes if action == Actions.START else list_killable_nodes
-        )
-        if node_name in list_nodes_func(core_v1):
-            return [node_name]
-        elif node_name:
-            logging.info(
-                "Node with provided node_name does not exist or the node might "
-                "be in NotReady state."
-            )
-        nodes = list_nodes_func(core_v1, label_selector)
-        if not nodes:
-            raise Exception("Ready nodes with the provided label selector do not exist")
-        logging.info(
-            "Ready nodes with the label selector %s: %s" % (label_selector, nodes)
-        )
-        number_of_nodes = len(nodes)
-        if instance_kill_count == number_of_nodes:
-            return nodes
-        nodes_to_return = []
-        for i in range(instance_kill_count):
-            node_to_add = nodes[random.randint(0, len(nodes) - 1)]
-            nodes_to_return.append(node_to_add)
-            nodes.remove(node_to_add)
-        return nodes_to_return
-
-    if cfg.name:
-        input_nodes = cfg.name.split(",")
-    else:
-        input_nodes = [""]
-    scenario_nodes = set()
-
-    if cfg.skip_openshift_checks:
-        scenario_nodes = input_nodes
-    else:
-        for node in input_nodes:
-            nodes = get_node(
-                node, cfg.label_selector, cfg.instance_count, action, core_v1
-            )
-            scenario_nodes.update(nodes)
-
-    return list(scenario_nodes)
-
-
-def watch_node_status(node, status, timeout, watch_resource, core_v1):
-    """
-    Monitor the status of a node for change
-    """
-    count = timeout
-    for event in watch_resource.stream(
-        core_v1.list_node,
-        field_selector=f"metadata.name={node}",
-        timeout_seconds=timeout,
-    ):
-        conditions = [
-            status
-            for status in event["object"].status.conditions
-            if status.type == "Ready"
-        ]
-        if conditions[0].status == status:
-            watch_resource.stop()
-            break
-        else:
-            count -= 1
-            logging.info("Status of node " + node + ": " + str(conditions[0].status))
-        if not count:
-            watch_resource.stop()
-
-
-def wait_for_ready_status(node, timeout, watch_resource, core_v1):
-    """
-    Wait until the node status becomes Ready
-    """
-    watch_node_status(node, "True", timeout, watch_resource, core_v1)
-
-
-def wait_for_not_ready_status(node, timeout, watch_resource, core_v1):
-    """
-    Wait until the node status becomes Not Ready
-    """
-    watch_node_status(node, "False", timeout, watch_resource, core_v1)
-
-
-def wait_for_unknown_status(node, timeout, watch_resource, core_v1):
-    """
-    Wait until the node status becomes Unknown
-    """
-    watch_node_status(node, "Unknown", timeout, watch_resource, core_v1)
--- a/krkn/scenario_plugins/native/plugins.py
+++ b/krkn/scenario_plugins/native/plugins.py
@@ -12,7 +12,6 @@ from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin i
 from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin import (
    pod_egress_shaping,
 )
-import krkn.scenario_plugins.native.node_scenarios.ibmcloud_plugin as ibmcloud_plugin
 from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin import (
    pod_ingress_shaping,
 )
@@ -157,10 +156,6 @@ PLUGINS = Plugins(
        ),
        PluginStep(wait_for_pods, ["error"]),
        PluginStep(run_python_file, ["error"]),
-        PluginStep(ibmcloud_plugin.node_start, ["error"]),
-        PluginStep(ibmcloud_plugin.node_stop, ["error"]),
-        PluginStep(ibmcloud_plugin.node_reboot, ["error"]),
-        PluginStep(ibmcloud_plugin.node_terminate, ["error"]),
        PluginStep(network_chaos, ["error"]),
        PluginStep(pod_outage, ["error"]),
        PluginStep(pod_egress_shaping, ["error"]),
--- a/krkn/scenario_plugins/node_actions/alibaba_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/alibaba_node_scenarios.py
@@ -239,6 +239,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_start_scenario injection")
                vm_id = self.alibaba.get_instance_id(node)
+                affected_node.node_id = vm_id
                logging.info(
                    "Starting the node %s with instance ID: %s " % (node, vm_id)
                )
@@ -263,6 +264,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_stop_scenario injection")
                vm_id = self.alibaba.get_instance_id(node)
+                affected_node.node_id = vm_id
                logging.info(
                    "Stopping the node %s with instance ID: %s " % (node, vm_id)
                )
@@ -289,6 +291,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
                    "Starting node_termination_scenario injection by first stopping instance"
                )
                vm_id = self.alibaba.get_instance_id(node)
+                affected_node.node_id = vm_id
                self.alibaba.stop_instances(vm_id)
                self.alibaba.wait_until_stopped(vm_id, timeout, affected_node)
                logging.info(
@@ -316,6 +319,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_reboot_scenario injection")
                instance_id = self.alibaba.get_instance_id(node)
+                affected_node.node_id = instance_id
                logging.info("Rebooting the node with instance ID: %s " % (instance_id))
                self.alibaba.reboot_instances(instance_id)
                nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
--- a/krkn/scenario_plugins/node_actions/aws_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/aws_node_scenarios.py
@@ -272,6 +272,7 @@ class aws_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_start_scenario injection")
                instance_id = self.aws.get_instance_id(node)
+                affected_node.node_id = instance_id
                logging.info(
                    "Starting the node %s with instance ID: %s " % (node, instance_id)
                )
@@ -299,6 +300,7 @@ class aws_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_stop_scenario injection")
                instance_id = self.aws.get_instance_id(node)
+                affected_node.node_id = instance_id
                logging.info(
                    "Stopping the node %s with instance ID: %s " % (node, instance_id)
                )
@@ -325,6 +327,7 @@ class aws_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_termination_scenario injection")
                instance_id = self.aws.get_instance_id(node)
+                affected_node.node_id = instance_id
                logging.info(
                    "Terminating the node %s with instance ID: %s "
                    % (node, instance_id)
@@ -358,6 +361,7 @@ class aws_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_reboot_scenario injection" + str(node))
                instance_id = self.aws.get_instance_id(node)
+                affected_node.node_id = instance_id
                logging.info(
                    "Rebooting the node %s with instance ID: %s " % (node, instance_id)
                )
--- a/krkn/scenario_plugins/node_actions/az_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/az_node_scenarios.py
@@ -170,7 +170,7 @@ class azure_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_start_scenario injection")
                vm_name, resource_group = self.azure.get_instance_id(node)
-                
+                affected_node.node_id = vm_name
                logging.info(
                    "Starting the node %s with instance ID: %s "
                    % (vm_name, resource_group)
@@ -197,6 +197,7 @@ class azure_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_stop_scenario injection")
                vm_name, resource_group = self.azure.get_instance_id(node)
+                affected_node.node_id = vm_name
                logging.info(
                    "Stopping the node %s with instance ID: %s "
                    % (vm_name, resource_group)
@@ -221,8 +222,8 @@ class azure_node_scenarios(abstract_node_scenarios):
            affected_node = AffectedNode(node)
            try:
                logging.info("Starting node_termination_scenario injection")
-                affected_node = AffectedNode(node)
                vm_name, resource_group = self.azure.get_instance_id(node)
+                affected_node.node_id = vm_name
                logging.info(
                    "Terminating the node %s with instance ID: %s "
                    % (vm_name, resource_group)
@@ -257,6 +258,7 @@ class azure_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_reboot_scenario injection")
                vm_name, resource_group = self.azure.get_instance_id(node)
+                affected_node.node_id = vm_name
                logging.info(
                    "Rebooting the node %s with instance ID: %s "
                    % (vm_name, resource_group)
--- a/krkn/scenario_plugins/node_actions/bm_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/bm_node_scenarios.py
@@ -109,20 +109,28 @@ class BM:
        self.get_ipmi_connection(bmc_addr, node_name).chassis_control_power_cycle()

    # Wait until the node instance is running
-    def wait_until_running(self, bmc_addr, node_name):
+    def wait_until_running(self, bmc_addr, node_name, affected_node):
+        start_time = time.time()
        while (
            not self.get_ipmi_connection(bmc_addr, node_name)
            .get_chassis_status()
            .power_on
        ):
            time.sleep(1)
+        end_time = time.time()
+        if affected_node:
+            affected_node.set_affected_node_status("running", end_time - start_time)

    # Wait until the node instance is stopped
-    def wait_until_stopped(self, bmc_addr, node_name):
+    def wait_until_stopped(self, bmc_addr, node_name, affected_node):
+        start_time = time.time()
        while (
            self.get_ipmi_connection(bmc_addr, node_name).get_chassis_status().power_on
        ):
            time.sleep(1)
+        end_time = time.time()
+        if affected_node:
+            affected_node.set_affected_node_status("stopped", end_time - start_time)


 # krkn_lib
@@ -134,15 +142,17 @@ class bm_node_scenarios(abstract_node_scenarios):
    # Node scenario to start the node
    def node_start_scenario(self, instance_kill_count, node, timeout):
        for _ in range(instance_kill_count):
+            affected_node = AffectedNode(node)
            try:
                logging.info("Starting node_start_scenario injection")
                bmc_addr = self.bm.get_bmc_addr(node)
+                affected_node.node_id = bmc_addr
                logging.info(
                    "Starting the node %s with bmc address: %s " % (node, bmc_addr)
                )
                self.bm.start_instances(bmc_addr, node)
-                self.bm.wait_until_running(bmc_addr, node)
-                nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
+                self.bm.wait_until_running(bmc_addr, node, affected_node)
+                nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
                logging.info(
                    "Node with bmc address: %s is in running state" % (bmc_addr)
                )
@@ -155,6 +165,7 @@ class bm_node_scenarios(abstract_node_scenarios):
                )
                logging.error("node_start_scenario injection failed!")
                raise e
+            self.affected_nodes_status.affected_nodes.append(affected_node)

    # Node scenario to stop the node
    def node_stop_scenario(self, instance_kill_count, node, timeout):
@@ -163,6 +174,7 @@ class bm_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_stop_scenario injection")
                bmc_addr = self.bm.get_bmc_addr(node)
+                affected_node.node_id = bmc_addr
                logging.info(
                    "Stopping the node %s with bmc address: %s " % (node, bmc_addr)
                )
--- a/krkn/scenario_plugins/node_actions/docker_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/docker_node_scenarios.py
@@ -49,6 +49,7 @@ class docker_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_start_scenario injection")
                container_id = self.docker.get_container_id(node)
+                affected_node.node_id = container_id
                logging.info(
                    "Starting the node %s with container ID: %s " % (node, container_id)
                )
@@ -74,6 +75,7 @@ class docker_node_scenarios(abstract_node_scenarios):
            try:
                logging.info("Starting node_stop_scenario injection")
                container_id = self.docker.get_container_id(node)
+                affected_node.node_id = container_id
                logging.info(
                    "Stopping the node %s with container ID: %s " % (node, container_id)
                )
--- a/krkn/scenario_plugins/node_actions/gcp_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/gcp_node_scenarios.py
@@ -234,6 +234,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
                logging.info("Starting node_start_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
+                affected_node.node_id = instance_id
                logging.info(
                    "Starting the node %s with instance ID: %s " % (node, instance_id)
                )
@@ -252,7 +253,6 @@ class gcp_node_scenarios(abstract_node_scenarios):
                logging.error("node_start_scenario injection failed!")

                raise RuntimeError()
-            logging.info("started affected node" + str(affected_node.to_json()))
            self.affected_nodes_status.affected_nodes.append(affected_node)

    # Node scenario to stop the node
@@ -263,6 +263,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
                logging.info("Starting node_stop_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
+                affected_node.node_id = instance_id
                logging.info(
                    "Stopping the node %s with instance ID: %s " % (node, instance_id)
                )
@@ -280,7 +281,6 @@ class gcp_node_scenarios(abstract_node_scenarios):
                logging.error("node_stop_scenario injection failed!")

                raise RuntimeError()
-            logging.info("stopedd affected node" + str(affected_node.to_json()))
            self.affected_nodes_status.affected_nodes.append(affected_node)

    # Node scenario to terminate the node
@@ -291,6 +291,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
                logging.info("Starting node_termination_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
+                affected_node.node_id = instance_id
                logging.info(
                    "Terminating the node %s with instance ID: %s "
                    % (node, instance_id)
@@ -325,6 +326,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
                logging.info("Starting node_reboot_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
+                affected_node.node_id = instance_id
                logging.info(
                    "Rebooting the node %s with instance ID: %s " % (node, instance_id)
                )
--- a/krkn/scenario_plugins/node_actions/ibmcloud_node_scenarios.py
+++ b/krkn/scenario_plugins/node_actions/ibmcloud_node_scenarios.py
@@ -0,0 +1,367 @@
+#!/usr/bin/env python
+import time
+import typing
+from os import environ
+from dataclasses import dataclass, field
+from traceback import format_exc
+import logging
+
+from krkn_lib.k8s import KrknKubernetes
+import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
+from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
+    abstract_node_scenarios,
+)
+from kubernetes import client, watch
+from ibm_vpc import VpcV1
+from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
+import sys
+
+from krkn_lib.models.k8s import AffectedNodeStatus, AffectedNode
+
+
+class IbmCloud:
+    def __init__(self):
+        """
+        Initialize the ibm cloud client by using the the env variables:
+            'IBMC_APIKEY' 'IBMC_URL'
+        """
+        apiKey = environ.get("IBMC_APIKEY")
+        service_url = environ.get("IBMC_URL")
+        if not apiKey:
+            raise Exception("Environmental variable 'IBMC_APIKEY' is not set")
+        if not service_url:
+            raise Exception("Environmental variable 'IBMC_URL' is not set")
+        try:
+            authenticator = IAMAuthenticator(apiKey)
+            self.service = VpcV1(authenticator=authenticator)
+
+            self.service.set_service_url(service_url)
+        except Exception as e:
+            logging.error("error authenticating" + str(e))
+
+
+    # Get the instance ID of the node
+    def get_instance_id(self, node_name):
+        node_list = self.list_instances()
+        for node in node_list:
+            if node_name == node["vpc_name"]:
+                return node["vpc_id"]
+        logging.error("Couldn't find node with name " + str(node_name) + ", you could try another region")
+        sys.exit(1)
+
+    def delete_instance(self, instance_id):
+        """
+        Deletes the Instance whose name is given by 'instance_id'
+        """
+        try:
+            self.service.delete_instance(instance_id)
+            logging.info("Deleted Instance -- '{}'".format(instance_id))
+        except Exception as e:
+            logging.info("Instance '{}' could not be deleted. ".format(instance_id))
+            return False
+
+    def reboot_instances(self, instance_id):
+        """
+        Reboots the Instance whose name is given by 'instance_id'. Returns True if successful, or
+        returns False if the Instance is not powered on
+        """
+
+        try:
+            self.service.create_instance_action(
+                instance_id,
+                type="reboot",
+            )
+            logging.info("Reset Instance -- '{}'".format(instance_id))
+            return True
+        except Exception as e:
+            logging.info("Instance '{}' could not be rebooted".format(instance_id))
+            return False
+
+    def stop_instances(self, instance_id):
+        """
+        Stops the Instance whose name is given by 'instance_id'. Returns True if successful, or
+        returns False if the Instance is already stopped
+        """
+
+        try:
+            self.service.create_instance_action(
+                instance_id,
+                type="stop",
+            )
+            logging.info("Stopped Instance -- '{}'".format(instance_id))
+            return True
+        except Exception as e:
+            logging.info("Instance '{}' could not be stopped".format(instance_id))
+            logging.info("error" + str(e))
+            return False
+
+    def start_instances(self, instance_id):
+        """
+        Stops the Instance whose name is given by 'instance_id'. Returns True if successful, or
+        returns False if the Instance is already running
+        """
+
+        try:
+            self.service.create_instance_action(
+                instance_id,
+                type="start",
+            )
+            logging.info("Started Instance -- '{}'".format(instance_id))
+            return True
+        except Exception as e:
+            logging.info("Instance '{}' could not start running".format(instance_id))
+            return False
+
+    def list_instances(self):
+        """
+        Returns a list of Instances present in the datacenter
+        """
+        instance_names = []
+        try:
+            instances_result = self.service.list_instances().get_result()
+            instances_list = instances_result["instances"]
+            for vpc in instances_list:
+                instance_names.append({"vpc_name": vpc["name"], "vpc_id": vpc["id"]})
+            starting_count = instances_result["total_count"]
+            while instances_result["total_count"] == instances_result["limit"]:
+                instances_result = self.service.list_instances(
+                    start=starting_count
+                ).get_result()
+                instances_list = instances_result["instances"]
+                starting_count += instances_result["total_count"]
+                for vpc in instances_list:
+                    instance_names.append({"vpc_name": vpc.name, "vpc_id": vpc.id})
+        except Exception as e:
+            logging.error("Error listing out instances: " + str(e))
+            sys.exit(1)
+        return instance_names
+
+    def find_id_in_list(self, name, vpc_list):
+        for vpc in vpc_list:
+            if vpc["vpc_name"] == name:
+                return vpc["vpc_id"]
+
+    def get_instance_status(self, instance_id):
+        """
+        Returns the status of the Instance whose name is given by 'instance_id'
+        """
+
+        try:
+            instance = self.service.get_instance(instance_id).get_result()
+            state = instance["status"]
+            return state
+        except Exception as e:
+            logging.error(
+                "Failed to get node instance status %s. Encountered following "
+                "exception: %s." % (instance_id, e)
+            )
+            return None
+
+    def wait_until_deleted(self, instance_id, timeout, affected_node=None):
+        """
+        Waits until the instance is deleted or until the timeout. Returns True if
+        the instance is successfully deleted, else returns False
+        """
+        start_time = time.time()
+        time_counter = 0
+        vpc = self.get_instance_status(instance_id)
+        while vpc is not None:
+            vpc = self.get_instance_status(instance_id)
+            logging.info(
+                "Instance %s is still being deleted, sleeping for 5 seconds"
+                % instance_id
+            )
+            time.sleep(5)
+            time_counter += 5
+            if time_counter >= timeout:
+                logging.info(
+                    "Instance %s is still not deleted in allotted time" % instance_id
+                )
+                return False
+        end_time = time.time()
+        if affected_node:
+            affected_node.set_affected_node_status("terminated", end_time - start_time)
+        return True
+
+    def wait_until_running(self, instance_id, timeout, affected_node=None):
+        """
+        Waits until the Instance switches to running state or until the timeout.
+        Returns True if the Instance switches to running, else returns False
+        """
+        start_time = time.time()
+        time_counter = 0
+        status = self.get_instance_status(instance_id)
+        while status != "running":
+            status = self.get_instance_status(instance_id)
+            logging.info(
+                "Instance %s is still not running, sleeping for 5 seconds" % instance_id
+            )
+            time.sleep(5)
+            time_counter += 5
+            if time_counter >= timeout:
+                logging.info(
+                    "Instance %s is still not ready in allotted time" % instance_id
+                )
+                return False
+        end_time = time.time()
+        if affected_node:
+            affected_node.set_affected_node_status("running", end_time - start_time)
+        return True
+
+    def wait_until_stopped(self, instance_id, timeout, affected_node):
+        """
+        Waits until the Instance switches to stopped state or until the timeout.
+        Returns True if the Instance switches to stopped, else returns False
+        """
+        start_time = time.time()
+        time_counter = 0
+        status = self.get_instance_status(instance_id)
+        while status != "stopped":
+            status = self.get_instance_status(instance_id)
+            logging.info(
+                "Instance %s is still not stopped, sleeping for 5 seconds" % instance_id
+            )
+            time.sleep(5)
+            time_counter += 5
+            if time_counter >= timeout:
+                logging.info(
+                    "Instance %s is still not stopped in allotted time" % instance_id
+                )
+                return False
+        end_time = time.time()
+        if affected_node:
+            affected_node.set_affected_node_status("stopped", end_time - start_time)
+        return True
+
+
+    def wait_until_rebooted(self, instance_id, timeout, affected_node):
+        """
+        Waits until the Instance switches to restarting state and then running state or until the timeout.
+        Returns True if the Instance switches back to running, else returns False
+        """
+
+        time_counter = 0
+        status = self.get_instance_status(instance_id)
+        while status == "starting":
+            status = self.get_instance_status(instance_id)
+            logging.info(
+                "Instance %s is still restarting, sleeping for 5 seconds" % instance_id
+            )
+            time.sleep(5)
+            time_counter += 5
+            if time_counter >= timeout:
+                logging.info(
+                    "Instance %s is still restarting after allotted time" % instance_id
+                )
+                return False
+        self.wait_until_running(instance_id, timeout, affected_node)
+        return True
+
+
+@dataclass
+class ibm_node_scenarios(abstract_node_scenarios):
+    def __init__(self, kubecli: KrknKubernetes, affected_nodes_status: AffectedNodeStatus):
+        super().__init__(kubecli, affected_nodes_status)
+        self.ibmcloud = IbmCloud()
+
+    def node_start_scenario(self, instance_kill_count, node, timeout):
+        try:
+            instance_id = self.ibmcloud.get_instance_id( node)
+            affected_node = AffectedNode(node, node_id=instance_id)
+            for _ in range(instance_kill_count):
+                logging.info("Starting node_start_scenario injection")
+                logging.info("Starting the node %s " % (node))
+                
+                if instance_id:
+                    vm_started = self.ibmcloud.start_instances(instance_id)
+                    if vm_started:
+                        self.ibmcloud.wait_until_running(instance_id, timeout, affected_node)
+                        nodeaction.wait_for_ready_status(
+                            node, timeout, self.kubecli, affected_node
+                        )
+                    logging.info(
+                        "Node with instance ID: %s is in running state" % node
+                    )
+                    logging.info(
+                        "node_start_scenario has been successfully injected!"
+                    )
+                else:
+                    logging.error(
+                        "Failed to find node that matched instances on ibm cloud in region"
+                    )
+
+        except Exception as e:
+            logging.error("Failed to start node instance. Test Failed")
+            logging.error("node_start_scenario injection failed!")
+        self.affected_nodes_status.affected_nodes.append(affected_node)
+
+
+    def node_stop_scenario(self, instance_kill_count, node, timeout):
+        try:
+            instance_id = self.ibmcloud.get_instance_id(node)
+            for _ in range(instance_kill_count):
+                affected_node = AffectedNode(node, instance_id)
+                logging.info("Starting node_stop_scenario injection")
+                logging.info("Stopping the node %s " % (node))
+                vm_stopped = self.ibmcloud.stop_instances(instance_id)
+                if vm_stopped:
+                    self.ibmcloud.wait_until_stopped(instance_id, timeout, affected_node)
+                logging.info(
+                    "Node with instance ID: %s is in stopped state" % node
+                )
+                logging.info(
+                    "node_stop_scenario has been successfully injected!"
+                )
+        except Exception as e:
+            logging.error("Failed to stop node instance. Test Failed")
+            logging.error("node_stop_scenario injection failed!")
+
+
+    def node_reboot_scenario(self, instance_kill_count, node, timeout):
+        try:
+            instance_id = self.ibmcloud.get_instance_id(node)
+            for _ in range(instance_kill_count):
+                affected_node = AffectedNode(node, node_id=instance_id)
+                logging.info("Starting node_reboot_scenario injection")
+                logging.info("Rebooting the node %s " % (node))
+                self.ibmcloud.reboot_instances(instance_id)
+                self.ibmcloud.wait_until_rebooted(instance_id, timeout)
+                nodeaction.wait_for_unknown_status(
+                    node, timeout, affected_node
+                )
+                nodeaction.wait_for_ready_status(
+                    node, timeout, affected_node
+                )
+                logging.info(
+                    "Node with instance ID: %s has rebooted successfully" % node
+                )
+                logging.info(
+                    "node_reboot_scenario has been successfully injected!"
+                )
+
+        except Exception as e:
+            logging.error("Failed to reboot node instance. Test Failed")
+            logging.error("node_reboot_scenario injection failed!")
+
+
+    def node_terminate_scenario(self, instance_kill_count, node, timeout):
+        try:
+            instance_id = self.ibmcloud.get_instance_id(node)
+            for _ in range(instance_kill_count):
+                affected_node = AffectedNode(node, node_id=instance_id)
+                logging.info(
+                    "Starting node_termination_scenario injection by first stopping the node"
+                )
+                logging.info("Deleting the node with instance ID: %s " % (node))
+                self.ibmcloud.delete_instance(instance_id)
+                self.ibmcloud.wait_until_deleted(node, timeout, affected_node)
+                logging.info(
+                    "Node with instance ID: %s has been released" % node
+                )
+                logging.info(
+                    "node_terminate_scenario has been successfully injected!"
+                )
+        except Exception as e:
+            logging.error("Failed to terminate node instance. Test Failed")
+            logging.error("node_terminate_scenario injection failed!")
+
--- a/krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py
+++ b/krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py
@@ -23,7 +23,7 @@ from krkn.scenario_plugins.node_actions.general_cloud_node_scenarios import (
    general_node_scenarios,
 )
 from krkn.scenario_plugins.node_actions.vmware_node_scenarios import vmware_node_scenarios
-
+from krkn.scenario_plugins.node_actions.ibmcloud_node_scenarios import ibm_node_scenarios
 node_general = False


@@ -113,6 +113,11 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
            or node_scenario["cloud_type"].lower() == "vmware"
        ):
            return vmware_node_scenarios(kubecli, affected_nodes_status)
+        elif (
+            node_scenario["cloud_type"].lower() == "ibm"
+            or node_scenario["cloud_type"].lower() == "ibmcloud"
+        ):
+            return ibm_node_scenarios(kubecli, affected_nodes_status)
        else:
            logging.error(
                "Cloud type "
--- a/krkn/scenario_plugins/shut_down/shut_down_scenario_plugin.py
+++ b/krkn/scenario_plugins/shut_down/shut_down_scenario_plugin.py
@@ -13,7 +13,9 @@ from krkn.scenario_plugins.node_actions.aws_node_scenarios import AWS
 from krkn.scenario_plugins.node_actions.az_node_scenarios import Azure
 from krkn.scenario_plugins.node_actions.gcp_node_scenarios import GCP
 from krkn.scenario_plugins.node_actions.openstack_node_scenarios import OPENSTACKCLOUD
-from krkn.scenario_plugins.native.node_scenarios.ibmcloud_plugin import IbmCloud
+from krkn.scenario_plugins.node_actions.ibmcloud_node_scenarios import IbmCloud
+
+import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction

 from krkn_lib.models.k8s import AffectedNodeStatus, AffectedNode

@@ -38,7 +40,7 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
                    shut_down_config_scenario, lib_telemetry.get_lib_kubernetes(), affected_nodes_status
                )

-                scenario_telemetry.affected_nodes = affected_nodes_status
+                scenario_telemetry.affected_nodes = affected_nodes_status.affected_nodes
                end_time = int(time.time())
                cerberus.publish_kraken_status(krkn_config, [], start_time, end_time)
                return 0
@@ -56,7 +58,6 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
                pool = ThreadPool(processes=len(nodes))
            else:
                pool = ThreadPool(processes=processes)
-            logging.info("nodes type " + str(type(nodes[0])))
            if type(nodes[0]) is tuple:
                node_id = []
                node_info = []
@@ -105,9 +106,8 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
        node_id = []
        for node in nodes:
            instance_id = cloud_object.get_instance_id(node)
-            affected_nodes_status.affected_nodes.append(AffectedNode(node))
+            affected_nodes_status.affected_nodes.append(AffectedNode(node, node_id=instance_id))
            node_id.append(instance_id)
-        logging.info("node id list " + str(node_id))
        for _ in range(runs):
            logging.info("Starting cluster_shut_down scenario injection")
            stopping_nodes = set(node_id)
@@ -117,8 +117,7 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
            while len(stopping_nodes) > 0:
                for node in stopping_nodes:
                    affected_node = affected_nodes_status.get_affected_node_index(node)
-                    # need to add in time that is passing while waiting for other nodes to be stopped
-                    affected_node.set_cloud_stopping_time(time.time() - start_time)
+
                    if type(node) is tuple:
                        node_status = cloud_object.wait_until_stopped(
                            node[1], node[0], timeout, affected_node
@@ -129,6 +128,8 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
                    # Only want to remove node from stopping list
                    # when fully stopped/no error
                    if node_status:
+                        # need to add in time that is passing while waiting for other nodes to be stopped
+                        affected_node.set_cloud_stopping_time(time.time() - start_time)
                        stopped_nodes.remove(node)

                stopping_nodes = stopped_nodes.copy()
@@ -148,7 +149,7 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
                for node in not_running_nodes:
                    affected_node = affected_nodes_status.get_affected_node_index(node)
                    # need to add in time that is passing while waiting for other nodes to be running
-                    affected_node.set_cloud_running_time(time.time() - start_time)
+
                    if type(node) is tuple:
                        node_status = cloud_object.wait_until_running(
                            node[1], node[0], timeout, affected_node
@@ -156,8 +157,10 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
                    else:
                        node_status = cloud_object.wait_until_running(node, timeout, affected_node)
                    if node_status:
+                        affected_node.set_cloud_running_time(time.time() - start_time)
                        restarted_nodes.remove(node)
                not_running_nodes = restarted_nodes.copy()
+
            logging.info("Waiting for 150s to allow cluster component initialization")
            time.sleep(150)

--- a/krkn/utils/HealthChecker.py
+++ b/krkn/utils/HealthChecker.py
@@ -0,0 +1,83 @@
+import requests
+import time
+import logging
+import queue
+from datetime import datetime
+from krkn_lib.models.telemetry.models import HealthCheck
+
+class HealthChecker:
+    current_iterations: int = 0
+    ret_value = 0
+    def __init__(self, iterations):
+        self.iterations = iterations
+
+    def make_request(self, url, auth=None, headers=None):
+        response_data = {}
+        response = requests.get(url, auth=auth, headers=headers)
+        response_data["url"] = url
+        response_data["status"] = response.status_code == 200
+        response_data["status_code"] = response.status_code
+        return response_data
+
+
+    def run_health_check(self, health_check_config, health_check_telemetry_queue: queue.Queue):
+        if health_check_config and health_check_config["config"] and any(config.get("url") for config in health_check_config["config"]):
+            health_check_start_time_stamp = datetime.now()
+            health_check_telemetry = []
+            health_check_tracker = {}
+            interval = health_check_config["interval"] if health_check_config["interval"] else 2
+            response_tracker = {config["url"]:True for config in health_check_config["config"]}
+            while self.current_iterations < self.iterations:
+                for config in health_check_config.get("config"):
+                    auth, headers = None, None
+                    if config["url"]: url = config["url"]
+
+                    if config["bearer_token"]:
+                        bearer_token = "Bearer " + config["bearer_token"]
+                        headers = {"Authorization": bearer_token}
+
+                    if config["auth"]: auth = config["auth"]
+                    response = self.make_request(url, auth, headers)
+
+                    if response["status_code"] != 200:
+                        if config["url"] not in health_check_tracker:
+                            start_timestamp = datetime.now()
+                            health_check_tracker[config["url"]] = {
+                                "status_code": response["status_code"],
+                                "start_timestamp": start_timestamp
+                            }
+                            if response_tracker[config["url"]] != False: response_tracker[config["url"]] = False
+                            if config["exit_on_failure"] and config["exit_on_failure"] == True and self.ret_value==0: self.ret_value = 2
+                    else:
+                        if config["url"] in health_check_tracker:
+                            end_timestamp = datetime.now()
+                            start_timestamp = health_check_tracker[config["url"]]["start_timestamp"]
+                            previous_status_code = str(health_check_tracker[config["url"]]["status_code"])
+                            duration = (end_timestamp - start_timestamp).total_seconds()
+                            downtime_record = {
+                                "url": config["url"],
+                                "status": False,
+                                "status_code": previous_status_code,
+                                "start_timestamp": start_timestamp.isoformat(),
+                                "end_timestamp": end_timestamp.isoformat(),
+                                "duration": duration
+                            }
+                            health_check_telemetry.append(HealthCheck(downtime_record))
+                            del health_check_tracker[config["url"]]
+                    time.sleep(interval)
+            health_check_end_time_stamp = datetime.now()
+            for url, status in response_tracker.items():
+                if status == True:
+                    duration = (health_check_end_time_stamp - health_check_start_time_stamp).total_seconds()
+                    success_response = {
+                        "url": url,
+                        "status": True,
+                        "status_code": 200,
+                        "start_timestamp": health_check_start_time_stamp.isoformat(),
+                        "end_timestamp": health_check_end_time_stamp.isoformat(),
+                        "duration": duration
+                    }
+                    health_check_telemetry.append(HealthCheck(success_response))
+            health_check_telemetry_queue.put(health_check_telemetry)
+        else:
+            logging.info("health checks config is not defined, skipping them")
--- a/krkn/utils/functions.py
+++ b/krkn/utils/functions.py
@@ -3,10 +3,10 @@ from krkn_lib.k8s import KrknKubernetes
 from krkn_lib.models.telemetry import ScenarioTelemetry
 from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
 from tzlocal.unix import get_localzone
-
+import logging

 def populate_cluster_events(
-    scenario_telemetry: ScenarioTelemetry,
+    krkn_config: dict,
    scenario_config: dict,
    kubecli: KrknKubernetes,
    start_timestamp: int,
@@ -31,8 +31,12 @@ def populate_cluster_events(
                    namespace=namespace,
                )
            )
-
-    scenario_telemetry.set_cluster_events(events)
+    archive_path = krkn_config["telemetry"]["archive_path"]
+    file_path = archive_path + "/events.json"
+    with open(file_path, "w+") as f:
+        f.write("\n".join(str(item) for item in events))
+    logging.info(f'Find cluster events in file {file_path}' )
+    


 def collect_and_put_ocp_logs(
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,7 +6,7 @@ azure-identity==1.16.1
 azure-keyvault==4.2.0
 azure-mgmt-compute==30.5.0
 itsdangerous==2.0.1
-coverage==7.4.1
+coverage==7.6.12
 datetime==5.4
 docker==7.0.0
 gitpython==3.1.41
@@ -14,8 +14,8 @@ google-auth==2.37.0
 google-cloud-compute==1.22.0
 ibm_cloud_sdk_core==3.18.0
 ibm_vpc==0.20.0
-jinja2==3.1.5
-krkn-lib==4.0.7
+jinja2==3.1.6
+krkn-lib==5.0.0
 lxml==5.1.0
 kubernetes==28.1.0
 numpy==1.26.4
@@ -35,6 +35,7 @@ werkzeug==3.0.6
 wheel==0.42.0
 zope.interface==5.4.0

+
 git+https://github.com/krkn-chaos/arcaflow-plugin-kill-pod.git@v0.1.0
 git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.0.0
 cryptography>=42.0.4 # not directly required, pinned by Snyk to avoid a vulnerability
--- a/run_kraken.py
+++ b/run_kraken.py
@@ -9,6 +9,8 @@ import optparse
 import pyfiglet
 import uuid
 import time
+import queue
+import threading

 from krkn_lib.elastic.krkn_elastic import KrknElastic
 from krkn_lib.models.elastic import ElasticChaosRunTelemetry
@@ -26,6 +28,7 @@ from krkn_lib.utils import SafeLogger
 from krkn_lib.utils.functions import get_yaml_item_value, get_junit_test_case

 from krkn.utils import TeeLogHandler
+from krkn.utils.HealthChecker import HealthChecker
 from krkn.scenario_plugins.scenario_plugin_factory import (
    ScenarioPluginFactory,
    ScenarioPluginNotFound,
@@ -49,9 +52,7 @@ def main(cfg) -> int:
        with open(cfg, "r") as f:
            config = yaml.full_load(f)
        global kubeconfig_path, wait_duration, kraken_config
-        distribution = get_yaml_item_value(
-            config["kraken"], "distribution", "openshift"
-        )
+
        kubeconfig_path = os.path.expanduser(
            get_yaml_item_value(config["kraken"], "kubeconfig_path", "")
        )
@@ -90,13 +91,6 @@ def main(cfg) -> int:
        )
        # elastic search
        enable_elastic = get_yaml_item_value(config["elastic"], "enable_elastic", False)
-        elastic_collect_metrics = get_yaml_item_value(
-            config["elastic"], "collect_metrics", False
-        )
-
-        elastic_colllect_alerts = get_yaml_item_value(
-            config["elastic"], "collect_alerts", False
-        )

        elastic_url = get_yaml_item_value(config["elastic"], "elastic_url", "")

@@ -127,10 +121,11 @@ def main(cfg) -> int:
            config["performance_monitoring"], "check_critical_alerts", False
        )
        telemetry_api_url = config["telemetry"].get("api_url")
+        health_check_config = config["health_checks"]

        # Initialize clients
        if not os.path.isfile(kubeconfig_path) and not os.path.isfile(
-            "/var/run/secrets/kubernetes.io/serviceaccount/token"
+                "/var/run/secrets/kubernetes.io/serviceaccount/token"
        ):
            logging.error(
                "Cannot read the kubeconfig file at %s, please check" % kubeconfig_path
@@ -167,6 +162,11 @@ def main(cfg) -> int:
        except:
            kubecli.initialize_clients(None)

+        distribution = "kubernetes"
+        if ocpcli.is_openshift():
+            distribution = "openshift"
+        logging.info("Detected distribution %s" % (distribution))
+
        # find node kraken might be running on
        kubecli.find_kraken_node()

@@ -203,7 +203,7 @@ def main(cfg) -> int:
                    else:
                        # If can't make a connection, set alerts to false
                        enable_alerts = False
-                        critical_alerts = False
+                        check_critical_alerts = False
                except Exception:
                    logging.error(
                        "invalid distribution selected, running openshift scenarios against kubernetes cluster."
@@ -223,6 +223,7 @@ def main(cfg) -> int:
            safe_logger, ocpcli, telemetry_request_id, config["telemetry"]
        )
        if enable_elastic:
+            logging.info(f"Elastic collection enabled at: {elastic_url}:{elastic_port}")
            elastic_search = KrknElastic(
                safe_logger,
                elastic_url,
@@ -271,8 +272,8 @@ def main(cfg) -> int:
        classes_and_types: dict[str, list[str]] = {}
        for loaded in scenario_plugin_factory.loaded_plugins.keys():
            if (
-                scenario_plugin_factory.loaded_plugins[loaded].__name__
-                not in classes_and_types.keys()
+                    scenario_plugin_factory.loaded_plugins[loaded].__name__
+                    not in classes_and_types.keys()
            ):
                classes_and_types[
                    scenario_plugin_factory.loaded_plugins[loaded].__name__
@@ -299,6 +300,12 @@ def main(cfg) -> int:
                module_name, class_name, error = failed
                logging.error(f"⛔ Class: {class_name} Module: {module_name}")
                logging.error(f"⚠️ {error}\n")
+        health_check_telemetry_queue = queue.Queue()
+        health_checker = HealthChecker(iterations)
+        health_check_worker = threading.Thread(target=health_checker.run_health_check,
+                                               args=(health_check_config, health_check_telemetry_queue))
+        health_check_worker.start()
+
        # Loop to run the chaos starts here
        while int(iteration) < iterations and run_signal != "STOP":
            # Inject chaos scenarios specified in the config
@@ -359,12 +366,18 @@ def main(cfg) -> int:
                                break

            iteration += 1
+            health_checker.current_iterations += 1

        # telemetry
        # in order to print decoded telemetry data even if telemetry collection
        # is disabled, it's necessary to serialize the ChaosRunTelemetry object
        # to json, and recreate a new object from it.
        end_time = int(time.time())
+        health_check_worker.join()
+        try:
+            chaos_telemetry.health_checks = health_check_telemetry_queue.get_nowait()
+        except queue.Empty:
+            chaos_telemetry.health_checks = None

        # if platform is openshift will be collected
        # Cloud platform and network plugins metadata
@@ -419,9 +432,9 @@ def main(cfg) -> int:
                        )
                    else:
                        if (
-                            config["telemetry"]["prometheus_namespace"]
-                            and config["telemetry"]["prometheus_pod_name"]
-                            and config["telemetry"]["prometheus_container_name"]
+                                config["telemetry"]["prometheus_namespace"]
+                                and config["telemetry"]["prometheus_pod_name"]
+                                and config["telemetry"]["prometheus_container_name"]
                        ):
                            try:
                                prometheus_archive_files = (
@@ -470,8 +483,7 @@ def main(cfg) -> int:
                    start_time,
                    end_time,
                    alert_profile,
-                    elastic_colllect_alerts,
-                    elastic_alerts_index,
+                    elastic_alerts_index
                )

            else:
@@ -479,15 +491,15 @@ def main(cfg) -> int:
                return 1
                # sys.exit(1)
        if enable_metrics:
+            logging.info(f'Capturing metrics using file {metrics_profile}')
            prometheus_plugin.metrics(
                prometheus,
                elastic_search,
-                start_time,
                run_uuid,
+                start_time,
                end_time,
                metrics_profile,
-                elastic_collect_metrics,
-                elastic_metrics_index,
+                elastic_metrics_index
            )

        if post_critical_alerts > 0:
@@ -501,6 +513,9 @@ def main(cfg) -> int:
            )
            # sys.exit(2)
            return 2
+        if health_checker.ret_value != 0:
+            logging.error("Health check failed for the applications, Please check; exiting")
+            return health_checker.ret_value

        logging.info(
            "Successfully finished running Kraken. UUID for the run: "
@@ -643,4 +658,4 @@ if __name__ == "__main__":
        with open(junit_testcase_file_path, "w") as stream:
            stream.write(junit_testcase_xml)

-    sys.exit(retval)
+    sys.exit(retval)
--- a/scenarios/openshift/ibmcloud_node_scenarios.yml
+++ b/scenarios/openshift/ibmcloud_node_scenarios.yml
@@ -1,10 +1,16 @@
-# yaml-language-server: $schema=../plugin.schema.json
- id: <ibmcloud-node-terminate/ibmcloud-node-reboot/ibmcloud-node-stop/ibmcloud-node-start>
-  config:
-    name: ""        
-    label_selector: "node-role.kubernetes.io/worker"    # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection 
-    runs: 1                             # Number of times to inject each scenario under actions (will perform on same node each time)                                                           
-    instance_count: 1                   # Number of nodes to perform action/select that match the label selector                                             
-    timeout: 360                         # Duration to wait for completion of node scenario injection
-    duration: 120                       # Duration to stop the node before running the start action 
-    skip_openshift_checks: False        # Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario 
+node_scenarios:
+  - actions:
+    - node_stop_start_scenario
+    node_name:
+    label_selector: node-role.kubernetes.io/worker
+    instance_count: 1
+    timeout: 360
+    duration: 120
+    cloud_type: ibm
+  - actions:
+    - node_reboot_scenario
+    node_name:
+    label_selector: node-role.kubernetes.io/worker
+    instance_count: 1
+    timeout: 120
+    cloud_type: ibm
Author	SHA1	Message	Date
Tullio Sebastiani	cecaa1eda3	removed deprecated ES fields + removed host validator (#774 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m6s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped DCO Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>	2025-03-19 13:10:44 -04:00
Paige Patton	5450ecb914	adding scenario type (#758 ) Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-03-19 17:38:45 +01:00
Paige Patton	cad6b68f43	adding collecting metrics (#752 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 1m28s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-03-19 17:08:44 +01:00
Paige Patton	0eba329305	moving ibm node to non native Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-03-19 15:02:12 +00:00
Tullio Sebastiani	ce8593f2f0	random network policy name to allow parallel scenario run on the same cluster fix name	2025-03-19 14:28:35 +00:00
Paige Patton	9061ddbb5b	adding cluster events into file Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m30s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-03-18 15:28:45 +00:00
kattameghana	dd4d0d0389	Health checks implementation for application endpoints (#761 ) * Hog scenario porting from arcaflow to native (#748) * added new native hog scenario * removed arcaflow dependency + legacy hog scenarios * config update * changed hog configuration structure + added average samples * fix on cpu count * removes tripledes warning * changed selector format * changed selector syntax * number of nodes option * documentation * functional tests * exception handling on hog deployment thread Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Hog scenario porting from arcaflow to native (#748) * added new native hog scenario * removed arcaflow dependency + legacy hog scenarios * config update * changed hog configuration structure + added average samples * fix on cpu count * removes tripledes warning * changed selector format * changed selector syntax * number of nodes option * documentation * functional tests * exception handling on hog deployment thread Signed-off-by: Paige Patton <prubenda@redhat.com> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * adding vsphere updates to non native Signed-off-by: Paige Patton <prubenda@redhat.com> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * adding node id to affected node Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Fixed the spelling mistake Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * adding v4.0.8 version (#756) Signed-off-by: Paige Patton <prubenda@redhat.com> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Add autodetecting distribution (#753) Used is_openshift function from krkn lib Remove distribution from config Remove distribution from documentation Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes include health check doc and exit_on_failure config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes include health check doc and exit_on_failure config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Added the health check config in functional test config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Modified the health checks documentation Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for debugging the functional test failing Signed-off-by: kattameghana <meghanakatta8@gmail.com> * changed the code for debugging in run_test.sh Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removed the functional test running line Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removing the health check config in common_test_config for debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Fixing functional test fialure Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removing the changes that are added for debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * few modifications Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Renamed timestamp Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changed the start timestamp and end timestamp data type to the datetime Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes include health check doc and exit_on_failure config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Hog scenario porting from arcaflow to native (#748) * added new native hog scenario * removed arcaflow dependency + legacy hog scenarios * config update * changed hog configuration structure + added average samples * fix on cpu count * removes tripledes warning * changed selector format * changed selector syntax * number of nodes option * documentation * functional tests * exception handling on hog deployment thread Signed-off-by: Paige Patton <prubenda@redhat.com> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * adding node id to affected node Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes include health check doc and exit_on_failure config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Added the health check config in functional test config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Modified the health checks documentation Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for debugging the functional test failing Signed-off-by: kattameghana <meghanakatta8@gmail.com> * changed the code for debugging in run_test.sh Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removed the functional test running line Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removing the health check config in common_test_config for debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Fixing functional test fialure Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removing the changes that are added for debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * few modifications Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Renamed timestamp Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Hog scenario porting from arcaflow to native (#748) * added new native hog scenario * removed arcaflow dependency + legacy hog scenarios * config update * changed hog configuration structure + added average samples * fix on cpu count * removes tripledes warning * changed selector format * changed selector syntax * number of nodes option * documentation * functional tests * exception handling on hog deployment thread Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Hog scenario porting from arcaflow to native (#748) * added new native hog scenario * removed arcaflow dependency + legacy hog scenarios * config update * changed hog configuration structure + added average samples * fix on cpu count * removes tripledes warning * changed selector format * changed selector syntax * number of nodes option * documentation * functional tests * exception handling on hog deployment thread Signed-off-by: Paige Patton <prubenda@redhat.com> Signed-off-by: kattameghana <meghanakatta8@gmail.com> * adding node id to affected node Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes include health check doc and exit_on_failure config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * initial version of health checks Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for appending success response and health check config format Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Update config.yaml Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Added the health check config in functional test config Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changes for debugging the functional test failing Signed-off-by: kattameghana <meghanakatta8@gmail.com> * changed the code for debugging in run_test.sh Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removed the functional test running line Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removing the health check config in common_test_config for debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Fixing functional test fialure Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Removing the changes that are added for debugging Signed-off-by: kattameghana <meghanakatta8@gmail.com> * few modifications Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Renamed timestamp Signed-off-by: kattameghana <meghanakatta8@gmail.com> * passing the health check response as HealthCheck object Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Updated the krkn-lib version in requirements.txt Signed-off-by: kattameghana <meghanakatta8@gmail.com> * Changed the coverage Signed-off-by: kattameghana <meghanakatta8@gmail.com> --------- Signed-off-by: kattameghana <meghanakatta8@gmail.com> Signed-off-by: Paige Patton <prubenda@redhat.com> Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb> Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com> Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com> Co-authored-by: Paige Patton <prubenda@redhat.com> Co-authored-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb> Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com> Co-authored-by: jtydlack <139967002+jtydlack@users.noreply.github.com>	2025-03-18 12:08:30 +00:00
dependabot[bot]	0cabe5e91d	Bump jinja2 from 3.1.5 to 3.1.6 (#768 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m45s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6) --- updated-dependencies: - dependency-name: jinja2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>	2025-03-06 22:25:05 -05:00
Naga Ravi Chaitanya Elluri	32fe0223ff	Add recommendations around Pod Disruption Budgets Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m14s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped This commit adds recommendation to test and ensure Pod Disruption Budgets are set for critical applications to avoid downtime. Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>	2025-03-06 07:56:02 -05:00
jtydlack	a25736ad08	Add autodetecting distribution (#753 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m12s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Used is_openshift function from krkn lib Remove distribution from config Remove distribution from documentation Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>	2025-02-13 15:45:08 -05:00
Paige Patton	440890d252	adding v4.0.8 version (#756 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 3m50s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-02-05 13:46:58 -05:00
Meghana Katta	69bf20fc76	Fixed the spelling mistake Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>	2025-02-05 12:53:30 -05:00
Paige Patton	2a42a2dc31	adding node id to affected node Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m9s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped	2025-02-03 19:30:52 -05:00