Add exclude_label functionality to pod disruption scenarios (#910 )

* kill pod exclude label Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> * config alignment Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> --------- Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
adding virt checks to metric info (#918 )
2026-03-11 22:22:26 +00:00 · 2025-10-08 22:10:27 +02:00 · 2025-10-08 15:43:48 -04:00 · 2025-10-08 15:38:20 -04:00 · 2025-10-08 16:01:41 +02:00 · 2025-10-06 16:03:13 -04:00
27 changed files with 470 additions and 728 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -0,0 +1 @@
+*   @paigerube14 @tsebastiani @chaitanyaenr
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -16,6 +16,7 @@ jobs:
          PREVIOUS_TAG=$(git tag --sort=-creatordate | sed -n '2 p')
          echo $PREVIOUS_TAG 
          echo "PREVIOUS_TAG=$PREVIOUS_TAG" >> "$GITHUB_ENV"
+
      - name: generate release notes from template
        id: release-notes
        env:
@@ -45,3 +46,15 @@ jobs:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh release create ${{ github.ref_name }} --title "${{ github.ref_name }}"  -F release-notes.md
+
+      - name: Install Syft
+        run: |
+          curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sudo sh -s -- -b /usr/local/bin
+
+      - name: Generate SBOM
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          syft . --scope all-layers --output cyclonedx-json > sbom.json
+          echo "SBOM generated successfully!"
+          gh release upload ${{ github.ref_name }} sbom.json
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -16,14 +16,19 @@ jobs:
        uses: redhat-chaos/actions/kind@main
      - name: Deploy prometheus & Port Forwarding
        uses: redhat-chaos/actions/prometheus@main
-      
      - name: Deploy Elasticsearch
        with:
-          ELASTIC_URL: ${{ vars.ELASTIC_URL }}
-          ELASTIC_PORT: ${{ vars.ELASTIC_PORT }}
-          ELASTIC_USER: ${{ vars.ELASTIC_USER }}
-          ELASTIC_PASSWORD: ${{ vars.ELASTIC_PASSWORD }}
+          ELASTIC_PORT: ${{ env.ELASTIC_PORT }}
+          RUN_ID: ${{ github.run_id }}
        uses: redhat-chaos/actions/elastic@main
+      - name: Download elastic password
+        uses: actions/download-artifact@v4
+        with:
+          name: elastic_password_${{ github.run_id }}
+      - name: Set elastic password on env
+        run: |
+          ELASTIC_PASSWORD=$(cat elastic_password.txt)
+          echo "ELASTIC_PASSWORD=$ELASTIC_PASSWORD" >> "$GITHUB_ENV"
      - name: Install Python
        uses: actions/setup-python@v4
        with:
@@ -73,6 +78,7 @@ jobs:
            echo "test_app_outages" >> ./CI/tests/functional_tests
            echo "test_container"      >> ./CI/tests/functional_tests
            echo "test_pod" >> ./CI/tests/functional_tests
+            echo "test_customapp_pod" >> ./CI/tests/functional_tests
            echo "test_namespace"      >> ./CI/tests/functional_tests
            echo "test_net_chaos"      >> ./CI/tests/functional_tests
            echo "test_time"           >> ./CI/tests/functional_tests
@@ -108,6 +114,7 @@ jobs:
          echo "test_app_outages" >> ./CI/tests/functional_tests
          echo "test_container"      >> ./CI/tests/functional_tests
          echo "test_pod" >> ./CI/tests/functional_tests
+          echo "test_customapp_pod" >> ./CI/tests/functional_tests
          echo "test_namespace"      >> ./CI/tests/functional_tests
          echo "test_net_chaos"      >> ./CI/tests/functional_tests
          echo "test_time"           >> ./CI/tests/functional_tests
--- a/CI/tests/test_customapp_pod.sh
+++ b/CI/tests/test_customapp_pod.sh
@@ -0,0 +1,18 @@
+set -xeEo pipefail
+
+source CI/tests/common.sh
+
+trap error ERR
+trap finish EXIT
+
+function functional_test_customapp_pod_node_selector {
+  export scenario_type="pod_disruption_scenarios"
+  export scenario_file="scenarios/openshift/customapp_pod.yaml"
+  export post_config=""
+  envsubst < CI/config/common_test_config.yaml > CI/config/customapp_pod_config.yaml
+
+  python3 -m coverage run -a run_kraken.py -c CI/config/customapp_pod_config.yaml
+  echo "Pod disruption with node_label_selector test: Success"
+}
+
+functional_test_customapp_pod_node_selector
--- a/CI/tests/test_pod_network_filter.sh
+++ b/CI/tests/test_pod_network_filter.sh
@@ -11,6 +11,7 @@ function functional_pod_network_filter {
  yq -i '.[0].target="pod-network-filter-test"' scenarios/kube/pod-network-filter.yml
  yq -i '.[0].protocols=["tcp"]' scenarios/kube/pod-network-filter.yml
  yq -i '.[0].ports=[443]' scenarios/kube/pod-network-filter.yml
+  yq -i '.performance_monitoring.check_critical_alerts=False' CI/config/pod_network_filter.yaml

  ## Test webservice deployment
  kubectl apply -f ./CI/templates/pod_network_filter.yaml
--- a/README.md
+++ b/README.md
@@ -22,14 +22,8 @@ Kraken injects deliberate failures into Kubernetes clusters to check if it is re
 Instructions on how to setup, configure and run Kraken can be found in the [documentation](https://krkn-chaos.dev/docs/).


-### Blogs and other useful resources
- Blog post on introduction to Kraken: https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes
- Discussion and demo on how Kraken can be leveraged to ensure OpenShift is reliable, performant and scalable: https://www.youtube.com/watch?v=s1PvupI5sD0&ab_channel=OpenShift
- Blog post emphasizing the importance of making Chaos part of Performance and Scale runs to mimic the production environments: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests
- Blog post on findings from Chaos test runs: https://cloud.redhat.com/blog/openshift/kubernetes-chaos-stories
- Discussion with CNCF TAG App Delivery on Krkn workflow, features and addition to CNCF sandbox: [Github](https://github.com/cncf/sandbox/issues/44), [Tracker](https://github.com/cncf/tag-app-delivery/issues/465), [recording](https://www.youtube.com/watch?v=nXQkBFK_MWc&t=722s)
- Blog post on supercharging chaos testing using AI integration in Krkn: https://www.redhat.com/en/blog/supercharging-chaos-testing-using-ai
- Blog post announcing Krkn joining CNCF Sandbox: https://www.redhat.com/en/blog/krknchaos-joining-cncf-sandbox
+### Blogs, podcasts and interviews
+Additional resources, including blog posts, podcasts, and community interviews, can be found on the [website](https://krkn-chaos.dev/blog)


 ### Roadmap
--- a/containers/Dockerfile.template
+++ b/containers/Dockerfile.template
@@ -28,7 +28,7 @@ ENV KUBECONFIG /home/krkn/.kube/config

 # This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
 RUN dnf update && dnf install -y --setopt=install_weak_deps=False \
-    git python39 jq yq gettext wget which ipmitool &&\
+    git python39 jq yq gettext wget which ipmitool openssh-server &&\
    dnf clean all

 # Virtctl 
--- a/containers/krknctl-input.json
+++ b/containers/krknctl-input.json
@@ -444,7 +444,7 @@
    "required": "false"
  },
  {
-    "name": "kubevirt-namespace",
+    "name": "kubevirt-name",
    "short_description": "KubeVirt regex names to watch",
    "description": "KubeVirt regex names to check VMs",
    "variable": "KUBE_VIRT_NAME",
--- a/krkn/prometheus/client.py
+++ b/krkn/prometheus/client.py
@@ -75,10 +75,12 @@ def alerts(
 def critical_alerts(
    prom_cli: KrknPrometheus,
    summary: ChaosRunAlertSummary,
+    elastic: KrknElastic,
    run_id,
    scenario,
    start_time,
    end_time,
+    elastic_alerts_index
 ):
    summary.scenario = scenario
    summary.run_id = run_id
@@ -113,7 +115,6 @@ def critical_alerts(
            summary.chaos_alerts.append(alert)

    post_critical_alerts = prom_cli.process_query(query)
-
    for alert in post_critical_alerts:
        if "metric" in alert:
            alertname = (
@@ -136,6 +137,21 @@ def critical_alerts(
            )
            alert = ChaosRunAlert(alertname, alertstate, namespace, severity)
            summary.post_chaos_alerts.append(alert)
+            if elastic:
+                elastic_alert = ElasticAlert(
+                    run_uuid=run_id,
+                    severity=severity,
+                    alert=alertname,
+                    created_at=end_time,
+                    namespace=namespace,
+                    alertstate=alertstate,
+                    phase="post_chaos"
+                )
+                result = elastic.push_alert(elastic_alert, elastic_alerts_index)
+                if result == -1:
+                    logging.error("failed to save alert on ElasticSearch")
+                pass
+

    during_critical_alerts_count = len(during_critical_alerts)
    post_critical_alerts_count = len(post_critical_alerts)
@@ -149,8 +165,8 @@ def critical_alerts(

    if not firing_alerts:
        logging.info("No critical alerts are firing!!")
-
-
+    
+   
 def metrics(
    prom_cli: KrknPrometheus,
    elastic: KrknElastic,
@@ -252,6 +268,14 @@ def metrics(
                        metric[k] = v
                        metric['timestamp'] = str(datetime.datetime.now())
                    metrics_list.append(metric.copy())
+        if telemetry_json['virt_checks']:
+            for virt_check in telemetry_json["virt_checks"]:
+                    metric_name = "virt_check_recovery"
+                    metric = {"metricName": metric_name}
+                    for k,v in virt_check.items():
+                        metric[k] = v
+                        metric['timestamp'] = str(datetime.datetime.now())
+                    metrics_list.append(metric.copy())

        save_metrics = False
        if elastic is not None and elastic_metrics_index is not None:
--- a/krkn/scenario_plugins/container/container_scenario_plugin.py
+++ b/krkn/scenario_plugins/container/container_scenario_plugin.py
@@ -1,10 +1,10 @@
 import logging
 import random
 import time
-
+from asyncio import Future
 import yaml
 from krkn_lib.k8s import KrknKubernetes
-from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
+from krkn_lib.k8s.pod_monitor import select_and_monitor_by_namespace_pattern_and_label
 from krkn_lib.models.telemetry import ScenarioTelemetry
 from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
 from krkn_lib.utils import get_yaml_item_value
@@ -22,30 +22,26 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
        lib_telemetry: KrknTelemetryOpenshift,
        scenario_telemetry: ScenarioTelemetry,
    ) -> int:
-        pool = PodsMonitorPool(lib_telemetry.get_lib_kubernetes())
        try:
            with open(scenario, "r") as f:
                cont_scenario_config = yaml.full_load(f)
                
                for kill_scenario in cont_scenario_config["scenarios"]:
-                    self.start_monitoring(
-                        kill_scenario, pool
+                    future_snapshot = self.start_monitoring(
+                        kill_scenario,
+                        lib_telemetry
                    )
-                    killed_containers = self.container_killing_in_pod(
+                    self.container_killing_in_pod(
                        kill_scenario, lib_telemetry.get_lib_kubernetes()
                    )
-                    result = pool.join()
-                if result.error:
-                    logging.error(
-                        logging.error(
-                            f"ContainerScenarioPlugin pods failed to recovery: {result.error}"
-                        )
-                    )
-                    return 1
-                scenario_telemetry.affected_pods = result
-
-        except (RuntimeError, Exception):
-            logging.error("ContainerScenarioPlugin exiting due to Exception %s")
+                    snapshot = future_snapshot.result()
+                    result = snapshot.get_pods_status()
+                    scenario_telemetry.affected_pods = result
+                    if len(result.unrecovered) > 0:
+                        logging.info("ContainerScenarioPlugin failed with unrecovered containers")
+                        return 1
+        except (RuntimeError, Exception) as e:
+            logging.error("ContainerScenarioPlugin exiting due to Exception %s" % e)
            return 1
        else:
            return 0
@@ -53,17 +49,18 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
    def get_scenario_types(self) -> list[str]:
        return ["container_scenarios"]

-    def start_monitoring(self, kill_scenario: dict, pool: PodsMonitorPool):
+    def start_monitoring(self, kill_scenario: dict, lib_telemetry: KrknTelemetryOpenshift) -> Future:
        
        namespace_pattern = f"^{kill_scenario['namespace']}$"
        label_selector = kill_scenario["label_selector"]
        recovery_time = kill_scenario["expected_recovery_time"]
-        pool.select_and_monitor_by_namespace_pattern_and_label(
+        future_snapshot = select_and_monitor_by_namespace_pattern_and_label(
            namespace_pattern=namespace_pattern,
            label_selector=label_selector,
            max_timeout=recovery_time,
-            field_selector="status.phase=Running"
+            v1_client=lib_telemetry.get_lib_kubernetes().cli
        )
+        return future_snapshot

    def container_killing_in_pod(self, cont_scenario, kubecli: KrknKubernetes):
        scenario_name = get_yaml_item_value(cont_scenario, "name", "")
--- a/krkn/scenario_plugins/kubevirt_vm_outage/kubevirt_vm_outage_scenario_plugin.py
+++ b/krkn/scenario_plugins/kubevirt_vm_outage/kubevirt_vm_outage_scenario_plugin.py
@@ -149,44 +149,48 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
            disable_auto_restart = params.get("disable_auto_restart", False)
            
            if not vm_name:
-                raise Exception("vm_name parameter is required")
+                logging.error("vm_name parameter is required")
+                return 1
+            self.pods_status = PodsStatus()
            vmis_list = self.get_vmis(vm_name,namespace)
-            if len(vmis_list) == 0:
-                raise Exception(f"No matching VMs with name {vm_name} in namespace {namespace}")
-            rand_int = random.randint(0, len(vmis_list) - 1)
-            vmi = vmis_list[rand_int]
+            for _ in range(kill_count):
                
-            logging.info(f"Starting KubeVirt VM outage scenario for VM: {vm_name} in namespace: {namespace}")
-            vmi_name = vmi.get("metadata").get("name")
-            if not self.validate_environment(vmi_name, namespace):
-                return self.pods_status
-                
-            vmi = self.get_vmi(vmi_name, namespace)
-            self.affected_pod = AffectedPod(
-                pod_name=vmi_name,
-                namespace=namespace,
-            )
-            if not vmi:
-                logging.error(f"VMI {vm_name} not found in namespace {namespace}")
-                return self.pods_status
-                
-            self.original_vmi = vmi
-            logging.info(f"Captured initial state of VMI: {vm_name}")
-            result = self.delete_vmi(vmi_name, namespace, disable_auto_restart)
-            if result != 0:
-                return self.pods_status
+                rand_int = random.randint(0, len(vmis_list) - 1)
+                vmi = vmis_list[rand_int]
+                    
+                logging.info(f"Starting KubeVirt VM outage scenario for VM: {vm_name} in namespace: {namespace}")
+                vmi_name = vmi.get("metadata").get("name")
+                if not self.validate_environment(vmi_name, namespace):
+                    return 1
+                    
+                vmi = self.get_vmi(vmi_name, namespace)
+                self.affected_pod = AffectedPod(
+                    pod_name=vmi_name,
+                    namespace=namespace,
+                )
+                if not vmi:
+                    logging.error(f"VMI {vm_name} not found in namespace {namespace}")
+                    return 1
+                    
+                self.original_vmi = vmi
+                logging.info(f"Captured initial state of VMI: {vm_name}")
+                result = self.delete_vmi(vmi_name, namespace, disable_auto_restart)
+                if result != 0:
+                    self.pods_status.unrecovered.append(self.affected_pod)
+                    continue

-            result = self.wait_for_running(vmi_name,namespace, timeout)
-            if result != 0:
-                return self.pods_status
-            
-            self.affected_pod.total_recovery_time = (
-                self.affected_pod.pod_readiness_time
-                + self.affected_pod.pod_rescheduling_time
-            )
+                result = self.wait_for_running(vmi_name,namespace, timeout)
+                if result != 0:
+                    self.pods_status.unrecovered.append(self.affected_pod)
+                    continue
+                
+                self.affected_pod.total_recovery_time = (
+                    self.affected_pod.pod_readiness_time
+                    + self.affected_pod.pod_rescheduling_time
+                )

-            self.pods_status.recovered.append(self.affected_pod)
-            logging.info(f"Successfully completed KubeVirt VM outage scenario for VM: {vm_name}")
+                self.pods_status.recovered.append(self.affected_pod)
+                logging.info(f"Successfully completed KubeVirt VM outage scenario for VM: {vm_name}")
            
            return self.pods_status
            
@@ -316,13 +320,13 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
                time.sleep(1)
                
            logging.error(f"Timed out waiting for VMI {vm_name} to be deleted")
-            self.pods_status.unrecovered = self.affected_pod
+            self.pods_status.unrecovered.append(self.affected_pod)
            return 1
            
        except Exception as e:
            logging.error(f"Error deleting VMI {vm_name}: {e}")
            log_exception(e)
-            self.pods_status.unrecovered = self.affected_pod
+            self.pods_status.unrecovered.append(self.affected_pod)
            return 1

    def wait_for_running(self, vm_name: str, namespace: str, timeout: int = 120) -> int: 
--- a/krkn/scenario_plugins/native/pod_network_outage/kubernetes_functions.py
+++ b/krkn/scenario_plugins/native/pod_network_outage/kubernetes_functions.py
@@ -23,8 +23,7 @@ def create_job(batch_cli, body, namespace="default"):
    """

    try:
-        api_response = batch_cli.create_namespaced_job(
-            body=body, namespace=namespace)
+        api_response = batch_cli.create_namespaced_job(body=body, namespace=namespace)
        return api_response
    except ApiException as api:
        logging.warning(
@@ -71,7 +70,8 @@ def create_pod(cli, body, namespace, timeout=120):
        end_time = time.time() + timeout
        while True:
            pod_stat = cli.read_namespaced_pod(
-                name=body["metadata"]["name"], namespace=namespace)
+                name=body["metadata"]["name"], namespace=namespace
+            )
            if pod_stat.status.phase == "Running":
                break
            if time.time() > end_time:
@@ -121,16 +121,18 @@ def exec_cmd_in_pod(cli, command, pod_name, namespace, container=None):
    return ret


-def list_pods(cli, namespace, label_selector=None):
+def list_pods(cli, namespace, label_selector=None, exclude_label=None):
    """
-    Function used to list pods in a given namespace and having a certain label
+    Function used to list pods in a given namespace and having a certain label and excluding pods with exclude_label
+    and excluding pods with exclude_label
    """

    pods = []
    try:
        if label_selector:
            ret = cli.list_namespaced_pod(
-                namespace, pretty=True, label_selector=label_selector)
+                namespace, pretty=True, label_selector=label_selector
+            )
        else:
            ret = cli.list_namespaced_pod(namespace, pretty=True)
    except ApiException as e:
@@ -140,7 +142,16 @@ def list_pods(cli, namespace, label_selector=None):
            % e
        )
        raise e
+
    for pod in ret.items:
+        # Skip pods with the exclude label if specified
+        if exclude_label and pod.metadata.labels:
+            exclude_key, exclude_value = exclude_label.split("=", 1)
+            if (
+                exclude_key in pod.metadata.labels
+                and pod.metadata.labels[exclude_key] == exclude_value
+            ):
+                continue
        pods.append(pod.metadata.name)

    return pods
@@ -152,8 +163,7 @@ def get_job_status(batch_cli, name, namespace="default"):
    """

    try:
-        return batch_cli.read_namespaced_job_status(
-            name=name, namespace=namespace)
+        return batch_cli.read_namespaced_job_status(name=name, namespace=namespace)
    except Exception as e:
        logging.error(
            "Exception when calling \
@@ -169,7 +179,10 @@ def get_pod_log(cli, name, namespace="default"):
    """

    return cli.read_namespaced_pod_log(
-        name=name, namespace=namespace, _return_http_data_only=True, _preload_content=False
+        name=name,
+        namespace=namespace,
+        _return_http_data_only=True,
+        _preload_content=False,
    )


@@ -191,7 +204,8 @@ def delete_job(batch_cli, name, namespace="default"):
            name=name,
            namespace=namespace,
            body=client.V1DeleteOptions(
-                propagation_policy="Foreground", grace_period_seconds=0),
+                propagation_policy="Foreground", grace_period_seconds=0
+            ),
        )
        logging.debug("Job deleted. status='%s'" % str(api_response.status))
        return api_response
@@ -247,11 +261,8 @@ def get_node(node_name, label_selector, instance_kill_count, cli):
        )
    nodes = list_ready_nodes(cli, label_selector)
    if not nodes:
-        raise Exception(
-            "Ready nodes with the provided label selector do not exist")
-    logging.info(
-        "Ready nodes with the label selector %s: %s" % (label_selector, nodes)
-    )
+        raise Exception("Ready nodes with the provided label selector do not exist")
+    logging.info("Ready nodes with the label selector %s: %s" % (label_selector, nodes))
    number_of_nodes = len(nodes)
    if instance_kill_count == number_of_nodes:
        return nodes
--- a/krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
+++ b/krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
@@ -19,7 +19,11 @@ from . import cerberus


 def get_test_pods(
-    pod_name: str, pod_label: str, namespace: str, kubecli: KrknKubernetes
+    pod_name: str,
+    pod_label: str,
+    namespace: str,
+    kubecli: KrknKubernetes,
+    exclude_label: str = None,
 ) -> typing.List[str]:
    """
    Function that returns a list of pods to apply network policy
@@ -38,11 +42,16 @@ def get_test_pods(
        kubecli (KrknKubernetes)
            - Object to interact with Kubernetes Python client

+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        pod names (string) in the namespace
    """
    pods_list = []
-    pods_list = kubecli.list_pods(label_selector=pod_label, namespace=namespace)
+    pods_list = kubecli.list_pods(
+        label_selector=pod_label, namespace=namespace, exclude_label=exclude_label
+    )
    if pod_name and pod_name not in pods_list:
        raise Exception("pod name not found in namespace ")
    elif pod_name and pod_name in pods_list:
@@ -226,6 +235,10 @@ def apply_outage_policy(

        image (string)
            - Image of network chaos tool
+            
+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        The name of the job created that executes the commands on a node
        for ingress chaos scenario
@@ -324,6 +337,9 @@ def apply_ingress_policy(
        test_execution (String)
            - The order in which the filters are applied

+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        The name of the job created that executes the traffic shaping
        filter
@@ -407,6 +423,9 @@ def apply_net_policy(
        test_execution (String)
            - The order in which the filters are applied

+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        The name of the job created that executes the traffic shaping
        filter
@@ -466,6 +485,9 @@ def get_ingress_cmd(
        duration (str):
            - Duration for which the traffic control is to be done

+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        str: ingress filter
    """
@@ -517,6 +539,9 @@ def get_egress_cmd(
        duration (str):
            - Duration for which the traffic control is to be done

+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        str: egress filter
    """
@@ -652,6 +677,10 @@ def list_bridges(node: str, pod_template, kubecli: KrknKubernetes, image: str) -

        image (string)
            - Image of network chaos tool
+            
+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        List of bridges on the node.
    """
@@ -829,6 +858,9 @@ def check_bridge_interface(
        kubecli (KrknKubernetes)
            - Object to interact with Kubernetes Python client

+        exclude_label (string)
+            - pods matching this label will be excluded from the outage
+
    Returns:
        Returns True if the bridge is found in the  node.
    """
@@ -922,6 +954,15 @@ class InputParams:
        },
    )

+    exclude_label: typing.Optional[str] = field(
+        default=None,
+        metadata={
+            "name": "Exclude label",
+            "description": "Kubernetes label selector for pods to exclude from the chaos. "
+            "Pods matching this label will be excluded even if they match the label_selector",
+        },
+    )
+
    kraken_config: typing.Dict[str, typing.Any] = field(
        default=None,
        metadata={
@@ -1055,7 +1096,11 @@ def pod_outage(

        br_name = get_bridge_name(api_ext, custom_obj)
        pods_list = get_test_pods(
-            test_pod_name, test_label_selector, test_namespace, kubecli
+            test_pod_name,
+            test_label_selector,
+            test_namespace,
+            kubecli,
+            params.exclude_label,
        )

        while not len(pods_list) <= params.instance_count:
@@ -1176,6 +1221,15 @@ class EgressParams:
        },
    )

+    exclude_label: typing.Optional[str] = field(
+        default=None,
+        metadata={
+            "name": "Exclude label",
+            "description": "Kubernetes label selector for pods to exclude from the chaos. "
+            "Pods matching this label will be excluded even if they match the label_selector",
+        },
+    )
+
    kraken_config: typing.Dict[str, typing.Any] = field(
        default=None,
        metadata={
@@ -1314,7 +1368,11 @@ def pod_egress_shaping(

        br_name = get_bridge_name(api_ext, custom_obj)
        pods_list = get_test_pods(
-            test_pod_name, test_label_selector, test_namespace, kubecli
+            test_pod_name,
+            test_label_selector,
+            test_namespace,
+            kubecli,
+            params.exclude_label,
        )

        while not len(pods_list) <= params.instance_count:
@@ -1450,6 +1508,15 @@ class IngressParams:
        },
    )

+    exclude_label: typing.Optional[str] = field(
+        default=None,
+        metadata={
+            "name": "Exclude label",
+            "description": "Kubernetes label selector for pods to exclude from the chaos. "
+            "Pods matching this label will be excluded even if they match the label_selector",
+        },
+    )
+
    kraken_config: typing.Dict[str, typing.Any] = field(
        default=None,
        metadata={
@@ -1589,7 +1656,11 @@ def pod_ingress_shaping(

        br_name = get_bridge_name(api_ext, custom_obj)
        pods_list = get_test_pods(
-            test_pod_name, test_label_selector, test_namespace, kubecli
+            test_pod_name,
+            test_label_selector,
+            test_namespace,
+            kubecli,
+            params.exclude_label,
        )

        while not len(pods_list) <= params.instance_count:
--- a/krkn/scenario_plugins/pod_disruption/models/models.py
+++ b/krkn/scenario_plugins/pod_disruption/models/models.py
@@ -11,6 +11,9 @@ class InputParams:
            self.label_selector = config["label_selector"] if "label_selector" in config else ""
            self.namespace_pattern = config["namespace_pattern"] if "namespace_pattern" in config else ""
            self.name_pattern = config["name_pattern"] if "name_pattern" in config else ""
+            self.node_label_selector = config["node_label_selector"] if "node_label_selector" in config else ""
+            self.node_names = config["node_names"] if "node_names" in config else []
+            self.exclude_label = config["exclude_label"] if "exclude_label" in config else ""

    namespace_pattern: str
    krkn_pod_recovery_time: int
@@ -18,4 +21,7 @@ class InputParams:
    duration: int
    kill: int
    label_selector: str
-    name_pattern: str
+    name_pattern: str
+    node_label_selector: str
+    node_names: list
+    exclude_label: str
--- a/krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py
+++ b/krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py
@@ -1,14 +1,16 @@
 import logging
 import random
 import time
+from asyncio import Future

 import yaml
 from krkn_lib.k8s import KrknKubernetes
-from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
+from krkn_lib.k8s.pod_monitor import select_and_monitor_by_namespace_pattern_and_label, \
+    select_and_monitor_by_name_pattern_and_namespace_pattern
+
 from krkn.scenario_plugins.pod_disruption.models.models import InputParams
 from krkn_lib.models.telemetry import ScenarioTelemetry
 from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
-from krkn_lib.utils import get_yaml_item_value
 from datetime import datetime
 from dataclasses import dataclass

@@ -29,31 +31,25 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
        lib_telemetry: KrknTelemetryOpenshift,
        scenario_telemetry: ScenarioTelemetry,
    ) -> int:
-        pool = PodsMonitorPool(lib_telemetry.get_lib_kubernetes())
        try:
            with open(scenario, "r") as f:
                cont_scenario_config = yaml.full_load(f)
                for kill_scenario in cont_scenario_config:
                    kill_scenario_config = InputParams(kill_scenario["config"])
-                    self.start_monitoring(
-                        kill_scenario_config, pool
+                    future_snapshot=self.start_monitoring(
+                        kill_scenario_config,
+                        lib_telemetry
                    )
-                    return_status = self.killing_pods(
+                    self.killing_pods(
                        kill_scenario_config, lib_telemetry.get_lib_kubernetes()
                    )
-                    if return_status != 0: 
-                        result = pool.cancel()
-                    else:
-                        result = pool.join()
-                if result.error:
-                    logging.error(
-                        logging.error(
-                            f"PodDisruptionScenariosPlugin pods failed to recovery: {result.error}"
-                        )
-                    )
-                    return 1
-                
-                scenario_telemetry.affected_pods = result
+
+                    snapshot = future_snapshot.result()
+                    result = snapshot.get_pods_status()
+                    scenario_telemetry.affected_pods = result
+                    if len(result.unrecovered) > 0:
+                        logging.info("PodDisruptionScenarioPlugin failed with unrecovered pods")
+                        return 1

        except (RuntimeError, Exception) as e:
            logging.error("PodDisruptionScenariosPlugin exiting due to Exception %s" % e)
@@ -64,7 +60,7 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
    def get_scenario_types(self) -> list[str]:
        return ["pod_disruption_scenarios"]

-    def start_monitoring(self, kill_scenario: InputParams, pool: PodsMonitorPool):
+    def start_monitoring(self, kill_scenario: InputParams, lib_telemetry: KrknTelemetryOpenshift) -> Future:

        recovery_time = kill_scenario.krkn_pod_recovery_time
        if (
@@ -73,16 +69,17 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
        ):
            namespace_pattern = kill_scenario.namespace_pattern
            label_selector = kill_scenario.label_selector
-            pool.select_and_monitor_by_namespace_pattern_and_label(
+            future_snapshot = select_and_monitor_by_namespace_pattern_and_label(
                namespace_pattern=namespace_pattern,
                label_selector=label_selector,
                max_timeout=recovery_time,
-                field_selector="status.phase=Running"
+                v1_client=lib_telemetry.get_lib_kubernetes().cli
            )
            logging.info(
                f"waiting up to {recovery_time} seconds for pod recovery, "
                f"pod label pattern: {label_selector} namespace pattern: {namespace_pattern}"
            )
+            return future_snapshot

        elif (
            kill_scenario.namespace_pattern
@@ -90,32 +87,101 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
        ):
            namespace_pattern = kill_scenario.namespace_pattern
            name_pattern = kill_scenario.name_pattern
-            pool.select_and_monitor_by_name_pattern_and_namespace_pattern(
+            future_snapshot = select_and_monitor_by_name_pattern_and_namespace_pattern(
                pod_name_pattern=name_pattern,
                namespace_pattern=namespace_pattern,
                max_timeout=recovery_time,
-                field_selector="status.phase=Running"
+                v1_client=lib_telemetry.get_lib_kubernetes().cli
            )
            logging.info(
                f"waiting up to {recovery_time} seconds for pod recovery, "
                f"pod name pattern: {name_pattern} namespace pattern: {namespace_pattern}"
            )
+            return future_snapshot
        else:
            raise Exception(
                f"impossible to determine monitor parameters, check {kill_scenario} configuration"
            )
+    
+    def _select_pods_with_field_selector(self, name_pattern, label_selector, namespace, kubecli: KrknKubernetes, field_selector: str, node_name: str = None):
+        """Helper function to select pods using either label_selector or name_pattern with field_selector, optionally filtered by node"""
+        # Combine field selectors if node targeting is specified
+        if node_name:
+            node_field_selector = f"spec.nodeName={node_name}"
+            if field_selector:
+                combined_field_selector = f"{field_selector},{node_field_selector}"
+            else:
+                combined_field_selector = node_field_selector
+        else:
+            combined_field_selector = field_selector
        
+        if label_selector:
+            return kubecli.select_pods_by_namespace_pattern_and_label(
+                label_selector=label_selector, 
+                namespace_pattern=namespace, 
+                field_selector=combined_field_selector
+            )
+        else:  # name_pattern
+            return kubecli.select_pods_by_name_pattern_and_namespace_pattern(
+                pod_name_pattern=name_pattern, 
+                namespace_pattern=namespace, 
+                field_selector=combined_field_selector
+            )

-    def get_pods(self, name_pattern, label_selector,namespace, kubecli: KrknKubernetes, field_selector: str =None): 
+    def get_pods(self, name_pattern, label_selector, namespace, kubecli: KrknKubernetes, field_selector: str = None, node_label_selector: str = None, node_names: list = None, quiet: bool = False): 
        if label_selector and name_pattern: 
            logging.error('Only, one of name pattern or label pattern can be specified')
-        elif label_selector:
-            pods = kubecli.select_pods_by_namespace_pattern_and_label(label_selector=label_selector,namespace_pattern=namespace, field_selector=field_selector)
-        elif name_pattern:
-            pods = kubecli.select_pods_by_name_pattern_and_namespace_pattern(pod_name_pattern=name_pattern, namespace_pattern=namespace, field_selector=field_selector)
-        else: 
+            return []
+        
+        if not label_selector and not name_pattern:
            logging.error('Name pattern or label pattern must be specified ')
-        return pods
+            return []
+        
+        # If specific node names are provided, make multiple calls with field selector
+        if node_names:
+            if not quiet:
+                logging.info(f"Targeting pods on {len(node_names)} specific nodes")
+            all_pods = []
+            for node_name in node_names:
+                pods = self._select_pods_with_field_selector(
+                    name_pattern, label_selector, namespace, kubecli, field_selector, node_name
+                )
+                
+                if pods:
+                    all_pods.extend(pods)
+            
+            if not quiet:
+                logging.info(f"Found {len(all_pods)} target pods across {len(node_names)} nodes")
+            return all_pods
+        
+        #  Node label selector approach - use field selectors
+        if node_label_selector:
+            # Get nodes matching the label selector first
+            nodes_with_label = kubecli.list_nodes(label_selector=node_label_selector)
+            if not nodes_with_label:
+                logging.info(f"No nodes found with label selector: {node_label_selector}")
+                return []
+            
+            if not quiet:
+                logging.info(f"Targeting pods on {len(nodes_with_label)} nodes with label: {node_label_selector}")
+            # Use field selector for each node
+            all_pods = []
+            for node_name in nodes_with_label:
+                pods = self._select_pods_with_field_selector(
+                    name_pattern, label_selector, namespace, kubecli, field_selector, node_name
+                )
+                
+                if pods:
+                    all_pods.extend(pods)
+            
+            if not quiet:
+                logging.info(f"Found {len(all_pods)} target pods across {len(nodes_with_label)} nodes")
+            return all_pods
+        
+        # Standard pod selection (no node targeting)
+        return self._select_pods_with_field_selector(
+            name_pattern, label_selector, namespace, kubecli, field_selector
+        )
    
    def killing_pods(self, config: InputParams, kubecli: KrknKubernetes):
        # region Select target pods
@@ -124,7 +190,14 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
        if not namespace: 
            logging.error('Namespace pattern must be specified')

-        pods = self.get_pods(config.name_pattern,config.label_selector,config.namespace_pattern, kubecli, field_selector="status.phase=Running")
+        pods = self.get_pods(config.name_pattern,config.label_selector,config.namespace_pattern, kubecli, field_selector="status.phase=Running", node_label_selector=config.node_label_selector, node_names=config.node_names)
+        exclude_pods = set()
+        if config.exclude_label:
+            _exclude_pods = self.get_pods("",config.exclude_label,config.namespace_pattern, kubecli, field_selector="status.phase=Running", node_label_selector=config.node_label_selector, node_names=config.node_names)
+            for pod in _exclude_pods:
+                exclude_pods.add(pod[0])
+
+
        pods_count = len(pods)
        if len(pods) < config.kill:
            logging.error("Not enough pods match the criteria, expected {} but found only {} pods".format(
@@ -133,23 +206,25 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
        
        random.shuffle(pods)
        for i in range(config.kill):
-
            pod = pods[i]
            logging.info(pod)
-            logging.info(f'Deleting pod {pod[0]}')
-            kubecli.delete_pod(pod[0], pod[1])
+            if pod[0] in exclude_pods:
+                logging.info(f"Excluding {pod[0]} from chaos")
+            else:
+                logging.info(f'Deleting pod {pod[0]}')
+                kubecli.delete_pod(pod[0], pod[1])
        
-        self.wait_for_pods(config.label_selector,config.name_pattern,config.namespace_pattern, pods_count, config.duration, config.timeout, kubecli)
+        self.wait_for_pods(config.label_selector,config.name_pattern,config.namespace_pattern, pods_count, config.duration, config.timeout, kubecli, config.node_label_selector, config.node_names)
        return 0

    def wait_for_pods(
-        self, label_selector, pod_name, namespace, pod_count, duration, wait_timeout, kubecli: KrknKubernetes
+        self, label_selector, pod_name, namespace, pod_count, duration, wait_timeout, kubecli: KrknKubernetes, node_label_selector, node_names
    ):
        timeout = False
        start_time = datetime.now()

        while not timeout:
-            pods = self.get_pods(name_pattern=pod_name, label_selector=label_selector,namespace=namespace, field_selector="status.phase=Running", kubecli=kubecli)
+            pods = self.get_pods(name_pattern=pod_name, label_selector=label_selector,namespace=namespace, field_selector="status.phase=Running", kubecli=kubecli, node_label_selector=node_label_selector, node_names=node_names, quiet=True)
            if pod_count == len(pods):
                return
               
--- a/krkn/utils/VirtChecker.py
+++ b/krkn/utils/VirtChecker.py
@@ -57,7 +57,7 @@ class VirtChecker:
        :param namespace:
        :return: virtctl_status 'True' if successful, or an error message if it fails.
        """
-        virtctl_vm_cmd = f"virtctl ssh --local-ssh-opts='-o BatchMode=yes' --local-ssh-opts='-o PasswordAuthentication=no' --local-ssh-opts='-o ConnectTimeout=2' root@{vm_name} -n {namespace}"
+        virtctl_vm_cmd = f"virtctl ssh --local-ssh-opts='-o BatchMode=yes' --local-ssh-opts='-o PasswordAuthentication=no' --local-ssh-opts='-o ConnectTimeout=2' root@vmi/{vm_name} -n {namespace} 2>&1 |egrep 'denied|verification failed'  && echo 'True' || echo 'False'"
        check_virtctl_vm_cmd = f"virtctl ssh --local-ssh-opts='-o BatchMode=yes' --local-ssh-opts='-o PasswordAuthentication=no' --local-ssh-opts='-o ConnectTimeout=2' root@{vm_name} -n {namespace} 2>&1 |egrep 'denied|verification failed'  && echo 'True' || echo 'False'"
        if 'True' in invoke_no_exit(check_virtctl_vm_cmd):
            return True
--- a/requirements.txt
+++ b/requirements.txt
@@ -16,7 +16,7 @@ google-cloud-compute==1.22.0
 ibm_cloud_sdk_core==3.18.0
 ibm_vpc==0.20.0
 jinja2==3.1.6
-krkn-lib==5.1.2
+krkn-lib==5.1.8
 lxml==5.1.0
 kubernetes==28.1.0
 numpy==1.26.4
--- a/run_kraken.py
+++ b/run_kraken.py
@@ -375,10 +375,12 @@ def main(options, command: Optional[str]) -> int:
                            prometheus_plugin.critical_alerts(
                                prometheus,
                                summary,
+                                elastic_search,
                                run_uuid,
                                scenario_type,
                                start_time,
                                datetime.datetime.now(),
+                                elastic_alerts_index
                            )

                            chaos_output.critical_alerts = summary
--- a/scenarios/openshift/customapp_pod.yaml
+++ b/scenarios/openshift/customapp_pod.yaml
@@ -1,6 +1,15 @@
 # yaml-language-server: $schema=../plugin.schema.json
 - id: kill-pods
  config:
-    namespace_pattern: ^acme-air$
+    namespace_pattern: "kube-system"
    name_pattern: .*
-    krkn_pod_recovery_time: 120
+    krkn_pod_recovery_time: 60
+    kill: 1     # num of pods to kill
+    #Not needed by default, but can be used if you want to target pods on specific nodes
+    # Option 1: Target pods on nodes with specific labels [master/worker nodes]
+    node_label_selector: node-role.kubernetes.io/control-plane=      # Target control-plane nodes (works on both k8s and openshift) 
+    # Option 2: Target pods of specific nodes (testing mixed node types)
+    # node_names: 
+    #   - ip-10-0-31-8.us-east-2.compute.internal      # Worker node 1
+    #   - ip-10-0-48-188.us-east-2.compute.internal    # Worker node 2  
+    #   - ip-10-0-14-59.us-east-2.compute.internal     # Master node 1
--- a/scenarios/openshift/etcd.yml
+++ b/scenarios/openshift/etcd.yml
@@ -4,3 +4,4 @@
    namespace_pattern: ^openshift-etcd$
    label_selector: k8s-app=etcd
    krkn_pod_recovery_time: 120
+    exclude_label: "" # excludes pods marked with this label from chaos
--- a/scenarios/openshift/openshift-apiserver.yml
+++ b/scenarios/openshift/openshift-apiserver.yml
@@ -4,4 +4,5 @@
    namespace_pattern: ^openshift-apiserver$
    label_selector: app=openshift-apiserver-a
    krkn_pod_recovery_time: 120
+    exclude_label: "" # excludes pods marked with this label from chaos

--- a/scenarios/openshift/openshift-kube-apiserver.yml
+++ b/scenarios/openshift/openshift-kube-apiserver.yml
@@ -4,4 +4,5 @@
    namespace_pattern: ^openshift-kube-apiserver$
    label_selector: app=openshift-kube-apiserver
    krkn_pod_recovery_time: 120
+    exclude_label: "" # excludes pods marked with this label from chaos

--- a/scenarios/openshift/prom_kill.yml
+++ b/scenarios/openshift/prom_kill.yml
@@ -2,4 +2,5 @@
  config:
    namespace_pattern: ^openshift-monitoring$
    label_selector: statefulset.kubernetes.io/pod-name=prometheus-k8s-0
-    krkn_pod_recovery_time: 120
+    krkn_pod_recovery_time: 120
+    exclude_label: "" # excludes pods marked with this label from chaos
--- a/scenarios/openshift/regex_openshift_pod_kill.yml
+++ b/scenarios/openshift/regex_openshift_pod_kill.yml
@@ -5,3 +5,4 @@
    name_pattern: .*
    kill: 3
    krkn_pod_recovery_time: 120
+    exclude_label: "" # excludes pods marked with this label from chaos
--- a/scenarios/plugin.schema.README.md
+++ b/scenarios/plugin.schema.README.md
@@ -1,5 +0,0 @@
-This file is generated by running the "plugins" module in the kraken project:
-
-```
-python -m kraken.plugins >scenarios/plugin.schema.json
-```
--- a/scenarios/plugin.schema.json
+++ b/scenarios/plugin.schema.json
@@ -1,584 +0,0 @@
-{
-	"$id": "https://github.com/redhat-chaos/krkn/",
-	"$schema": "https://json-schema.org/draft/2020-12/schema",
-	"title": "Kraken Arcaflow scenarios",
-	"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
-	"type": "array",
-	"minContains": 1,
-	"items": {
-		"oneOf": [
-			{
-				"type": "object",
-				"properties": {
-					"id": {
-						"type": "string",
-						"const": "run_python"
-					},
-					"config": {
-						"$defs": {
-							"RunPythonFileInput": {
-								"type": "object",
-								"properties": {
-									"filename": {
-										"type": "string"
-									}
-								},
-								"required": [
-									"filename"
-								],
-								"additionalProperties": false,
-								"dependentRequired": {}
-							}
-						},
-						"type": "object",
-						"properties": {
-							"filename": {
-								"type": "string"
-							}
-						},
-						"required": [
-							"filename"
-						],
-						"additionalProperties": false,
-						"dependentRequired": {}
-					}
-				},
-				"required": [
-					"id",
-					"config"
-				]
-			},
-			{
-				"type": "object",
-				"title": "pod_network_outage Arcaflow scenarios",
-				"properties": {
-					"id": {
-						"type": "string",
-						"const": "pod_network_outage"
-					},
-					"config": {
-						"$defs": {
-							"InputParams": {
-								"type": "object",
-								"properties": {
-									"namespace": {
-										"type": "string",
-										"minLength": 1,
-										"title": "Namespace",
-										"description": "Namespace of the pod to which filter need to be appliedfor details."
-									},
-									"image": {
-										"type": "string",
-										"minLength": 1,
-										"title": "Image",
-										"default": "image: quay.io/krkn-chaos/krkn:tools",
-										"description": "Image of the krkn tools to run network outage."
-									},
-									"direction": {
-										"type": "array",
-										"items": {
-											"type": "string"
-										},
-										"default": [
-											"ingress",
-											"egress"
-										],
-										"title": "Direction",
-										"description": "List of directions to apply filtersDefault both egress and ingress."
-									},
-									"ingress_ports": {
-										"type": "array",
-										"items": {
-											"type": "integer"
-										},
-										"default": [],
-										"title": "Ingress ports",
-										"description": "List of ports to block traffic onDefault [], i.e. all ports"
-									},
-									"egress_ports": {
-										"type": "array",
-										"items": {
-											"type": "integer"
-										},
-										"default": [],
-										"title": "Egress ports",
-										"description": "List of ports to block traffic onDefault [], i.e. all ports"
-									},
-									"kubeconfig_path": {
-										"type": "string",
-										"title": "Kubeconfig path",
-										"description": "Kubeconfig file as string\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
-									},
-									"pod_name": {
-										"type": "string",
-										"title": "Pod name",
-										"description": "When label_selector is not specified, pod matching the name will beselected for the chaos scenario"
-									},
-									"label_selector": {
-										"type": "string",
-										"title": "Label selector",
-										"description": "Kubernetes label selector for the target pod. When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
-									},
-									"kraken_config": {
-										"type": "string",
-										"title": "Kraken Config",
-										"description": "Path to the config file of Kraken. Set this field if you wish to publish status onto Cerberus"
-									},
-									"test_duration": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 120,
-										"title": "Test duration",
-										"description": "Duration for which each step of the ingress chaos testing is to be performed."
-									},
-									"wait_duration": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 300,
-										"title": "Wait Duration",
-										"description": "Wait duration for finishing a test and its cleanup.Ensure that it is significantly greater than wait_duration"
-									},
-									"instance_count": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 1,
-										"title": "Instance Count",
-										"description": "Number of pods to perform action/select that match the label selector."
-									}
-								},
-								"required": [
-									"namespace"
-								],
-								"additionalProperties": false,
-								"dependentRequired": {}
-							}
-						},
-						"type": "object",
-						"properties": {
-							"namespace": {
-								"type": "string",
-								"minLength": 1,
-								"title": "Namespace",
-								"description": "Namespace of the pod to which filter need to be appliedfor details."
-							},
-							"direction": {
-								"type": "array",
-								"items": {
-									"type": "string"
-								},
-								"default": [
-									"ingress",
-									"egress"
-								],
-								"title": "Direction",
-								"description": "List of directions to apply filtersDefault both egress and ingress."
-							},
-							"ingress_ports": {
-								"type": "array",
-								"items": {
-									"type": "integer"
-								},
-								"default": [],
-								"title": "Ingress ports",
-								"description": "List of ports to block traffic onDefault [], i.e. all ports"
-							},
-							"egress_ports": {
-								"type": "array",
-								"items": {
-									"type": "integer"
-								},
-								"default": [],
-								"title": "Egress ports",
-								"description": "List of ports to block traffic onDefault [], i.e. all ports"
-							},
-							"kubeconfig_path": {
-								"type": "string",
-								"title": "Kubeconfig path",
-								"description": "Kubeconfig file as string\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
-							},
-							"pod_name": {
-								"type": "string",
-								"title": "Pod name",
-								"description": "When label_selector is not specified, pod matching the name will beselected for the chaos scenario"
-							},
-							"label_selector": {
-								"type": "string",
-								"title": "Label selector",
-								"description": "Kubernetes label selector for the target pod. When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
-							},
-							"kraken_config": {
-								"type": "string",
-								"title": "Kraken Config",
-								"description": "Path to the config file of Kraken. Set this field if you wish to publish status onto Cerberus"
-							},
-							"test_duration": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 120,
-								"title": "Test duration",
-								"description": "Duration for which each step of the ingress chaos testing is to be performed."
-							},
-							"wait_duration": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 300,
-								"title": "Wait Duration",
-								"description": "Wait duration for finishing a test and its cleanup.Ensure that it is significantly greater than wait_duration"
-							},
-							"instance_count": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 1,
-								"title": "Instance Count",
-								"description": "Number of pods to perform action/select that match the label selector."
-							}
-						},
-						"required": [
-							"namespace"
-						],
-						"additionalProperties": false,
-						"dependentRequired": {}
-					}
-				},
-				"required": [
-					"id",
-					"config"
-				]
-			},
-			{
-				"type": "object",
-				"title": "pod_egress_shaping Arcaflow scenarios",
-				"properties": {
-					"id": {
-						"type": "string",
-						"const": "pod_egress_shaping"
-					},
-					"config": {
-						"$defs": {
-							"EgressParams": {
-								"type": "object",
-								"properties": {
-									"namespace": {
-										"type": "string",
-										"minLength": 1,
-										"title": "Namespace",
-										"description": "Namespace of the pod to which filter need to be appliedfor details."
-									},
-									"image": {
-										"type": "string",
-										"minLength": 1,
-										"title": "Image",
-										"default": "image: quay.io/krkn-chaos/krkn:tools",
-										"description": "Image of the krkn tools to run network outage."
-									},
-									"kubeconfig_path": {
-										"type": "string",
-										"title": "Kubeconfig path",
-										"description": "Kubeconfig file as string\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
-									},
-									"pod_name": {
-										"type": "string",
-										"title": "Pod name",
-										"description": "When label_selector is not specified, pod matching the name will beselected for the chaos scenario"
-									},
-									"label_selector": {
-										"type": "string",
-										"title": "Label selector",
-										"description": "Kubernetes label selector for the target pod. When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
-									},
-									"kraken_config": {
-										"type": "string",
-										"title": "Kraken Config",
-										"description": "Path to the config file of Kraken. Set this field if you wish to publish status onto Cerberus"
-									},
-									"test_duration": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 90,
-										"title": "Test duration",
-										"description": "Duration for which each step of the ingress chaos testing is to be performed."
-									},
-									"wait_duration": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 300,
-										"title": "Wait Duration",
-										"description": "Wait duration for finishing a test and its cleanup.Ensure that it is significantly greater than wait_duration"
-									},
-									"instance_count": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 1,
-										"title": "Instance Count",
-										"description": "Number of pods to perform action/select that match the label selector."
-									},
-									"execution_type": {
-										"type": "string",
-										"default": "parallel",
-										"title": "Execution Type",
-										"description": "The order in which the ingress filters are applied. Execution type can be 'serial' or 'parallel'"
-									},
-									"network_params": {
-										"type": "object",
-										"propertyNames": {},
-										"additionalProperties": {
-											"type": "string"
-										},
-										"title": "Network Parameters",
-										"description": "The network filters that are applied on the interface. The currently supported filters are latency, loss and bandwidth"
-									}
-								},
-								"required": [
-									"namespace"
-								],
-								"additionalProperties": false,
-								"dependentRequired": {}
-							}
-						},
-						"type": "object",
-						"properties": {
-							"namespace": {
-								"type": "string",
-								"minLength": 1,
-								"title": "Namespace",
-								"description": "Namespace of the pod to which filter need to be appliedfor details."
-							},
-							"kubeconfig_path": {
-								"type": "string",
-								"title": "Kubeconfig path",
-								"description": "Kubeconfig file as string\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
-							},
-							"pod_name": {
-								"type": "string",
-								"title": "Pod name",
-								"description": "When label_selector is not specified, pod matching the name will beselected for the chaos scenario"
-							},
-							"label_selector": {
-								"type": "string",
-								"title": "Label selector",
-								"description": "Kubernetes label selector for the target pod. When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
-							},
-							"kraken_config": {
-								"type": "string",
-								"title": "Kraken Config",
-								"description": "Path to the config file of Kraken. Set this field if you wish to publish status onto Cerberus"
-							},
-							"test_duration": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 90,
-								"title": "Test duration",
-								"description": "Duration for which each step of the ingress chaos testing is to be performed."
-							},
-							"wait_duration": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 300,
-								"title": "Wait Duration",
-								"description": "Wait duration for finishing a test and its cleanup.Ensure that it is significantly greater than wait_duration"
-							},
-							"instance_count": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 1,
-								"title": "Instance Count",
-								"description": "Number of pods to perform action/select that match the label selector."
-							},
-							"execution_type": {
-								"type": "string",
-								"default": "parallel",
-								"title": "Execution Type",
-								"description": "The order in which the ingress filters are applied. Execution type can be 'serial' or 'parallel'"
-							},
-							"network_params": {
-								"type": "object",
-								"propertyNames": {},
-								"additionalProperties": {
-									"type": "string"
-								},
-								"title": "Network Parameters",
-								"description": "The network filters that are applied on the interface. The currently supported filters are latency, loss and bandwidth"
-							}
-						},
-						"required": [
-							"namespace"
-						],
-						"additionalProperties": false,
-						"dependentRequired": {}
-					}
-				},
-				"required": [
-					"id",
-					"config"
-				]
-			},
-			{
-				"type": "object",
-				"title": "pod_ingress_shaping Arcaflow scenarios",
-				"properties": {
-					"id": {
-						"type": "string",
-						"const": "pod_ingress_shaping"
-					},
-					"config": {
-						"$defs": {
-							"IngressParams": {
-								"type": "object",
-								"properties": {
-									"namespace": {
-										"type": "string",
-										"minLength": 1,
-										"title": "Namespace",
-										"description": "Namespace of the pod to which filter need to be appliedfor details."
-									},
-									"image": {
-										"type": "string",
-										"minLength": 1,
-										"title": "Image",
-										"default": "image: quay.io/krkn-chaos/krkn:tools",
-										"description": "Image of the krkn tools to run network outage."
-									},
-									"network_params": {
-										"type": "object",
-										"propertyNames": {},
-										"additionalProperties": {
-											"type": "string"
-										},
-										"title": "Network Parameters",
-										"description": "The network filters that are applied on the interface. The currently supported filters are latency, loss and bandwidth"
-									},
-									"kubeconfig_path": {
-										"type": "string",
-										"title": "Kubeconfig path",
-										"description": "Kubeconfig file as string\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
-									},
-									"pod_name": {
-										"type": "string",
-										"title": "Pod name",
-										"description": "When label_selector is not specified, pod matching the name will beselected for the chaos scenario"
-									},
-									"label_selector": {
-										"type": "string",
-										"title": "Label selector",
-										"description": "Kubernetes label selector for the target pod. When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
-									},
-									"kraken_config": {
-										"type": "string",
-										"title": "Kraken Config",
-										"description": "Path to the config file of Kraken. Set this field if you wish to publish status onto Cerberus"
-									},
-									"test_duration": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 90,
-										"title": "Test duration",
-										"description": "Duration for which each step of the ingress chaos testing is to be performed."
-									},
-									"wait_duration": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 300,
-										"title": "Wait Duration",
-										"description": "Wait duration for finishing a test and its cleanup.Ensure that it is significantly greater than wait_duration"
-									},
-									"instance_count": {
-										"type": "integer",
-										"minimum": 1,
-										"default": 1,
-										"title": "Instance Count",
-										"description": "Number of pods to perform action/select that match the label selector."
-									},
-									"execution_type": {
-										"type": "string",
-										"default": "parallel",
-										"title": "Execution Type",
-										"description": "The order in which the ingress filters are applied. Execution type can be 'serial' or 'parallel'"
-									}
-								},
-								"required": [
-									"namespace"
-								],
-								"additionalProperties": false,
-								"dependentRequired": {}
-							}
-						},
-						"type": "object",
-						"properties": {
-							"namespace": {
-								"type": "string",
-								"minLength": 1,
-								"title": "Namespace",
-								"description": "Namespace of the pod to which filter need to be appliedfor details."
-							},
-							"network_params": {
-								"type": "object",
-								"propertyNames": {},
-								"additionalProperties": {
-									"type": "string"
-								},
-								"title": "Network Parameters",
-								"description": "The network filters that are applied on the interface. The currently supported filters are latency, loss and bandwidth"
-							},
-							"kubeconfig_path": {
-								"type": "string",
-								"title": "Kubeconfig path",
-								"description": "Kubeconfig file as string\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
-							},
-							"pod_name": {
-								"type": "string",
-								"title": "Pod name",
-								"description": "When label_selector is not specified, pod matching the name will beselected for the chaos scenario"
-							},
-							"label_selector": {
-								"type": "string",
-								"title": "Label selector",
-								"description": "Kubernetes label selector for the target pod. When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
-							},
-							"kraken_config": {
-								"type": "string",
-								"title": "Kraken Config",
-								"description": "Path to the config file of Kraken. Set this field if you wish to publish status onto Cerberus"
-							},
-							"test_duration": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 90,
-								"title": "Test duration",
-								"description": "Duration for which each step of the ingress chaos testing is to be performed."
-							},
-							"wait_duration": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 300,
-								"title": "Wait Duration",
-								"description": "Wait duration for finishing a test and its cleanup.Ensure that it is significantly greater than wait_duration"
-							},
-							"instance_count": {
-								"type": "integer",
-								"minimum": 1,
-								"default": 1,
-								"title": "Instance Count",
-								"description": "Number of pods to perform action/select that match the label selector."
-							},
-							"execution_type": {
-								"type": "string",
-								"default": "parallel",
-								"title": "Execution Type",
-								"description": "The order in which the ingress filters are applied. Execution type can be 'serial' or 'parallel'"
-							}
-						},
-						"required": [
-							"namespace"
-						],
-						"additionalProperties": false,
-						"dependentRequired": {}
-					}
-				},
-				"required": [
-					"id",
-					"config"
-				]
-			}
-		]
-	}
-}
--- a/tests/test_pod_network_outage.py
+++ b/tests/test_pod_network_outage.py
@@ -0,0 +1,93 @@
+import unittest
+from unittest.mock import MagicMock, patch
+import sys
+import os
+
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
+
+from krkn.scenario_plugins.native.pod_network_outage.kubernetes_functions import (
+    list_pods,
+)
+from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin import (
+    get_test_pods,
+)
+
+
+class TestPodNetworkOutage(unittest.TestCase):
+    def test_list_pods_with_exclude_label(self):
+        """Test that list_pods correctly excludes pods with matching exclude_label"""
+        # Create mock pod items
+        pod1 = MagicMock()
+        pod1.metadata.name = "pod1"
+        pod1.metadata.labels = {"app": "test", "skip": "true"}
+
+        pod2 = MagicMock()
+        pod2.metadata.name = "pod2"
+        pod2.metadata.labels = {"app": "test"}
+
+        pod3 = MagicMock()
+        pod3.metadata.name = "pod3"
+        pod3.metadata.labels = {"app": "test", "skip": "false"}
+
+        # Create mock API response
+        mock_response = MagicMock()
+        mock_response.items = [pod1, pod2, pod3]
+
+        # Create mock client
+        mock_cli = MagicMock()
+        mock_cli.list_namespaced_pod.return_value = mock_response
+
+        # Test without exclude_label
+        result = list_pods(mock_cli, "test-namespace", "app=test")
+        self.assertEqual(result, ["pod1", "pod2", "pod3"])
+
+        # Test with exclude_label
+        result = list_pods(mock_cli, "test-namespace", "app=test", "skip=true")
+        self.assertEqual(result, ["pod2", "pod3"])
+
+    def test_get_test_pods_with_exclude_label(self):
+        """Test that get_test_pods passes exclude_label to list_pods correctly"""
+        # Create mock kubecli
+        mock_kubecli = MagicMock()
+        mock_kubecli.list_pods.return_value = ["pod2", "pod3"]
+
+        # Test get_test_pods with exclude_label
+        result = get_test_pods(
+            None, "app=test", "test-namespace", mock_kubecli, "skip=true"
+        )
+
+        # Verify list_pods was called with the correct parameters
+        mock_kubecli.list_pods.assert_called_once_with(
+            label_selector="app=test",
+            namespace="test-namespace",
+            exclude_label="skip=true",
+        )
+
+        # Verify the result
+        self.assertEqual(result, ["pod2", "pod3"])
+
+    def test_get_test_pods_with_pod_name_and_exclude_label(self):
+        """Test that get_test_pods prioritizes pod_name over label filters"""
+        # Create mock kubecli
+        mock_kubecli = MagicMock()
+        mock_kubecli.list_pods.return_value = ["pod1", "pod2", "pod3"]
+
+        # Test get_test_pods with both pod_name and exclude_label
+        # The pod_name should take precedence
+        result = get_test_pods(
+            "pod1", "app=test", "test-namespace", mock_kubecli, "skip=true"
+        )
+
+        # Verify list_pods was called with the correct parameters
+        mock_kubecli.list_pods.assert_called_once_with(
+            label_selector="app=test",
+            namespace="test-namespace",
+            exclude_label="skip=true",
+        )
+
+        # Verify the result contains only the specified pod
+        self.assertEqual(result, ["pod1"])
+
+
+if __name__ == "__main__":
+    unittest.main()
Author	SHA1	Message	Date
Tullio Sebastiani	543729b18a	Add exclude_label functionality to pod disruption scenarios (#910 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m15s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped * kill pod exclude label Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> * config alignment Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> --------- Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>	2025-10-08 22:10:27 +02:00
Paige Patton	a0ea4dc749	adding virt checks to metric info (#918 ) Signed-off-by: Paige Patton <prubenda@redhat.com> Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>	2025-10-08 15:43:48 -04:00
Paige Patton	a5459792ef	adding critical alerts to post to elastic search Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-10-08 15:38:20 -04:00
Tullio Sebastiani	d434bb26fa	Feature/add exclude label pod network chaos (#921 ) * feat: Add exclude_label feature to pod network outage scenarios This feature enables filtering out specific pods from network outage chaos testing based on label selectors. Users can now target all pods in a namespace except critical ones by specifying exclude_label. - Added exclude_label parameter to list_pods() function - Updated get_test_pods() to pass the exclude parameter - Added exclude_label field to all relevant plugin classes - Updated schema.json with the new parameter - Added documentation and examples - Created comprehensive unit tests Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com> * krkn-lib update Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> * removed plugin schema Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> --------- Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com> Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> Co-authored-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>	2025-10-08 16:01:41 +02:00
Paige Patton	fee41d404e	adding code owners (#920 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 11m6s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-10-06 16:03:13 -04:00
Tullio Sebastiani	8663ee8893	new elasticsearch action (#919 ) fix Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>	2025-10-06 12:58:26 -04:00
Paige Patton	a072f0306a	adding failure if unrecoverd pod (#908 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 10m48s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-09-17 11:59:45 -04:00
Paige Patton	8221392356	adding kill count Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m29s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-09-17 09:46:32 -04:00
Sahil Shah	671fc581dd	Adding node_label_selector for pod scenarios (#888 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 10m38s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped * Adding node_label_selector for pod scenarios Signed-off-by: Sahil Shah <sahshah@redhat.com> * using kubernetes function, adding node_name and removing extra config Signed-off-by: Sahil Shah <sahshah@redhat.com> * adding CI test for custom pod scenario Signed-off-by: Sahil Shah <sahshah@redhat.com> * fixing comment * adding test to workflow * adding list parsing logic for krkn hub * parsing not needed, as input is always [] --------- Signed-off-by: Sahil Shah <sahshah@redhat.com>	2025-09-15 16:52:08 -04:00
Naga Ravi Chaitanya Elluri	11508ce017	Deprecate blog post links in favor of the website Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m40s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>	2025-09-08 15:04:53 -04:00
Paige Patton	0d78139fb6	increasing krkn lib version (#906 ) Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-09-08 09:05:53 -04:00
Paige Patton	a3baffe8ee	adding vm name option (#904 ) Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m5s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-09-05 12:43:49 -04:00
Tullio Sebastiani	438b08fcd5	[CNCF Incubation] SBOM generation (#900 ) fix Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>	2025-09-05 12:43:37 -04:00
Tullio Sebastiani	9b930a02a5	Implemented the new pod monitoring api on kill pod and kill container scenario (#896 ) * implemented the new pod monitoring api Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> * minor refactoring Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> * krkn-lib 5.1.5 update Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com> --------- Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>	2025-09-05 12:42:57 -04:00
Tullio Sebastiani	194e3b87ee	fixed test_pod_network_filter flaky test (#905 ) syntax syntax fix fix fix Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>	2025-09-05 11:59:30 -04:00
Paige Patton	8c05e44c23	adding ssh install and virtctl version Some checks failed Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m59s Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-09-04 13:57:34 -07:00
Paige Patton	88f8cf49f1	fixing kubevirt name not duplicate namespace Signed-off-by: Paige Patton <prubenda@redhat.com>	2025-09-04 12:45:05 -07:00