Compare commits

...

20 Commits

Author SHA1 Message Date
Paige Patton
edd0159251 adding health check global variables (#798)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-07 15:47:03 +02:00
Naga Ravi Chaitanya Elluri
cf9f7702ed fix: requirements.txt to reduce vulnerabilities (#795)
The following vulnerabilities are fixed by pinning transitive dependencies:
- https://snyk.io/vuln/SNYK-PYTHON-SETUPTOOLS-9964606

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2025-05-07 15:46:16 +02:00
Tullio Sebastiani
cfe624f153 changed get_node_ip to krkn-lib and removed kubectl dependency (#799)
* changed get_node_ip to krkn-lib and removed kubectl dependency

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated krkn-lib to 5.0.1

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-05-07 15:43:27 +02:00
Paige Patton
62f50db195 removing litmus sa (#797)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-07 15:41:49 +02:00
yogananth subramanian
aee838d3ac Fix: Add support for taints (#790) (#791)
2025-05-06 12:51:59 -04:00
Tullio Sebastiani
3b4d8a13f9 network_chaos_ng_scenarios configuration fixes (#794)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-05-02 17:53:14 +02:00
Naga Ravi Chaitanya Elluri
a86bb6ab95 Refactor docs to point to krkn-chaos.dev
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-05-01 09:19:35 -04:00
Paige Patton
7f0110972b updating tuple type for health checks
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-04-28 08:24:14 -04:00
Paige Patton
126f4ebb35 logging getting into ingress shaping file
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-04-21 13:36:11 -04:00
Paige Patton
83d99bbb02 two types of zone outage
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-04-14 13:13:37 -04:00
Tullio Sebastiani
2624102d65 Node Network Filtering Scenario + Network Chaos NG modular architecture (#766)
* network chaos NG modular architecture

error handling

* first working version (missing protocols, number of instances, wait duration)

* added instance_count + sleep + methods documentation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-04-10 16:47:29 +02:00
briankwyu2
02587bcbe6 Update ADOPTERS.md
2025-04-09 12:40:02 -04:00
Sahil Shah
c51bf04f9e Removing Krkn Documentation (#770)
2025-04-08 18:13:42 -04:00
Naga Ravi Chaitanya Elluri
41195b1a60 Add placeholder for capturing adopters
This will enable users and organizations to share their Krkn adoption
journey for their chaos engineering use cases.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-04-08 14:03:03 -04:00
Sahil Shah
ab80acbee7 Adding github-workflow to maintain documentation (#775)
* Adding githubworkflow to maintain documentation

* adding hyperlink
2025-04-08 06:43:47 -04:00
Gareth Healy
3573d13ea9 Fixed deadlink in README.md
Signed-off-by: Gareth Healy <garethahealy@gmail.com>
2025-04-07 14:12:38 -04:00
Tullio Sebastiani
9c5251d52f setuptools + golang stdlib (#781)
* setuptools + golang stdlib

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* equals

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-03-24 14:41:25 +01:00
Paige Patton
a0bba27edc trimming down metrics
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-03-24 10:01:50 +00:00
Tullio Sebastiani
0d0143d1e0 added metrics-patch global krknctl flag
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

indent

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-03-21 14:29:24 +00:00
Naga Ravi Chaitanya Elluri
0004c05f81 Add security policy
This commit adds a policy on how Krkn follows best practices and
addresses security vulnerabilities.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-03-20 17:40:23 +00:00
33 changed files with 813 additions and 432 deletions

.github/PULL_REQUEST_TEMPLATE.md (new file)

@@ -0,0 +1,10 @@
## Description
<!-- Provide a brief description of the changes made in this PR. -->
## Documentation
- [ ] **Is documentation needed for this update?**
If checked, a documentation PR must be created and merged in the [website repository](https://github.com/krkn-chaos/website/).
## Related Documentation PR (if applicable)
<!-- Add the link to the corresponding documentation PR in the website repository -->

.github/workflows/require-docs.yml (new file)

@@ -0,0 +1,45 @@
name: Require Documentation Update
on:
pull_request:
types: [opened, edited, synchronize]
branches:
- main
jobs:
check-docs:
name: Check Documentation Update
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Check if Documentation is Required
id: check_docs
run: |
echo "Checking PR body for documentation checkbox..."
# Read the PR body from the GitHub event payload
if echo "${{ github.event.pull_request.body }}" | grep -qi '\[x\].*documentation needed'; then
echo "Documentation required detected."
echo "docs_required=true" >> $GITHUB_OUTPUT
else
echo "Documentation not required."
echo "docs_required=false" >> $GITHUB_OUTPUT
fi
- name: Enforce Documentation Update (if required)
if: steps.check_docs.outputs.docs_required == 'true'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Retrieve feature branch and repository owner from the GitHub context
FEATURE_BRANCH="${{ github.head_ref }}"
REPO_OWNER="${{ github.repository_owner }}"
WEBSITE_REPO="website"
echo "Searching for a merged documentation PR for feature branch: $FEATURE_BRANCH in $REPO_OWNER/$WEBSITE_REPO..."
MERGED_PR=$(gh pr list --repo "$REPO_OWNER/$WEBSITE_REPO" --state merged --json headRefName,title,url | jq -r \
--arg FEATURE_BRANCH "$FEATURE_BRANCH" '.[] | select(.title | contains($FEATURE_BRANCH)) | .url')
if [[ -z "$MERGED_PR" ]]; then
echo ":x: Documentation PR for branch '$FEATURE_BRANCH' is required and has not been merged."
exit 1
else
echo ":white_check_mark: Found merged documentation PR: $MERGED_PR"
fi
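For reference, the checkbox detection in the `check_docs` step is a case-insensitive substring match on the PR body. A minimal Python sketch of the same logic (the sample body is illustrative only):

```python
import re

# Sample PR body following the template above (illustrative only).
pr_body = """## Description
Adds a new scenario.

## Documentation
- [x] **Is documentation needed for this update?**
"""

# Mirrors the workflow's: grep -qi '\[x\].*documentation needed'
docs_required = re.search(r"\[x\].*documentation needed", pr_body, re.IGNORECASE) is not None
print(f"docs_required={str(docs_required).lower()}")  # -> docs_required=true
```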

ADOPTERS.md (new file)

@@ -0,0 +1,7 @@
# Krkn Adopters
This is a list of organizations that have publicly acknowledged usage of Krkn and shared details of how they are leveraging it in their environment for chaos engineering use cases. Do you want to add yourself to this list? Please fork the repository and open a PR with the required change.
| Organization | Since | Website | Use-Case |
|:-|:-|:-|:-|
| MarketAxess | 2024 | https://www.marketaxess.com/ | Kraken enables us to achieve our goal of increasing the reliability of our cloud products on Kubernetes. The tool allows us to automatically run various chaos scenarios, identify resilience and performance bottlenecks, and seamlessly restore the system to its original state once scenarios finish. These chaos scenarios include pod disruptions, node (EC2) outages, simulating availability zone (AZ) outages, and filling up storage spaces like EBS and EFS. The community is highly responsive to requests and works on expanding the tool's capabilities. MarketAxess actively contributes to the project, adding features such as the ability to leverage existing network ACLs and proposing several feature improvements to enhance test coverage. |

README.md

@@ -10,92 +10,15 @@ Kraken injects deliberate failures into Kubernetes clusters to check if it is re
### Workflow
![Kraken workflow](media/kraken-workflow.png)
### Demo
[![Kraken demo](media/KrakenStarting.png)](https://youtu.be/LN-fZywp_mo "Kraken Demo - Click to Watch!")
![Kraken workflow](media/kraken-workflow.png)
### Chaos Testing Guide
[Guide](docs/index.md) encapsulates:
- Test methodology that needs to be embraced.
- Best practices that a Kubernetes cluster, platform and applications running on top of it should take into account for best user experience, performance, resilience and reliability.
- Tooling.
- Scenarios supported.
- Test environment recommendations as to how and where to run chaos tests.
- Chaos testing in practice.
The guide is hosted at https://krkn-chaos.github.io/krkn.
<!-- ### Demo
[![Kraken demo](media/KrakenStarting.png)](https://youtu.be/LN-fZywp_mo "Kraken Demo - Click to Watch!") -->
### How to Get Started
Instructions on how to setup, configure and run Kraken can be found at [Installation](docs/installation.md).
You may consider utilizing the chaos recommendation tool prior to initiating the chaos runs to profile the application service(s) under test. This tool discovers a list of Krkn scenarios with a high probability of causing failures or disruptions to your application service(s). The tool can be accessed at [Chaos-Recommender](utils/chaos_recommender/README.md).
See the [getting started doc](docs/getting_started.md) on support on how to get started with your own custom scenario or editing current scenarios for your specific usage.
After installation, refer back to the below sections for supported scenarios and how to tweak the kraken config to load them on your cluster.
#### Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to [krkn-hub](https://github.com/krkn-chaos/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
### Config
Instructions on how to setup the config and the options supported can be found at [Config](docs/config.md).
### Kubernetes chaos scenarios supported
Scenario type | Kubernetes
--------------------------- | ------------- |
[Pod Scenarios](docs/pod_scenarios.md) | :heavy_check_mark: |
[Pod Network Scenarios](docs/pod_network_scenarios.md) | :x: |
[Container Scenarios](docs/container_scenarios.md) | :heavy_check_mark: |
[Node Scenarios](docs/node_scenarios.md) | :heavy_check_mark: |
[Time Scenarios](docs/time_scenarios.md) | :heavy_check_mark: |
[Hog Scenarios: CPU, Memory](docs/hog_scenarios.md) | :heavy_check_mark: |
[Cluster Shut Down Scenarios](docs/cluster_shut_down_scenarios.md) | :heavy_check_mark: |
[Service Disruption Scenarios](docs/service_disruption_scenarios.md) | :heavy_check_mark: |
[Zone Outage Scenarios](docs/zone_outage.md) | :heavy_check_mark: |
[Application_outages](docs/application_outages.md) | :heavy_check_mark: |
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: |
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: |
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: |
[Service Hijacking Scenarios](docs/service_hijacking_scenarios.md) | :heavy_check_mark: |
[SYN Flood Scenarios](docs/syn_flood_scenarios.md) | :heavy_check_mark: |
### Kraken scenario pass/fail criteria and report
It is important to check whether the targeted component recovered from the chaos injection and whether the Kubernetes cluster is healthy, as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging [Cerberus](https://github.com/krkn-chaos/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation), or it can be installed from Kraken using the [instructions](https://github.com/krkn-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime, as that is a potential downtime in a customer's or user's environment as well. This is especially important during control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml), provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
- Leveraging built-in alert collection feature to fail the runs in case of critical alerts.
- Utilizing health check endpoints to observe application behavior during chaos injection: [Health checks](docs/health_checks.md)
### Signaling
In CI runs or any external job, it is useful to stop Kraken once a certain test or state is reached. We created a way to signal Kraken to pause the chaos or stop it completely using a signal posted to a port of your choice.
For example, if a test run loading the cluster and Kraken are running separately, we want to know when to start or stop the Kraken run based on when the test run completes or reaches a certain loaded state.
More detailed information on enabling and leveraging this feature can be found [here](docs/signal.md).
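As a purely illustrative sketch (the endpoint, port, and payload below are assumptions, not the documented protocol; see docs/signal.md for the actual mechanism), an external job could toggle a run roughly like this:

```python
import requests

# Hypothetical signal endpoint -- the URL, port, and payload are illustrative
# assumptions only; docs/signal.md defines the real mechanism.
SIGNAL_URL = "http://localhost:8081/signal"

def set_krkn_state(state: str) -> None:
    """Post a desired state (e.g. RUN, PAUSE, STOP) to the signal port."""
    response = requests.post(SIGNAL_URL, json={"state": state}, timeout=5)
    response.raise_for_status()

set_krkn_state("PAUSE")   # pause chaos while the load test ramps up
# ... wait for the test run to reach the desired state ...
set_krkn_state("STOP")    # stop the run once the test completes
```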
### Performance monitoring
Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to finding bottlenecks, as it is important to make sure the cluster is healthy in terms of both recovery and performance during and after the failure has been injected. Instructions on enabling it can be found [here](docs/performance_dashboards.md).
### SLOs validation during and post chaos
- In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile of Prometheus expressions to validate, alerts, and exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics.
- Kraken also provides the ability to check if any critical alerts are firing in the cluster post chaos and passes or fails the run accordingly.
Information on enabling and leveraging this feature can be found [here](docs/SLOs_validation.md)
### OCM / ACM integration
Kraken supports injecting faults into [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management) managed clusters through [ManagedCluster Scenarios](docs/managedcluster_scenarios.md).
Instructions on how to setup, configure and run Kraken can be found in the [documentation](https://krkn-chaos.dev/docs/).
### Blogs and other useful resources
@@ -107,6 +30,7 @@ Kraken supports injecting faults into [Open Cluster Management (OCM)](https://op
- Blog post on supercharging chaos testing using AI integration in Krkn: https://www.redhat.com/en/blog/supercharging-chaos-testing-using-ai
- Blog post announcing Krkn joining CNCF Sandbox: https://www.redhat.com/en/blog/krknchaos-joining-cncf-sandbox
### Roadmap
Enhancements being planned can be found in the [roadmap](ROADMAP.md).
@@ -114,17 +38,8 @@ Enhancements being planned can be found in the [roadmap](ROADMAP.md).
### Contributions
We are always looking for enhancements and fixes to make it better; any contributions are most welcome. Feel free to report or work on the issues filed on GitHub.
[More information on how to Contribute](docs/contribute.md)
[More information on how to Contribute](https://krkn-chaos.dev/docs/contribution-guidelines/contribute/)
If adding a new scenario or tweaking the main config, be sure to add updates to the CI so it stays current.
Please read [this file](CI/README.md#adding-a-test-case) for more information on updates.
### Scenario Plugin Development
If you're gearing up to develop new scenarios, take a moment to review our
[Scenario Plugin API Documentation](docs/scenario_plugin_api.md).
It's the perfect starting point to tap into your chaotic creativity!
### Community
Key Members(slack_usernames/full name): paigerube14/Paige Rubendall, mffiedler/Mike Fiedler, tsebasti/Tullio Sebastiani, yogi/Yogananth Subramanian, sahil/Sahil Shah, pradeep/Pradeep Surisetty and ravielluri/Naga Ravi Chaitanya Elluri.

SECURITY.md (new file)

@@ -0,0 +1,43 @@
# Security Policy
We attach great importance to code security and are very grateful to users, security researchers, and others who report security vulnerabilities to the Krkn community. All reported security vulnerabilities will be carefully assessed and addressed in a timely manner.
## Security Checks
Krkn leverages [Snyk](https://snyk.io/) to ensure that any security vulnerabilities found
in the code base and dependencies are fixed and published in the latest release. Security
vulnerability checks are enabled for each pull request to enable developers to get insights
and proactively fix them.
## Reporting a Vulnerability
The Krkn project treats security vulnerabilities seriously, so we
strive to take action quickly when required.
The project requests that security issues be disclosed in a responsible
manner to allow adequate time to respond. If a security issue or
vulnerability has been found, please disclose the details to our
dedicated email address:
cncf-krkn-maintainers@lists.cncf.io
You can also use the [GitHub vulnerability report mechanism](https://docs.github.com/en/code-security/security-advisories/guidance-on-reporting-and-writing-information-about-vulnerabilities/privately-reporting-a-security-vulnerability#privately-reporting-a-security-vulnerability) to report the security vulnerability.
Please include as much information as possible with the report. The
following details assist with analysis efforts:
- Description of the vulnerability
- Affected component (version, commit, branch, etc.)
- Affected code (file path, line numbers)
- Exploit code
## Security Team
The security team currently consists of the [Maintainers of Krkn](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md).
## Process and Supported Releases
The Krkn security team will investigate and provide a fix in a timely manner, depending on the severity. The fix will be included in the next release of Krkn, and details will be included in the release notes.

config/config.yaml

@@ -45,6 +45,8 @@ kraken:
- scenarios/kube/service_hijacking.yaml
- syn_flood_scenarios:
- scenarios/kube/syn_flood.yaml
- network_chaos_ng_scenarios:
- scenarios/kube/network-filter.yml
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed

config/metrics-aggregated.yaml

@@ -1,133 +1,126 @@
metrics:
# API server
- query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb!~"WATCH", subresource!="log"}[2m])) by (verb,resource,subresource,instance,le)) > 0
metricName: API99thLatency
- query: sum(irate(apiserver_request_total{apiserver="kube-apiserver",verb!="WATCH",subresource!="log"}[2m])) by (verb,instance,resource,code) > 0
metricName: APIRequestRate
instant: True
- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
metricName: APIInflightRequests
instant: True
- query: histogram_quantile(0.99, rate(apiserver_current_inflight_requests[5m]))
metricName: APIInflightRequests
instant: True
# Container & pod metrics
- query: (sum(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(etcd|oauth-apiserver|.*apiserver|ovn-kubernetes|sdn|ingress|authentication|.*controller-manager|.*scheduler)"}) by (container, pod, namespace, node) and on (node) kube_node_role{role="master"}) > 0
metricName: containerMemory-Masters
instant: true
- query: (sum(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(etcd|oauth-apiserver|sdn|ovn-kubernetes|.*apiserver|authentication|.*controller-manager|.*scheduler)"}[2m]) * 100) by (container, pod, namespace, node) and on (node) kube_node_role{role="master"}) > 0
metricName: containerCPU-Masters
instant: true
- query: (sum(irate(container_cpu_usage_seconds_total{pod!="",container="prometheus",namespace="openshift-monitoring"}[2m]) * 100) by (container, pod, namespace, node) and on (node) kube_node_role{role="infra"}) > 0
metricName: containerCPU-Prometheus
instant: true
- query: (avg(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress)"}[2m]) * 100 and on (node) kube_node_role{role="worker"}) by (namespace, container)) > 0
metricName: containerCPU-AggregatedWorkers
instant: true
- query: (avg(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress|monitoring|image-registry|logging)"}[2m]) * 100 and on (node) kube_node_role{role="infra"}) by (namespace, container)) > 0
metricName: containerCPU-AggregatedInfra
- query: (sum(container_memory_rss{pod!="",namespace="openshift-monitoring",name!="",container="prometheus"}) by (container, pod, namespace, node) and on (node) kube_node_role{role="infra"}) > 0
metricName: containerMemory-Prometheus
instant: True
- query: avg(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress)"} and on (node) kube_node_role{role="worker"}) by (container, namespace)
metricName: containerMemory-AggregatedWorkers
instant: True
- query: avg(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress|monitoring|image-registry|logging)"} and on (node) kube_node_role{role="infra"}) by (container, namespace)
metricName: containerMemory-AggregatedInfra
instant: True
# Node metrics
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) > 0
metricName: nodeCPU-Masters
instant: True
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: maxCPU-Masters
instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Masters
instant: true
- query: (avg((sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))) by (mode)) > 0
metricName: nodeCPU-AggregatedWorkers
instant: True
- query: (avg((sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))) by (mode)) > 0
metricName: nodeCPU-AggregatedInfra
instant: True
- query: avg(node_memory_MemAvailable_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeMemoryAvailable-Masters
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Masters
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: maxMemory-Masters
instant: true
- query: avg(node_memory_MemAvailable_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryAvailable-AggregatedWorkers
instant: True
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: maxCPU-Workers
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: maxMemory-Workers
instant: true
- query: avg(node_memory_MemAvailable_bytes and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryAvailable-AggregatedInfra
instant: True
- query: avg(node_memory_Active_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeMemoryActive-Masters
instant: True
- query: avg(node_memory_Active_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryActive-AggregatedWorkers
instant: True
- query: avg(avg(node_memory_Active_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryActive-AggregatedInfra
- query: avg(node_memory_Cached_bytes) by (instance) + avg(node_memory_Buffers_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeMemoryCached+nodeMemoryBuffers-Masters
- query: avg(node_memory_Cached_bytes + node_memory_Buffers_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryCached+nodeMemoryBuffers-AggregatedWorkers
- query: avg(node_memory_Cached_bytes + node_memory_Buffers_bytes and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryCached+nodeMemoryBuffers-AggregatedInfra
- query: irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: rxNetworkBytes-Masters
- query: avg(irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: rxNetworkBytes-AggregatedWorkers
- query: avg(irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: rxNetworkBytes-AggregatedInfra
- query: irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: txNetworkBytes-Masters
- query: avg(irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: txNetworkBytes-AggregatedWorkers
- query: avg(irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: txNetworkBytes-AggregatedInfra
- query: rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeDiskWrittenBytes-Masters
- query: avg(rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskWrittenBytes-AggregatedWorkers
- query: avg(rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskWrittenBytes-AggregatedInfra
- query: rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeDiskReadBytes-Masters
- query: avg(rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskReadBytes-AggregatedWorkers
- query: avg(rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskReadBytes-AggregatedInfra
instant: True
# Etcd metrics
- query: sum(rate(etcd_server_leader_changes_seen_total[2m]))
metricName: etcdLeaderChangesRate
instant: True
- query: etcd_server_is_leader > 0
metricName: etcdServerIsLeader
instant: True
- query: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[2m]))
metricName: 99thEtcdDiskBackendCommitDurationSeconds
instant: True
- query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
metricName: 99thEtcdDiskWalFsyncDurationSeconds
instant: True
- query: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
metricName: 99thEtcdRoundTripTimeSeconds
- query: etcd_mvcc_db_total_size_in_bytes
metricName: etcdDBPhysicalSizeBytes
- query: etcd_mvcc_db_total_size_in_use_in_bytes
metricName: etcdDBLogicalSizeBytes
instant: True
- query: sum by (cluster_version)(etcd_cluster_version)
metricName: etcdVersion
@@ -135,83 +128,16 @@ metrics:
- query: sum(rate(etcd_object_counts{}[5m])) by (resource) > 0
metricName: etcdObjectCount
instant: True
- query: histogram_quantile(0.99,sum(rate(etcd_request_duration_seconds_bucket[2m])) by (le,operation,apiserver)) > 0
metricName: P99APIEtcdRequestLatency
- query: sum(grpc_server_started_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})
metricName: ActiveWatchStreams
- query: sum(grpc_server_started_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})
metricName: ActiveLeaseStreams
- query: sum(rate(etcd_debugging_snap_save_total_duration_seconds_sum{namespace="openshift-etcd"}[2m]))
metricName: snapshotSaveLatency
- query: sum(rate(etcd_server_heartbeat_send_failures_total{namespace="openshift-etcd"}[2m]))
metricName: HeartBeatFailures
- query: sum(rate(etcd_server_health_failures{namespace="openshift-etcd"}[2m]))
metricName: HealthFailures
- query: sum(rate(etcd_server_slow_apply_total{namespace="openshift-etcd"}[2m]))
metricName: SlowApplies
- query: sum(rate(etcd_server_slow_read_indexes_total{namespace="openshift-etcd"}[2m]))
metricName: SlowIndexRead
- query: sum(etcd_server_proposals_pending)
metricName: PendingProposals
- query: histogram_quantile(1.0, sum(rate(etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket[1m])) by (le, instance))
metricName: CompactionMaxPause
instant: True
- query: sum by (instance) (apiserver_storage_objects)
metricName: etcdTotalObjectCount
instant: True
- query: topk(500, max by(resource) (apiserver_storage_objects))
metricName: etcdTopObectCount
# Cluster metrics
- query: count(kube_namespace_created)
metricName: namespaceCount
- query: sum(kube_pod_status_phase{}) by (phase)
metricName: podStatusCount
- query: count(kube_secret_info{})
metricName: secretCount
- query: count(kube_deployment_labels{})
metricName: deploymentCount
- query: count(kube_configmap_info{})
metricName: configmapCount
- query: count(kube_service_info{})
metricName: serviceCount
- query: kube_node_role
metricName: nodeRoles
instant: true
- query: sum(kube_node_status_condition{status="true"}) by (condition)
metricName: nodeStatus
- query: (sum(rate(container_fs_writes_bytes_total{container!="",device!~".+dm.+"}[5m])) by (device, container, node) and on (node) kube_node_role{role="master"}) > 0
metricName: containerDiskUsage
- query: cluster_version{type="completed"}
metricName: clusterVersion
instant: true
# Golang metrics
- query: go_memstats_heap_alloc_bytes{job=~"apiserver|api|etcd"}
metricName: goHeapAllocBytes
- query: go_memstats_heap_inuse_bytes{job=~"apiserver|api|etcd"}
metricName: goHeapInuseBytes
- query: go_gc_duration_seconds{job=~"apiserver|api|etcd",quantile="1"}
metricName: goGCDurationSeconds
instant: True

config/metrics.yaml

@@ -27,8 +27,17 @@ metrics:
metricName: crioMemory
# Node metrics
- query: sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) > 0
metricName: nodeCPU
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) > 0
metricName: nodeCPU-Masters
- query: (avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Masters
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) > 0
metricName: nodeCPU-Workers
- query: (avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[2m:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Workers
- query: avg(node_memory_MemAvailable_bytes) by (instance)
metricName: nodeMemoryAvailable
@@ -36,6 +45,9 @@ metrics:
- query: avg(node_memory_Active_bytes) by (instance)
metricName: nodeMemoryActive
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: maxMemory-Masters
- query: avg(node_memory_Cached_bytes) by (instance) + avg(node_memory_Buffers_bytes) by (instance)
metricName: nodeMemoryCached+nodeMemoryBuffers
@@ -78,34 +90,4 @@ metrics:
- query: sum by (cluster_version)(etcd_cluster_version)
metricName: etcdVersion
instant: true
# Cluster metrics
- query: count(kube_namespace_created)
metricName: namespaceCount
- query: sum(kube_pod_status_phase{}) by (phase)
metricName: podStatusCount
- query: count(kube_secret_info{})
metricName: secretCount
- query: count(kube_deployment_labels{})
metricName: deploymentCount
- query: count(kube_configmap_info{})
metricName: configmapCount
- query: count(kube_service_info{})
metricName: serviceCount
- query: kube_node_role
metricName: nodeRoles
instant: true
- query: sum(kube_node_status_condition{status="true"}) by (condition)
metricName: nodeStatus
- query: cluster_version{type="completed"}
metricName: clusterVersion
instant: true

containers/Dockerfile

@@ -1,10 +1,10 @@
# oc build
FROM golang:1.23.0 AS oc-build
FROM golang:1.23.1 AS oc-build
RUN apt-get update && apt-get install -y --no-install-recommends libkrb5-dev
WORKDIR /tmp
RUN git clone --branch release-4.18 https://github.com/openshift/oc.git
WORKDIR /tmp/oc
RUN go mod edit -go 1.23.0 &&\
RUN go mod edit -go 1.23.1 &&\
go get github.com/moby/buildkit@v0.12.5 &&\
go get github.com/containerd/containerd@v1.7.11&&\
go get github.com/docker/docker@v25.0.6&&\
@@ -25,10 +25,6 @@ RUN dnf update -y
ENV KUBECONFIG /home/krkn/.kube/config
# install kubectl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" &&\
cp kubectl /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl &&\
cp kubectl /usr/bin/kubectl && chmod +x /usr/bin/kubectl
# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
RUN dnf update && dnf install -y --setopt=install_weak_deps=False \
@@ -50,7 +46,8 @@ RUN if [ -n "$PR_NUMBER" ]; then git fetch origin pull/${PR_NUMBER}/head:pr-${PR
# if it is a TAG trigger checkout the tag
RUN if [ -n "$TAG" ]; then git checkout "$TAG";fi
RUN python3.9 -m ensurepip
RUN python3.9 -m ensurepip --upgrade --default-pip
RUN python3.9 -m pip install --upgrade pip setuptools==70.0.0
RUN pip3.9 install -r requirements.txt
RUN pip3.9 install jsonschema

krknctl-input.json

@@ -101,12 +101,21 @@
{
"name": "alerts-path",
"short_description": "Cluster alerts path file (in container)",
"description": "Enables cluster alerts check",
"description": "Allows to specify a different alert file path",
"variable": "ALERTS_PATH",
"type": "string",
"default": "config/alerts.yaml",
"required": "false"
},
{
"name": "metrics-path",
"short_description": "Cluster metrics path file (in container)",
"description": "Allows to specify a different metrics file path",
"variable": "METRICS_PATH",
"type": "string",
"default": "config/metrics-aggregated.yaml",
"required": "false"
},
{
"name": "enable-es",
"short_description": "Enables elastic search data collection",
@@ -358,6 +367,60 @@
"default": "True",
"required": "false"
},
{
"name": "health-check-interval",
"short_description": "Heath check interval",
"description": "How often to check the health check urls",
"variable": "HEALTH_CHECK_INTERVAL",
"type": "number",
"default": "2",
"required": "false"
},
{
"name": "health-check-url",
"short_description": "Health check url",
"description": "Url to check the health of",
"variable": "HEALTH_CHECK_URL",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-auth",
"short_description": "Health check authentication tuple",
"description": "Authentication tuple to authenticate into health check URL",
"variable": "HEALTH_CHECK_AUTH",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-bearer-token",
"short_description": "Health check bearer token",
"description": "Bearer token to authenticate into health check URL",
"variable": "HEALTH_CHECK_BEARER_TOKEN",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-exit",
"short_description": "Health check exit on failure",
"description": "Exit on failure when health check URL is not able to connect",
"variable": "HEALTH_CHECK_EXIT_ON_FAILURE",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-verify",
"short_description": "SSL Verification of health check url",
"description": "SSL Verification to authenticate into health check URL",
"variable": "HEALTH_CHECK_VERIFY",
"type": "string",
"default": "false",
"required": "false"
},
{
"name": "krkn-debug",
"short_description": "Krkn debug mode",
@@ -369,4 +432,4 @@
"default": "False",
"required": "false"
}
]
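Taken together, these flags describe a periodic probe of one or more URLs. A minimal sketch of how the environment variables above might drive such a check (the loop and parsing are assumptions for illustration, not krkn's actual implementation):

```python
import os
import sys
import time

import requests

# Read the health check settings from the environment variables defined above.
url = os.getenv("HEALTH_CHECK_URL", "")
interval = float(os.getenv("HEALTH_CHECK_INTERVAL", "2"))
auth = os.getenv("HEALTH_CHECK_AUTH", "")            # e.g. "user:password"
token = os.getenv("HEALTH_CHECK_BEARER_TOKEN", "")
verify = os.getenv("HEALTH_CHECK_VERIFY", "false").lower() == "true"
exit_on_failure = os.getenv("HEALTH_CHECK_EXIT_ON_FAILURE", "").lower() == "true"

headers = {"Authorization": f"Bearer {token}"} if token else {}
basic_auth = tuple(auth.split(":", 1)) if auth else None

while url:  # poll forever (sketch); krkn ties this to the chaos run lifetime
    try:
        response = requests.get(url, auth=basic_auth, headers=headers,
                                verify=verify, timeout=10)
        print(f"{url} -> {response.status_code}")
    except requests.RequestException as exc:
        print(f"health check failed: {exc}")
        if exit_on_failure:
            sys.exit(1)
    time.sleep(interval)
```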

docs/hog_scenarios.md

@@ -18,6 +18,8 @@ these workloads will use a predetermined amount of resources for a specified dur
|`namespace`| string | the namespace where the stress workload will be deployed |
|`node-selector`| string (Optional) | defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. |
|`number-of-nodes`| number (Optional) | restricts the number of selected nodes by the selector|
|`taints`| list (Optional), default `[]` | list of taints for which tolerations need to be created, e.g. `["node-role.kubernetes.io/master:NoSchedule"]` (see the sketch below this table)|
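The taint strings take the form `key:effect`. As a hedged sketch (the helper below is illustrative, not krkn's internal code), such a string could be translated into a pod toleration like this:

```python
def taint_to_toleration(taint: str) -> dict:
    """Translate a 'key:effect' taint string into a pod toleration dict."""
    key, effect = taint.rsplit(":", 1)
    # operator "Exists" tolerates the key regardless of its value
    return {"key": key, "operator": "Exists", "effect": effect}

print(taint_to_toleration("node-role.kubernetes.io/master:NoSchedule"))
# {'key': 'node-role.kubernetes.io/master', 'operator': 'Exists', 'effect': 'NoSchedule'}
```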
#### `cpu-hog` options


@@ -33,7 +33,7 @@ class HogsScenarioPlugin(AbstractScenarioPlugin):
else:
node_selector = scenario_config.node_selector
available_nodes = lib_telemetry.get_lib_kubernetes().list_schedulable_nodes(node_selector)
available_nodes = lib_telemetry.get_lib_kubernetes().list_nodes(node_selector)
if len(available_nodes) == 0:
raise Exception("no available nodes to schedule workload")


@@ -49,6 +49,7 @@ class NativeScenarioPlugin(AbstractScenarioPlugin):
return [
"pod_disruption_scenarios",
"pod_network_scenarios",
"ingress_node_scenarios"
]
def start_monitoring(self, pool: PodsMonitorPool, scenarios: list[Any]):


@@ -97,15 +97,6 @@ class NetworkScenarioConfig:
},
)
kraken_config: typing.Optional[str] = field(
default="",
metadata={
"name": "Kraken Config",
"description": "Path to the config file of Kraken. "
"Set this field if you wish to publish status onto Cerberus",
},
)
@dataclass
class NetworkScenarioSuccessOutput:
@@ -710,6 +701,7 @@ def network_chaos(
pod_module_template = env.get_template("pod_module.j2")
cli, batch_cli = kube_helper.setup_kubernetes(cfg.kubeconfig_path)
logging.info("Starting Ingress Network Chaos")
try:
node_interface_dict = get_node_interfaces(
cfg.node_interface_name,
@@ -721,16 +713,6 @@ def network_chaos(
except Exception:
return "error", NetworkScenarioErrorOutput(format_exc())
job_list = []
publish = False
if cfg.kraken_config:
failed_post_scenarios = ""
try:
with open(cfg.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error("Error reading Kraken config from %s" % cfg.kraken_config)
return "error", NetworkScenarioErrorOutput(format_exc())
publish = True
try:
if cfg.execution_type == "parallel":
@@ -747,13 +729,7 @@ def network_chaos(
)
)
logging.info("Waiting for parallel job to finish")
start_time = int(time.time())
wait_for_job(batch_cli, job_list[:], cfg.test_duration + 100)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
)
elif cfg.execution_type == "serial":
create_interfaces = True
@@ -773,18 +749,12 @@ def network_chaos(
)
)
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
wait_for_job(batch_cli, job_list[:], cfg.test_duration + 100)
logging.info("Deleting jobs")
delete_jobs(cli, batch_cli, job_list[:])
job_list = []
logging.info("Waiting for wait_duration : %ss" % cfg.wait_duration)
time.sleep(cfg.wait_duration)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
)
create_interfaces = False
else:
@@ -799,7 +769,7 @@ def network_chaos(
execution_type=cfg.execution_type,
)
except Exception as e:
logging.error("Network Chaos exiting due to Exception - %s" % e)
logging.error("Ingress Network Chaos exiting due to Exception - %s" % e)
return "error", NetworkScenarioErrorOutput(format_exc())
finally:
delete_virtual_interfaces(cli, node_interface_dict.keys(), pod_module_template)

krkn/scenario_plugins/network_chaos_ng/models.py (new file)

@@ -0,0 +1,41 @@
from dataclasses import dataclass
from enum import Enum
class NetworkChaosScenarioType(Enum):
Node = 1
Pod = 2
@dataclass
class BaseNetworkChaosConfig:
supported_execution = ["serial", "parallel"]
id: str
wait_duration: int
test_duration: int
label_selector: str
instance_count: int
execution: str
namespace: str
def validate(self) -> list[str]:
errors = []
if self.execution is None:
errors.append(f"execution cannot be None, supported values are: {','.join(self.supported_execution)}")
if self.execution not in self.supported_execution:
errors.append(f"{self.execution} is not in supported execution mod: {','.join(self.supported_execution)}")
if self.label_selector is None:
errors.append("label_selector cannot be None")
return errors
@dataclass
class NetworkFilterConfig(BaseNetworkChaosConfig):
ingress: bool
egress: bool
interfaces: list[str]
target: str
ports: list[int]
def validate(self) -> list[str]:
errors = super().validate()
# here further validations
return errors

krkn/scenario_plugins/network_chaos_ng/modules/abstract_network_chaos_module.py (new file)

@@ -0,0 +1,58 @@
import abc
import logging
import queue
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn.scenario_plugins.network_chaos_ng.models import BaseNetworkChaosConfig, NetworkChaosScenarioType
class AbstractNetworkChaosModule(abc.ABC):
"""
The abstract class that needs to be implemented by each Network Chaos Scenario
"""
@abc.abstractmethod
def run(self, target: str, kubecli: KrknTelemetryOpenshift, error_queue: queue.Queue = None):
"""
the entrypoint method for the Network Chaos Scenario
:param target: The resource name that will be targeted by the scenario (Node Name, Pod Name etc.)
:param kubecli: The `KrknTelemetryOpenshift` needed by the scenario to access the krkn-lib methods
:param error_queue: A queue that will be used by the plugin to push the errors raised during the execution of parallel modules
"""
pass
@abc.abstractmethod
def get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig):
"""
returns the common subset of settings shared by all the scenarios `BaseNetworkChaosConfig` and the type of Network
Chaos Scenario that is running (Pod Scenario or Node Scenario)
"""
pass
def log_info(self, message: str, parallel: bool = False, node_name: str = ""):
"""
log helper method for INFO severity to be used in the scenarios
"""
if parallel:
logging.info(f"[{node_name}]: {message}")
else:
logging.info(message)
def log_warning(self, message: str, parallel: bool = False, node_name: str = ""):
"""
log helper method for WARNING severity to be used in the scenarios
"""
if parallel:
logging.warning(f"[{node_name}]: {message}")
else:
logging.warning(message)
def log_error(self, message: str, parallel: bool = False, node_name: str = ""):
"""
log helper method for ERROR severity to be used in the scenarios
"""
if parallel:
logging.error(f"[{node_name}]: {message}")
else:
logging.error(message)
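Concrete modules implement `run` and `get_config`. Building on the imports above, a minimal skeleton of a custom module (the class is illustrative, not an existing krkn module):

```python
class NoopNetworkChaosModule(AbstractNetworkChaosModule):
    """Illustrative no-op module showing the required interface."""

    def __init__(self, config: BaseNetworkChaosConfig):
        self.config = config

    def run(self, target: str, kubecli: KrknTelemetryOpenshift, error_queue: queue.Queue = None):
        # parallel runs pass an error queue; serial runs do not
        parallel = error_queue is not None
        self.log_info(f"would inject chaos on {target}", parallel, target)

    def get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig):
        return NetworkChaosScenarioType.Node, self.config
```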

krkn/scenario_plugins/network_chaos_ng/modules/node_network_filter.py (new file)

@@ -0,0 +1,136 @@
import os
import queue
import time
import yaml
from jinja2 import Environment, FileSystemLoader
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn_lib.utils import get_random_string
from krkn.scenario_plugins.network_chaos_ng.models import (
BaseNetworkChaosConfig,
NetworkFilterConfig,
NetworkChaosScenarioType,
)
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
AbstractNetworkChaosModule,
)
class NodeNetworkFilterModule(AbstractNetworkChaosModule):
config: NetworkFilterConfig
def run(
self,
target: str,
kubecli: KrknTelemetryOpenshift,
error_queue: queue.Queue = None,
):
parallel = False
if error_queue:
parallel = True
try:
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=True)
pod_name = f"node-filter-{get_random_string(5)}"
pod_template = env.get_template("templates/network-chaos.j2")
pod_body = yaml.safe_load(
pod_template.render(
pod_name=pod_name,
namespace=self.config.namespace,
host_network=True,
target=target,
)
)
self.log_info(
f"creating pod to filter "
f"ports {','.join([str(port) for port in self.config.ports])}, "
f"ingress:{str(self.config.ingress)}, "
f"egress:{str(self.config.egress)}",
parallel,
target,
)
kubecli.get_lib_kubernetes().create_pod(
pod_body, self.config.namespace, 300
)
if len(self.config.interfaces) == 0:
interfaces = [
self.get_default_interface(pod_name, self.config.namespace, kubecli)
]
self.log_info(f"detected default interface {interfaces[0]}")
else:
interfaces = self.config.interfaces
input_rules, output_rules = self.generate_rules(interfaces)
for rule in input_rules:
self.log_info(f"applying iptables INPUT rule: {rule}", parallel, target)
kubecli.get_lib_kubernetes().exec_cmd_in_pod(
[rule], pod_name, self.config.namespace
)
for rule in output_rules:
self.log_info(
f"applying iptables OUTPUT rule: {rule}", parallel, target
)
kubecli.get_lib_kubernetes().exec_cmd_in_pod(
[rule], pod_name, self.config.namespace
)
self.log_info(
f"waiting {self.config.test_duration} seconds before removing the iptables rules"
)
time.sleep(self.config.test_duration)
self.log_info("removing iptables rules")
for _ in input_rules:
# always delete the first rule, since rules are inserted at the top
kubecli.get_lib_kubernetes().exec_cmd_in_pod(
[f"iptables -D INPUT 1"], pod_name, self.config.namespace
)
for _ in output_rules:
# always delete the first rule, since rules are inserted at the top
kubecli.get_lib_kubernetes().exec_cmd_in_pod(
[f"iptables -D OUTPUT 1"], pod_name, self.config.namespace
)
self.log_info(
f"deleting network chaos pod {pod_name} from {self.config.namespace}"
)
kubecli.get_lib_kubernetes().delete_pod(pod_name, self.config.namespace)
except Exception as e:
if error_queue is None:
raise e
else:
error_queue.put(str(e))
def __init__(self, config: NetworkFilterConfig):
self.config = config
def get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig):
return NetworkChaosScenarioType.Node, self.config
def get_default_interface(
self, pod_name: str, namespace: str, kubecli: KrknTelemetryOpenshift
) -> str:
cmd = "ip r | grep default | awk '/default/ {print $5}'"
output = kubecli.get_lib_kubernetes().exec_cmd_in_pod(
[cmd], pod_name, namespace
)
return output.replace("\n", "")
def generate_rules(self, interfaces: list[str]) -> (list[str], list[str]):
input_rules = []
output_rules = []
for interface in interfaces:
for port in self.config.ports:
if self.config.egress:
output_rules.append(
f"iptables -I OUTPUT 1 -p tcp --dport {port} -m state --state NEW,RELATED,ESTABLISHED -j DROP"
)
if self.config.ingress:
input_rules.append(
f"iptables -I INPUT 1 -i {interface} -p tcp --dport {port} -m state --state NEW,RELATED,ESTABLISHED -j DROP"
)
return input_rules, output_rules
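For example, with a single interface, one port, and both directions enabled, the generator above produces one DROP rule per direction (continuing from the module above; the config values are arbitrary):

```python
cfg = NetworkFilterConfig(
    id="node_network_filter", wait_duration=30, test_duration=60,
    label_selector="", instance_count=1, execution="serial",
    namespace="default", ingress=True, egress=True,
    interfaces=["eth0"], target="", ports=[80],
)
input_rules, output_rules = NodeNetworkFilterModule(cfg).generate_rules(cfg.interfaces)
print(input_rules)
# ['iptables -I INPUT 1 -i eth0 -p tcp --dport 80 -m state --state NEW,RELATED,ESTABLISHED -j DROP']
print(output_rules)
# ['iptables -I OUTPUT 1 -p tcp --dport 80 -m state --state NEW,RELATED,ESTABLISHED -j DROP']
```

Inserting with `-I ... 1` pushes each rule to the top of the chain, which is why the teardown above always deletes rule 1 repeatedly.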

krkn/scenario_plugins/network_chaos_ng/modules/templates/network-chaos.j2 (new file)

@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
name: {{pod_name}}
namespace: {{namespace}}
spec:
{% if host_network %}
hostNetwork: true
{%endif%}
nodeSelector:
kubernetes.io/hostname: {{target}}
containers:
- name: fedora
imagePullPolicy: Always
image: quay.io/krkn-chaos/krkn-network-chaos:latest
securityContext:
privileged: true

krkn/scenario_plugins/network_chaos_ng/network_chaos_factory.py (new file)

@@ -0,0 +1,24 @@
from krkn.scenario_plugins.network_chaos_ng.models import NetworkFilterConfig
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import AbstractNetworkChaosModule
from krkn.scenario_plugins.network_chaos_ng.modules.node_network_filter import NodeNetworkFilterModule
supported_modules = ["node_network_filter"]
class NetworkChaosFactory:
@staticmethod
def get_instance(config: dict[str, str]) -> AbstractNetworkChaosModule:
if config["id"] is None:
raise Exception("network chaos id cannot be None")
if config["id"] not in supported_modules:
raise Exception(f"{config['id']} is not a supported network chaos module")
if config["id"] == "node_network_filter":
config = NetworkFilterConfig(**config)
errors = config.validate()
if len(errors) > 0:
raise Exception(f"config validation errors: [{';'.join(errors)}]")
return NodeNetworkFilterModule(config)
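A hedged usage sketch of the factory, passing the kind of dict it expects (values are illustrative; in practice the dict comes from the scenario YAML file):

```python
config = {
    "id": "node_network_filter",
    "wait_duration": 30,
    "test_duration": 60,
    "label_selector": "node-role.kubernetes.io/worker",
    "instance_count": 1,
    "execution": "serial",
    "namespace": "default",
    "ingress": True,
    "egress": False,
    "interfaces": [],
    "target": "",
    "ports": [80, 443],
}

module = NetworkChaosFactory.get_instance(config)
# module is a NodeNetworkFilterModule, validated and ready to run per target.
```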


@@ -0,0 +1,116 @@
import logging
import queue
import random
import threading
import time
import yaml
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
from krkn.scenario_plugins.network_chaos_ng.models import (
NetworkChaosScenarioType,
BaseNetworkChaosConfig,
)
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
AbstractNetworkChaosModule,
)
from krkn.scenario_plugins.network_chaos_ng.network_chaos_factory import (
NetworkChaosFactory,
)
class NetworkChaosNgScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
try:
with open(scenario, "r") as file:
scenario_config = yaml.safe_load(file)
if not isinstance(scenario_config, list):
logging.error(
"network chaos scenario config must be a list of objects"
)
return 1
for config in scenario_config:
network_chaos = NetworkChaosFactory.get_instance(config)
network_chaos_config = network_chaos.get_config()
logging.info(
f"running network_chaos scenario: {network_chaos_config[1].id}"
)
if network_chaos_config[0] == NetworkChaosScenarioType.Node:
targets = lib_telemetry.get_lib_kubernetes().list_nodes(
network_chaos_config[1].label_selector
)
else:
targets = lib_telemetry.get_lib_kubernetes().list_pods(
network_chaos_config[1].namespace,
network_chaos_config[1].label_selector,
)
if len(targets) == 0:
logging.warning(
f"no targets found for {network_chaos_config[1].id} "
f"network chaos scenario with selector {network_chaos_config[1].label_selector} "
f"with target type {network_chaos_config[0]}"
)
if network_chaos_config[1].instance_count != 0 and network_chaos_config[1].instance_count < len(targets):
targets = random.sample(targets, network_chaos_config[1].instance_count)
if network_chaos_config[1].execution == "parallel":
self.run_parallel(targets, network_chaos, lib_telemetry)
else:
self.run_serial(targets, network_chaos, lib_telemetry)
if len(scenario_config) > 1:
logging.info(f"waiting {network_chaos_config[1].wait_duration} seconds before running the next "
f"Network Chaos NG Module")
time.sleep(network_chaos_config[1].wait_duration)
except Exception as e:
logging.error(str(e))
return 1
return 0
def run_parallel(
self,
targets: list[str],
module: AbstractNetworkChaosModule,
lib_telemetry: KrknTelemetryOpenshift,
):
error_queue = queue.Queue()
threads = []
errors = []
for target in targets:
thread = threading.Thread(
target=module.run, args=[target, lib_telemetry, error_queue]
)
thread.start()
threads.append(thread)
for thread in threads:
thread.join()
while True:
try:
errors.append(error_queue.get_nowait())
except queue.Empty:
break
if len(errors) > 0:
raise Exception(
f"module {module.get_config()[1].id} execution failed: [{';'.join(errors)}]"
)
def run_serial(
self,
targets: list[str],
module: AbstractNetworkChaosModule,
lib_telemetry: KrknTelemetryOpenshift,
):
for target in targets:
module.run(target, lib_telemetry)
def get_scenario_types(self) -> list[str]:
return ["network_chaos_ng_scenarios"]

krkn/scenario_plugins/node_actions/gcp_node_scenarios.py

@@ -224,7 +224,6 @@ class gcp_node_scenarios(abstract_node_scenarios):
def __init__(self, kubecli: KrknKubernetes, affected_nodes_status: AffectedNodeStatus):
super().__init__(kubecli, affected_nodes_status)
self.gcp = GCP()
print("selfkeys" + str(vars(self)))
# Node scenario to start the node
def node_start_scenario(self, instance_kill_count, node, timeout):

View File

@@ -14,8 +14,7 @@ class OPENSTACKCLOUD:
self.Wait = 30
# Get the instance ID of the node
-def get_instance_id(self, node):
-openstack_node_ip = nodeaction.get_node_ip(node)
+def get_instance_id(self, openstack_node_ip):
openstack_node_name = self.get_openstack_nodename(openstack_node_ip)
return openstack_node_name
@@ -128,7 +127,8 @@ class openstack_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_start_scenario injection")
logging.info("Starting the node %s" % (node))
-openstack_node_name = self.openstackcloud.get_instance_id(node)
+openstack_node_ip = self.kubecli.get_node_ip(node)
+openstack_node_name = self.openstackcloud.get_instance_id(openstack_node_ip)
self.openstackcloud.start_instances(openstack_node_name)
self.openstackcloud.wait_until_running(openstack_node_name, timeout, affected_node)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
@@ -151,7 +151,8 @@ class openstack_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_stop_scenario injection")
logging.info("Stopping the node %s " % (node))
-openstack_node_name = self.openstackcloud.get_instance_id(node)
+openstack_node_ip = self.kubecli.get_node_ip(node)
+openstack_node_name = self.openstackcloud.get_instance_id(openstack_node_ip)
self.openstackcloud.stop_instances(openstack_node_name)
self.openstackcloud.wait_until_stopped(openstack_node_name, timeout, affected_node)
logging.info("Node with instance name: %s is in stopped state" % (node))
@@ -173,7 +174,8 @@ class openstack_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_reboot_scenario injection")
logging.info("Rebooting the node %s" % (node))
-openstack_node_name = self.openstackcloud.get_instance_id(node)
+openstack_node_ip = self.kubecli.get_node_ip(node)
+openstack_node_name = self.openstackcloud.get_instance_id(openstack_node_ip)
self.openstackcloud.reboot_instances(openstack_node_name)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
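
The same three-step lookup now repeats in each scenario: resolve the node IP through krkn-lib, map the IP to an OpenStack instance name, then act on the instance. A condensed sketch of that flow, assuming an already-constructed KrknKubernetes client (kubecli) and OPENSTACKCLOUD instance (openstackcloud); the node name is illustrative:

    node = "worker-0"  # hypothetical node name
    openstack_node_ip = kubecli.get_node_ip(node)  # krkn-lib call replacing the old kubectl-based nodeaction helper
    openstack_node_name = openstackcloud.get_instance_id(openstack_node_ip)
    openstackcloud.reboot_instances(openstack_node_name)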

View File

@@ -2,15 +2,21 @@ import logging
import time
import yaml
from multiprocessing.pool import ThreadPool
from itertools import repeat
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.models.k8s import AffectedNodeStatus
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn_lib.utils import log_exception
from krkn import utils
from krkn_lib.utils import get_yaml_item_value
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
from krkn.scenario_plugins.native.network import cerberus
from krkn.scenario_plugins.node_actions.aws_node_scenarios import AWS
from krkn.scenario_plugins.node_actions.gcp_node_scenarios import gcp_node_scenarios
class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
def run(
@@ -25,92 +31,138 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
with open(scenario, "r") as f:
zone_outage_config_yaml = yaml.full_load(f)
scenario_config = zone_outage_config_yaml["zone_outage"]
vpc_id = scenario_config["vpc_id"]
subnet_ids = scenario_config["subnet_id"]
duration = scenario_config["duration"]
cloud_type = scenario_config["cloud_type"]
# Add support for user-provided default network ACL
default_acl_id = scenario_config.get("default_acl_id")
ids = {}
acl_ids_created = []
if cloud_type.lower() == "aws":
cloud_object = AWS()
else:
logging.error(
"ZoneOutageScenarioPlugin Cloud type %s is not currently supported for "
"zone outage scenarios" % cloud_type
)
return 1
start_time = int(time.time())
for subnet_id in subnet_ids:
logging.info("Targeting subnet_id")
network_association_ids = []
associations, original_acl_id = cloud_object.describe_network_acls(
vpc_id, subnet_id
)
for entry in associations:
if entry["SubnetId"] == subnet_id:
network_association_ids.append(
entry["NetworkAclAssociationId"]
)
logging.info(
"Network association ids associated with "
"the subnet %s: %s" % (subnet_id, network_association_ids)
)
# Use provided default ACL if available, otherwise create a new one
if default_acl_id:
acl_id = default_acl_id
logging.info(
"Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
default_acl_id
)
# Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
if cloud_type.lower() == "aws":
self.cloud_object = AWS()
self.network_based_zone(scenario_config)
else:
kubecli = lib_telemetry.get_lib_kubernetes()
if cloud_type.lower() == "gcp":
affected_nodes_status = AffectedNodeStatus()
self.cloud_object = gcp_node_scenarios(kubecli, affected_nodes_status)
self.node_based_zone(scenario_config, kubecli)
affected_nodes_status = self.cloud_object.affected_nodes_status
scenario_telemetry.affected_nodes.extend(affected_nodes_status.affected_nodes)
else:
acl_id = cloud_object.create_default_network_acl(vpc_id)
logging.info("Created new default ACL %s", acl_id)
acl_ids_created.append(acl_id)
new_association_id = cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
# capture the original_acl_id, created_acl_id and
# new association_id to use during the recovery
ids[new_association_id] = original_acl_id
# wait for the specified duration
logging.info(
"Waiting for the specified duration " "in the config: %s" % duration
)
time.sleep(duration)
# replace the applied acl with the previous acl in use
for new_association_id, original_acl_id in ids.items():
cloud_object.replace_network_acl_association(
new_association_id, original_acl_id
)
logging.info(
"Wating for 60 seconds to make sure " "the changes are in place"
)
time.sleep(60)
# delete the network acl created for the run
for acl_id in acl_ids_created:
cloud_object.delete_network_acl(acl_id)
logging.error(
"ZoneOutageScenarioPlugin Cloud type %s is not currently supported for "
"zone outage scenarios" % cloud_type
)
return 1
end_time = int(time.time())
cerberus.publish_kraken_status(krkn_config, [], start_time, end_time)
-except (RuntimeError, Exception):
+except (RuntimeError, Exception) as e:
logging.error(
f"ZoneOutageScenarioPlugin scenario {scenario} failed with exception: {e}"
)
return 1
else:
return 0
def node_based_zone(self, scenario_config: dict[str, any], kubecli: KrknKubernetes):
zone = scenario_config["zone"]
duration = get_yaml_item_value(scenario_config, "duration", 60)
timeout = get_yaml_item_value(scenario_config, "timeout", 180)
label_selector = f"topology.kubernetes.io/zone={zone}"
try:
# get list of nodes in zone/region
nodes = kubecli.list_killable_nodes(label_selector)
# stop nodes in parallel
pool = ThreadPool(processes=len(nodes))
pool.starmap(
self.cloud_object.node_stop_scenario,zip(repeat(1), nodes, repeat(timeout))
)
pool.close()
logging.info(
"Waiting for the specified duration " "in the config: %s" % duration
)
time.sleep(duration)
# start nodes in parallel
pool = ThreadPool(processes=len(nodes))
pool.starmap(
self.cloud_object.node_start_scenario,zip(repeat(1), nodes, repeat(timeout))
)
pool.close()
except Exception as e:
logging.info(
f"Node based zone outage scenario failed with exception: {e}"
)
return 1
else:
return 0
def network_based_zone(self, scenario_config: dict[str, any]):
vpc_id = scenario_config["vpc_id"]
subnet_ids = scenario_config["subnet_id"]
duration = scenario_config["duration"]
# Add support for user-provided default network ACL
default_acl_id = scenario_config.get("default_acl_id")
ids = {}
acl_ids_created = []
for subnet_id in subnet_ids:
logging.info("Targeting subnet_id")
network_association_ids = []
associations, original_acl_id = self.cloud_object.describe_network_acls(
vpc_id, subnet_id
)
for entry in associations:
if entry["SubnetId"] == subnet_id:
network_association_ids.append(
entry["NetworkAclAssociationId"]
)
logging.info(
"Network association ids associated with "
"the subnet %s: %s" % (subnet_id, network_association_ids)
)
# Use provided default ACL if available, otherwise create a new one
if default_acl_id:
acl_id = default_acl_id
logging.info(
"Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
default_acl_id
)
# Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
else:
acl_id = self.cloud_object.create_default_network_acl(vpc_id)
logging.info("Created new default ACL %s", acl_id)
acl_ids_created.append(acl_id)
new_association_id = self.cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
# capture the original_acl_id, created_acl_id and
# new association_id to use during the recovery
ids[new_association_id] = original_acl_id
# wait for the specified duration
logging.info(
"Waiting for the specified duration " "in the config: %s" % duration
)
time.sleep(duration)
# replace the applied acl with the previous acl in use
for new_association_id, original_acl_id in ids.items():
self.cloud_object.replace_network_acl_association(
new_association_id, original_acl_id
)
logging.info(
"Wating for 60 seconds to make sure " "the changes are in place"
)
time.sleep(60)
# delete the network acl created for the run
for acl_id in acl_ids_created:
self.cloud_object.delete_network_acl(acl_id)
def get_scenario_types(self) -> list[str]:
return ["zone_outages_scenarios"]

View File

@@ -11,9 +11,9 @@ class HealthChecker:
def __init__(self, iterations):
self.iterations = iterations
-def make_request(self, url, auth=None, headers=None):
+def make_request(self, url, auth=None, headers=None, verify=True):
response_data = {}
-response = requests.get(url, auth=auth, headers=headers)
+response = requests.get(url, auth=auth, headers=headers, verify=verify)
response_data["url"] = url
response_data["status"] = response.status_code == 200
response_data["status_code"] = response.status_code
@@ -26,18 +26,20 @@ class HealthChecker:
health_check_telemetry = []
health_check_tracker = {}
interval = health_check_config["interval"] if health_check_config["interval"] else 2
response_tracker = {config["url"]:True for config in health_check_config["config"]}
while self.current_iterations < self.iterations:
for config in health_check_config.get("config"):
auth, headers = None, None
verify_url = config["verify_url"] if "verify_url" in config else True
if config["url"]: url = config["url"]
if config["bearer_token"]:
bearer_token = "Bearer " + config["bearer_token"]
headers = {"Authorization": bearer_token}
if config["auth"]: auth = config["auth"]
response = self.make_request(url, auth, headers)
if config["auth"]: auth = tuple(config["auth"].split(','))
response = self.make_request(url, auth, headers, verify_url)
if response["status_code"] != 200:
if config["url"] not in health_check_tracker:

View File

@@ -15,7 +15,7 @@ google-cloud-compute==1.22.0
ibm_cloud_sdk_core==3.18.0
ibm_vpc==0.20.0
jinja2==3.1.6
-krkn-lib==5.0.0
+krkn-lib==5.0.1
lxml==5.1.0
kubernetes==28.1.0
numpy==1.26.4
@@ -30,7 +30,7 @@ python-openstackclient==6.5.0
requests==2.32.2
service_identity==24.1.0
PyYAML==6.0.1
-setuptools==70.0.0
+setuptools==78.1.1
werkzeug==3.0.6
wheel==0.42.0
zope.interface==5.4.0
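
To confirm the security-pinned packages actually resolved to the bumped versions after an install, a quick stdlib check works; this is a convenience sketch, not part of the diff:

    import importlib.metadata

    # verify the pinned packages resolved to the expected versions
    for pkg, expected in {"krkn-lib": "5.0.1", "setuptools": "78.1.1"}.items():
        print(pkg, importlib.metadata.version(pkg), "expected", expected)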

View File

@@ -7,3 +7,4 @@ cpu-load-percentage: 90
cpu-method: all
node-selector: "node-role.kubernetes.io/worker="
number-of-nodes: 2
+taints: [] # example ["node-role.kubernetes.io/master:NoSchedule"]
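
The taints list uses "key:effect" strings, which presumably map onto pod tolerations so the hog pods can schedule onto tainted nodes. A hypothetical parsing helper (not the plugin's actual code) illustrating that mapping:

    def taint_to_toleration(taint: str) -> dict:
        # "node-role.kubernetes.io/master:NoSchedule" -> toleration dict
        key, effect = taint.rsplit(":", 1)
        return {"key": key, "operator": "Exists", "effect": effect}

    print(taint_to_toleration("node-role.kubernetes.io/master:NoSchedule"))
    # {'key': 'node-role.kubernetes.io/master', 'operator': 'Exists', 'effect': 'NoSchedule'}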

View File

@@ -11,4 +11,5 @@ io-target-pod-volume:
hostPath:
path: /root # a path writable by kubelet in the root filesystem of the node
node-selector: "node-role.kubernetes.io/worker="
-number-of-nodes: ''
+number-of-nodes: ''
+taints: [] # example ["node-role.kubernetes.io/master:NoSchedule"]

View File

@@ -6,3 +6,4 @@ namespace: default
memory-vm-bytes: 90%
node-selector: "node-role.kubernetes.io/worker="
number-of-nodes: ''
+taints: [] # example ["node-role.kubernetes.io/master:NoSchedule"]

View File

@@ -0,0 +1,13 @@
- id: node_network_filter
wait_duration: 300
test_duration: 100
label_selector: "kubernetes.io/hostname=ip-10-0-39-182.us-east-2.compute.internal"
namespace: 'default'
instance_count: 1
execution: parallel
ingress: false
egress: true
target: node
interfaces: []
ports:
- 2049
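
Read against the plugin above: label_selector picks the target node, instance_count: 1 samples a single target when more match, and the module filters egress traffic on port 2049 (the NFS port) while leaving ingress alone. A hypothetical pod-scoped variant, assuming the module also accepts a pod target as the plugin's non-node branch (list_pods with namespace and label_selector) suggests:

    - id: node_network_filter
      wait_duration: 300
      test_duration: 100
      label_selector: "app=nfs-client"   # pod selector, resolved within the namespace
      namespace: 'default'
      instance_count: 1
      execution: serial
      ingress: false
      egress: true
      target: pod                        # assumption: switches the plugin to list_pods()
      interfaces: []
      ports:
        - 2049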

View File

@@ -1,49 +0,0 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: litmus-sa
namespace: litmus
labels:
name: litmus-sa
app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: litmus-sa
labels:
name: litmus-sa
app.kubernetes.io/part-of: litmus
rules:
- apiGroups: [""]
resources: ["pods","events"]
verbs: ["create","list","get","patch","update","delete","deletecollection"]
- apiGroups: [""]
resources: ["pods/exec","pods/log"]
verbs: ["list","get","create"]
- apiGroups: ["batch"]
resources: ["jobs"]
verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["litmuschaos.io"]
resources: ["chaosengines","chaosexperiments","chaosresults"]
verbs: ["create","list","get","patch","update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: litmus-sa
labels:
name: litmus-sa
app.kubernetes.io/part-of: litmus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: litmus-sa
subjects:
- kind: ServiceAccount
name: litmus-sa
namespace: litmus

View File

@@ -0,0 +1,4 @@
zone_outage: # Scenario to create an outage of a zone by tweaking network ACL
cloud_type: gcp # cloud type on which Kubernetes/OpenShift runs. aws (network-based outage) and gcp (node-based outage) are currently supported for this scenario.
duration: 600 # duration in seconds after which the zone will be back online
zone: <zone> # zone to target; nodes labeled topology.kubernetes.io/zone=<zone> are stopped for the duration, then restarted
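
To preview which nodes a gcp run would act on, the same selector node_based_zone builds can be queried directly; a sketch, assuming kubecli is a krkn-lib KrknKubernetes client and the zone value is illustrative:

    zone = "us-central1-a"  # hypothetical GCP zone
    label_selector = f"topology.kubernetes.io/zone={zone}"
    nodes = kubecli.list_killable_nodes(label_selector)  # same call node_based_zone uses
    print(nodes)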