Compare commits

394 Commits
v1.4.7 ... main

Author SHA1 Message Date
Paige Patton
4f305e78aa remove chaos ai
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-02-11 13:44:13 -05:00
dependabot[bot]
b17e933134 Bump pillow from 10.3.0 to 12.1.1 in /utils/chaos_ai (#1157)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.3.0 to 12.1.1.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/10.3.0...12.1.1)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.1.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-11 10:08:42 -05:00
Paige Patton
beea484597 adding vm ware tests (#1133)
Signed-off-by: Paige Patton <paigepatton@Paiges-MacBook-Air.local>
Signed-off-by: Paige Patton <prubenda@redhat.com>
Co-authored-by: Paige Patton <paigepatton@Paiges-MacBook-Air.local>
2026-02-10 16:24:26 -05:00
Paige Patton
0222b0f161 fix ibm (#1155)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-02-10 10:09:28 -05:00
Ashish Mahajan
f7e674d5ad docs: fix typos in logs, comments, and documentation (#1079)
Signed-off-by: AR21SM <mahajanashishar21sm@gmail.com>
2026-02-09 09:48:51 -05:00
Ashish Mahajan
7aea12ce6c fix(VirtChecker): handle empty VMI interfaces list (#1072)
Signed-off-by: AR21SM <mahajanashishar21sm@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-02-09 08:29:48 -05:00
Darshan Jain
625e1e90cf feat: add color-coded console logging (#1122) (#1146)
Signed-off-by: ddjain <darjain@redhat.com>
2026-02-05 14:27:52 +05:30
dependabot[bot]
a9f1ce8f1b Bump pillow from 10.2.0 to 10.3.0 in /utils/chaos_ai (#1149)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.2.0 to 10.3.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/10.2.0...10.3.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 10.3.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-02 13:47:47 -05:00
Paige Patton
66e364e293 wheel updates (#1148)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-02-02 13:46:22 -05:00
Paige Patton
898ce76648 adding python3.11 updates (#1012)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-02-02 12:00:33 -05:00
Chaudary Farhan Saleem
4a0f4e7cab fix: correct spelling typos in log messages (#1145)
- Fix 'wating' - 'waiting' (2 occurrences)
- Fix 'successfuly' - 'successfully' (12 occurrences)
- Fix 'orginal' - 'original' (1 occurrence)

Improves professionalism of log output and code comments.

Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>
2026-02-02 09:23:44 -05:00
Darshan Jain
819191866d Add CLAUDE.md for AI-assisted development (#1141)
Signed-off-by: ddjain <darjain@redhat.com>
2026-01-31 23:41:49 +05:30
Paige Patton
37ca4bbce7 removing unneeded requirement (#1066)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-20 13:33:28 -05:00
Ashish Mahajan
b9dd4e40d3 fix(hogs): correct off-by-one error in random node selection (#1112)
Signed-off-by: AR21SM <mahajanashishar21sm@gmail.com>
2026-01-20 11:00:50 -05:00
AR21SM
3fd249bb88 Add stale PR management to workflow
Signed-off-by: AR21SM <mahajanashishar21sm@gmail.com>
2026-01-19 15:10:49 -05:00
Naga Ravi Chaitanya Elluri
773107245c Add contribution guidelines reference to the PR template (#1108)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2026-01-19 14:30:04 -05:00
Paige Patton
05bc201528 adding chaos_ai deprecation (#1106)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-19 13:14:04 -05:00
Ashish Mahajan
9a316550e1 fix: add missing 'as e' to capture exception in TimeActionsScenarioPlugin (#1057)
Signed-off-by: AR21SM <mahajanashishar21sm@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-01-19 09:37:30 -05:00
Ashish Mahajan
9c261e2599 feat(ci): add stale issues automation workflow (#1055)
Signed-off-by: AR21SM <mahajanashishar21sm@gmail.com>
2026-01-17 10:13:49 -05:00
Paige Patton
0cc82dc65d add service hijacking to add to file not overwrite (#1067)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-16 14:24:03 -05:00
Paige Patton
269e21e9eb adding telemety (#1064)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-16 13:53:48 -05:00
Paige Patton
d0dbe3354a adding always run tests if pr or main (#1061)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-16 13:24:07 -05:00
Paige Patton
4a0686daf3 adding openstack tests (#1060)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-16 13:23:49 -05:00
Paige Patton
822bebac0c removing arca utils (#1053)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-15 10:50:17 -05:00
Paige Patton
a13150b0f5 changing telemetry test to pod scenarios (#1052)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-13 10:16:26 -05:00
Sai Sanjay
0443637fe1 Add unit tests to pvc_scenario_plugin.py (#1014)
* Add PVC outage scenario plugin to manage PVC annotations during outages

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Remove PvcOutageScenarioPlugin as it is no longer needed; refactor PvcScenarioPlugin to include rollback functionality for temporary file cleanup during PVC scenarios.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor rollback_data handling in PvcScenarioPlugin to use str() instead of json.dumps() for resource_identifier.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Import json module in PvcScenarioPlugin for decoding rollback data from resource_identifier.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* feat: Encode rollback data in base64 format for resource_identifier in PvcScenarioPlugin to enhance data handling and security.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* feat: refactor: Update logging level from debug to info for temp file operations in PvcScenarioPlugin to improve visibility of command execution.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add unit tests for PvcScenarioPlugin methods and enhance test coverage

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add missed lines test cov

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor tests in test_pvc_scenario_plugin.py to use unittest framework and enhance test coverage for to_kbytes method

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Enhance rollback_temp_file test to verify logging of errors for invalid data

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor tests in TestPvcScenarioPluginRun to clarify pod_name behavior and enhance logging verification in rollback_temp_file tests

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactored imports

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor assertions in test cases to use assertEqual for consistency

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-01-13 09:47:12 -05:00
Sai Sanjay
36585630f2 Add tests to service_hijacking_scenario.py (#1015)
* Add rollback functionality to ServiceHijackingScenarioPlugin

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor rollback data handling in ServiceHijackingScenarioPlugin as json string

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Update rollback data handling in ServiceHijackingScenarioPlugin to decode directly from resource_identifier

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Add import statement for JSON handling in ServiceHijackingScenarioPlugin

This change introduces an import statement for the JSON module to facilitate the decoding of rollback data from the resource identifier.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* feat: Enhance rollback data handling in ServiceHijackingScenarioPlugin by encoding and decoding as base64 strings.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add rollback tests for ServiceHijackingScenarioPlugin

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor rollback tests for ServiceHijackingScenarioPlugin to improve error logging and remove temporary path dependency

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Remove redundant import of yaml in test_service_hijacking_scenario_plugin.py

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor rollback tests for ServiceHijackingScenarioPlugin to enhance readability and consistency

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-01-13 09:26:22 -05:00
dependabot[bot]
1401724312 Bump werkzeug from 3.1.4 to 3.1.5 in /utils/chaos_ai/docker
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/3.1.4...3.1.5)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-version: 3.1.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-01-08 20:35:19 -05:00
Paige Patton
fa204a515c testing chagnes link (#1047)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-08 09:19:33 -05:00
LEEITING
b3a5fc2d53 Fix the typo in krkn/cerberus/setup.py (#1043)
* Fix typo in key name for application routes in setup.py

Signed-off-by: iting0321 <iting0321@MacBook-11111111.local>

* Fix typo in 'check_applicaton_routes' to 'check_application_routes' in configuration files and cerberus scripts

Signed-off-by: iting0321 <iting0321@MacBook-11111111.local>

---------

Signed-off-by: iting0321 <iting0321@MacBook-11111111.local>
Co-authored-by: iting0321 <iting0321@MacBook-11111111.local>
2026-01-03 23:29:02 -05:00
Paige Patton
05600b62b3 moving tests out from folders (#1042)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2026-01-02 11:07:29 -05:00
Sai Sanjay
126599e02c Add unit tests for ingress shaping functionality at test_ingress_network_plugin.py (#1036)
* Add unit tests for ingress shaping functionality at test_ingress_network_plugin.py

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add mocks for Environment and FileSystemLoader in network chaos tests

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2026-01-02 14:49:00 +01:00
Sai Sanjay
b3d6a19d24 Add unit tests for logging functions in NetworkChaosNgUtils (#1037)
* Add unit tests for logging functions in NetworkChaosNgUtils

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add pytest configuration to enable module imports in tests

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add tests for logging functions handling missing node names in parallel mode

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2026-01-02 14:48:19 +01:00
Sai Sanjay
65100f26a7 Add unit tests for native plugins.py (#1038)
* Add unit tests for native plugins.py

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Remove redundant yaml import statements in test cases

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add validation for registered plugin IDs and ensure no legacy aliases exist

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2026-01-02 14:47:50 +01:00
Sai Sanjay
10b6e4663e Kubevirt VM outage tests with improved mocking and validation scenarios at test_kubevirt_vm_outage.py (#1041)
* Kubevirt VM outage tests with improved mocking and validation scenarios at test_kubevirt_vm_outage.py

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor Kubevirt VM outage tests to improve time mocking and response handling

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Remove unused subproject reference for pvc_outage

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor Kubevirt VM outage tests to enhance time mocking and improve response handling

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Enhance VMI deletion test by mocking unchanged creationTimestamp to exercise timeout path

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor Kubevirt VM outage tests to use dynamic timestamps and improve mock handling

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2026-01-02 14:47:13 +01:00
Sai Sanjay
ce52183a26 Add unit tests for common_functions in ManagedClusterScenarioPlugin, common_function.py (#1039)
* Add unit tests for common_functions in ManagedClusterScenarioPlugin , common_function.py

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor unit tests for common_functions: improve mock behavior and assertions

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add unit tests for get_managedcluster: handle zero count and random selection

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-01-02 08:23:57 -05:00
Sai Sanjay
e9ab3b47b3 Add unit tests for ShutDownScenarioPlugin with AWS, GCP, Azure, and IBM cloud types at shut_down_scenario_plugin.py (#1040)
* Add unit tests for ShutDownScenarioPlugin with AWS, GCP, Azure, and IBM cloud types at shut_down_scenario_plugin.py

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor logging assertions in ShutDownScenarioPlugin tests for clarity and accuracy

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2026-01-02 08:22:49 -05:00
Sai Sanjay
3e14fe07b7 Add unit tests for Azure class methods in (#1035)
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
2026-01-02 08:20:34 -05:00
Paige Patton
d9271a4bcc adding ibm cloud node tests (#1018)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-23 12:59:22 -05:00
dependabot[bot]
850930631e Bump werkzeug from 3.0.6 to 3.1.4 in /utils/chaos_ai/docker (#1003)
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.6 to 3.1.4.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/3.0.6...3.1.4)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-version: 3.1.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-12-23 08:23:06 -05:00
Sai Sanjay
15eee80c55 Add unit tests for syn_flood_scenario_plugin.py (#1016)
* Add rollback functionality to SynFloodScenarioPlugin

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor rollback pod handling in SynFloodScenarioPlugin to handle podnames as string

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Update resource identifier handling in SynFloodScenarioPlugin to use list format for rollback functionality

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor chaos scenario configurations in config.yaml to comment out existing scenarios for clarity. Update rollback method in SynFloodScenarioPlugin to improve pod cleanup handling. Modify pvc_scenario.yaml with specific test values for better usability.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Enhance rollback functionality in SynFloodScenarioPlugin by encoding pod names in base64 format for improved data handling.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Add unit tests for SynFloodScenarioPlugin methods and rollback functionality

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor TestSynFloodRun and TestRollbackSynFloodPods to inherit from unittest.TestCase

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* Refactor SynFloodRun tests to use tempfile for scenario file creation and improve error logging in rollback functionality

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
2025-12-22 15:01:50 -05:00
Paige Patton
ff3c4f5313 increasing node action coverage (#1010)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-22 11:36:10 -05:00
Paige Patton
4c74df301f adding alibaba and az tests (#1011)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-19 15:31:55 -05:00
Parag Kamble
b60b66de43 Fixed IBM node_reboot_scenario failure (#1007)
Signed-off-by: Parag Kamble <pakamble@redhat.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2025-12-19 10:06:17 -05:00
Paige Patton
2458022248 moving telemetry (#1008)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-18 14:59:37 -05:00
Paige Patton
18385cba2b adding run unit tests on main (#1004)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-17 15:09:47 -05:00
Paige Patton
e7fa6bdebc checking chunk error in ci tests (#937)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-17 15:09:15 -05:00
Paige Patton
c3f6b1a7ff updating return code (#1001)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-16 10:27:24 -05:00
Paige Patton
f2ba8b85af adding podman support in docker configuration (#999)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-15 11:52:30 -05:00
Paige Patton
ba3fdea403 adding pvc ttests (#1000)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-15 11:46:48 -05:00
Paige Patton
42d18a8e04 adding fail scenario if unrecovered kubevirt vm killing (#994)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-15 10:04:35 -05:00
Paige Patton
4b3617bd8a adding gcp tests for node actions (#997)
Assisted By: Claude Code

Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-15 09:39:16 -05:00
Paige Patton
eb7a1e243c adding aws tests for node scenarios (#996)
Assisted By: Claude Code

Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-15 09:38:56 -05:00
Paige Patton
197ce43f9a adding test server (#982)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-12-02 14:10:05 -05:00
dependabot[bot]
eecdeed73c Bump werkzeug from 3.0.6 to 3.1.4
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.6 to 3.1.4.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/3.0.6...3.1.4)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-version: 3.1.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-02 01:09:08 -05:00
zhoujinyu
ef606d0f17 fix:delete statefulset instead of statefulsets while logging
Signed-off-by: zhoujinyu <2319109590@qq.com>
2025-12-02 01:06:22 -05:00
Paige Patton
9981c26304 adding return values for failure cases (#979)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-26 11:03:39 -05:00
Paige Patton
4ebfc5dde5 adding thread lock (#974)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-26 09:37:19 -05:00
Wei Liu
4527d073c6 Make AWS node stop wait time configurable via timeout (#940)
* Make AWS node stop wait time configurable via timeout

Signed-off-by: Wei Liu <weiliu@redhat.com>

* Make AWS node stop wait time configurable via timeout

Signed-off-by: Wei Liu <weiliu@redhat.com>

* Also update node start and terminate

Signed-off-by: Wei Liu <weiliu@redhat.com>

* Make poll interval parameterized

Signed-off-by: Wei Liu <weiliu@redhat.com>

* Add poll_interval to other cloud platforms

Signed-off-by: Wei Liu <weiliu@redhat.com>

---------

Signed-off-by: Wei Liu <weiliu@redhat.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2025-11-24 12:25:23 -05:00
Shivam Sharma
93d6967331 Handled error handling in chaos recommender present in krkn/utils/chaos_recommender, not in run_kraken.py or chaos_recommender in krkn/krkn, as they used different prometheus client than this one (#820)
2025-11-24 12:02:21 -05:00
FAUST.
b462c46b28 feat:Add exlude_label in container scenario (#966)
* feat:Add exlude_label in container scenario

Signed-off-by: zhoujinyu <2319109590@qq.com>

* refactor:use list_pods with exclude_label in container scenario

Signed-off-by: zhoujinyu <2319109590@qq.com>

---------

Signed-off-by: zhoujinyu <2319109590@qq.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2025-11-24 15:59:36 +01:00
FAUST.
ab4ae85896 feat:Add exclude label to application outage (#967)
* feat:Add exclude label to application outage

Signed-off-by: zhoujinyu <2319109590@qq.com>

* chore: add missing comments

Signed-off-by: zhoujinyu <2319109590@qq.com>

* chore: adjust comments

Signed-off-by: zhoujinyu <2319109590@qq.com>

---------

Signed-off-by: zhoujinyu <2319109590@qq.com>
2025-11-24 15:54:05 +01:00
Paige Patton
6acd6f9bd3 adding common vars for new kubevirt checks (#973)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-21 09:51:46 -05:00
Paige Patton
787759a591 removing pycache from files found (#972)
Assisted By: Claude Code

Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-21 09:50:35 -05:00
Paige Patton
957cb355be not properly getting auto variable in RollbackConfig (#971)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-21 09:50:20 -05:00
Paige Patton
35609484d4 fixing batch size limit (#964)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-21 09:47:41 -05:00
LIU ZHE YOU
959337eb63 [Rollback Scenario] Refactor execution (#895)
* Validate version file format

* Add validation for context dir, Exexcute all files by default

* Consolidate execute and cleanup, rename with .executed instead of
removing

* Respect auto_rollback config

* Add cleanup back but only for scenario successed

---------

Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2025-11-19 14:14:06 +01:00
Sai Sanjay
f4bdbff9dc Add rollback functionality to SynFloodScenarioPlugin (#948)
* Add rollback functionality to SynFloodScenarioPlugin

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor rollback pod handling in SynFloodScenarioPlugin to handle podnames as string

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Update resource identifier handling in SynFloodScenarioPlugin to use list format for rollback functionality

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor chaos scenario configurations in config.yaml to comment out existing scenarios for clarity. Update rollback method in SynFloodScenarioPlugin to improve pod cleanup handling. Modify pvc_scenario.yaml with specific test values for better usability.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Enhance rollback functionality in SynFloodScenarioPlugin by encoding pod names in base64 format for improved data handling.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2025-11-19 11:18:50 +01:00
Sai Sanjay
954202cab7 Add rollback functionality to ServiceHijackingScenarioPlugin (#949)
* Add rollback functionality to ServiceHijackingScenarioPlugin

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor rollback data handling in ServiceHijackingScenarioPlugin as json string

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Update rollback data handling in ServiceHijackingScenarioPlugin to decode directly from resource_identifier

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Add import statement for JSON handling in ServiceHijackingScenarioPlugin

This change introduces an import statement for the JSON module to facilitate the decoding of rollback data from the resource identifier.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* feat: Enhance rollback data handling in ServiceHijackingScenarioPlugin by encoding and decoding as base64 strings.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2025-11-19 11:18:15 +01:00
Paige Patton
a373dcf453 adding virt checker tests (#960)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-18 14:27:59 -05:00
Paige Patton
d0c604a516 timeout on main ssh to worker (#957)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-18 09:02:41 -05:00
Sai Sanjay
82582f5bc3 Add PVC Scenario Rollback Feature (#947)
* Add PVC outage scenario plugin to manage PVC annotations during outages

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Remove PvcOutageScenarioPlugin as it is no longer needed; refactor PvcScenarioPlugin to include rollback functionality for temporary file cleanup during PVC scenarios.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Refactor rollback_data handling in PvcScenarioPlugin to use str() instead of json.dumps() for resource_identifier.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* Import json module in PvcScenarioPlugin for decoding rollback data from resource_identifier.

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>

* feat: Encode rollback data in base64 format for resource_identifier in PvcScenarioPlugin to enhance data handling and security.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

* feat: refactor: Update logging level from debug to info for temp file operations in PvcScenarioPlugin to improve visibility of command execution.

Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>

---------

Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2025-11-18 08:10:44 -05:00
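The commit above describes storing rollback data in `resource_identifier` as a base64 string. A minimal sketch of that round trip, assuming the payload is a JSON-serializable dict; the function names and payload fields here are hypothetical and not taken from PvcScenarioPlugin:

```python
import base64
import json


def encode_rollback_data(data: dict) -> str:
    """Serialize rollback data to JSON, then base64-encode it as ASCII text."""
    raw = json.dumps(data, sort_keys=True).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_rollback_data(resource_identifier: str) -> dict:
    """Reverse encode_rollback_data: base64-decode, then parse the JSON."""
    raw = base64.b64decode(resource_identifier.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# Hypothetical payload; the field names are illustrative only.
payload = {"pod_name": "example-pod", "temp_file": "/mnt/data/kraken.tmp"}
identifier = encode_rollback_data(payload)
restored = decode_rollback_data(identifier)
```

Base64 keeps the payload safe to embed in a single string field; it is an encoding, not encryption, so it does not hide the contents.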
Paige Patton
37f0f1eb8b fixing spacing
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-18 02:25:09 -05:00
Paige Patton
d2eab21f95 adding centos image fix (#958)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-17 12:28:53 -05:00
Paige Patton
d84910299a typo (#956)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-13 13:23:58 -05:00
Harry C
48f19c0a0e Fix typo: kubleci -> kubecli in time scenario exclude_label (#955)
Signed-off-by: Harry12980 <onlyharryc@gmail.com>
2025-11-13 13:15:36 -05:00
Paige Patton
eb86885bcd adding kube virt check failure (#952)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-13 10:37:42 -05:00
Paige Patton
967fd14bd7 adding namespace regex match (#954)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-13 09:44:20 -05:00
Harry C
5cefe80286 Add exclude_label parameter to time disruption scenario (#953)
Signed-off-by: Harry12980 <onlyharryc@gmail.com>
2025-11-13 15:21:55 +01:00
Paige Patton
9ee76ce337 post chaos (#939)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-11 14:11:04 -05:00
Tullio Sebastiani
fd3e7ee2c8 Fixes several Image cves (#941)
* fixes some CVEs on the base image

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* oc dependencies updated

* virtctl build

fix

removed virtctl installation

pip

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-11-11 19:50:12 +01:00
dependabot[bot]
c85c435b5d Bump werkzeug from 3.0.3 to 3.0.6 in /utils/chaos_ai/docker (#945)
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.3 to 3.0.6.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/3.0.3...3.0.6)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-version: 3.0.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-11 19:48:47 +01:00
Paige Patton
d5284ace25 adding prometheus url to krknctl input (#943)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-11 13:45:27 -05:00
Paige Patton
c3098ec80b turning off es in ci tests (#944)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-11 12:51:10 -05:00
Paige Patton
6629c7ec33 adding virt checks (#932)
Assisted By: Claude Code

Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-11-05 21:17:21 -05:00
Sandeep Hans
fb6af04b09 Add IBM as a new adopter in ADOPTERS.md
Added IBM as a new adopter with details on their collaboration with Kraken for AI-enabled chaos testing.
2025-11-05 13:02:31 -05:00
Sai Sindhur Malleni
dc1215a61b Add OVN EgressIP scenario (#931)
Signed-off-by: smalleni <smalleni@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-11-04 13:58:36 -05:00
Parag Kamble
f74aef18f8 correct logging format in node_reboot_scenario (#936)
Signed-off-by: Parag Kamble <pakamble@redhat.com>
2025-10-31 15:23:23 -04:00
Paige Patton
166204e3c5 adding debug command line option
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-10-31 11:12:46 -04:00
Paige Patton
fc7667aef1 issue template and improved pull request template
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-10-30 22:29:43 -04:00
Paige Patton
3eea42770f adding ibm power using request calls (#923)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-10-28 12:57:20 -04:00
Tullio Sebastiani
77a46e3869 Adds an exclude label for node scenarios (#929)
* added exclude label for node scenarios

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pipeline fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-10-28 16:55:16 +01:00
Paige Patton
b801308d4a Setting config back to all scenarios running
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-10-24 13:21:01 -04:00
Tullio Sebastiani
97f4c1fd9c main github action fix
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

main github action fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

elastic password

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

config fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-10-17 17:06:35 +02:00
Tullio Sebastiani
c54390d8b1 pod network filter ingress fix (#925)
* pod network filter ingress fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* increasing lib version

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-10-17 12:27:53 +02:00
Tullio Sebastiani
543729b18a Add exclude_label functionality to pod disruption scenarios (#910)
* kill pod exclude label

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* config alignment

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-10-08 22:10:27 +02:00
Paige Patton
a0ea4dc749 adding virt checks to metric info (#918)
Signed-off-by: Paige Patton <prubenda@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-10-08 15:43:48 -04:00
Paige Patton
a5459792ef adding critical alerts to post to elastic search
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-10-08 15:38:20 -04:00
Tullio Sebastiani
d434bb26fa Feature/add exclude label pod network chaos (#921)
* feat: Add exclude_label feature to pod network outage scenarios

This feature enables filtering out specific pods from network outage
chaos testing based on label selectors. Users can now target all pods
in a namespace except critical ones by specifying exclude_label.

- Added exclude_label parameter to list_pods() function
- Updated get_test_pods() to pass the exclude parameter
- Added exclude_label field to all relevant plugin classes
- Updated schema.json with the new parameter
- Added documentation and examples
- Created comprehensive unit tests

Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>

* krkn-lib update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed plugin schema

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
2025-10-08 16:01:41 +02:00
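The `exclude_label` behaviour described above (target every pod that matches, except those carrying a given label) can be sketched as follows. This is an illustration only: pods are modelled as plain dicts, selectors are simple `key=value` strings, and `filter_pods` is a hypothetical helper, not the `list_pods()` implementation from the repository or krkn-lib.

```python
from typing import Dict, List, Optional


def matches(labels: Dict[str, str], selector: str) -> bool:
    """Return True if the labels satisfy a single key=value selector."""
    key, _, value = selector.partition("=")
    return labels.get(key.strip()) == value.strip()


def filter_pods(
    pods: List[dict],
    label_selector: Optional[str] = None,
    exclude_label: Optional[str] = None,
) -> List[str]:
    """Select pods by label_selector, then drop any that match exclude_label."""
    selected = []
    for pod in pods:
        labels = pod.get("labels", {})
        if label_selector and not matches(labels, label_selector):
            continue  # pod is not a target at all
        if exclude_label and matches(labels, exclude_label):
            continue  # pod is explicitly protected from chaos
        selected.append(pod["name"])
    return selected


# Hypothetical pods for illustration.
pods = [
    {"name": "web-1", "labels": {"app": "web"}},
    {"name": "web-2", "labels": {"app": "web", "critical": "true"}},
    {"name": "db-1", "labels": {"app": "db"}},
]
targets = filter_pods(pods, label_selector="app=web", exclude_label="critical=true")
```

Applying the exclusion after the selection means a pod that matches both labels is always spared, which is the safer default for chaos testing.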
Paige Patton
fee41d404e adding code owners (#920)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-10-06 16:03:13 -04:00
Tullio Sebastiani
8663ee8893 new elasticsearch action (#919)
fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-10-06 12:58:26 -04:00
Paige Patton
a072f0306a adding failure if unrecovered pod (#908)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-17 11:59:45 -04:00
Paige Patton
8221392356 adding kill count
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-17 09:46:32 -04:00
Sahil Shah
671fc581dd Adding node_label_selector for pod scenarios (#888)
* Adding node_label_selector for pod scenarios

Signed-off-by: Sahil Shah <sahshah@redhat.com>

* using kubernetes function, adding node_name and removing extra config

Signed-off-by: Sahil Shah <sahshah@redhat.com>

* adding CI test for custom pod scenario

Signed-off-by: Sahil Shah <sahshah@redhat.com>

* fixing comment

* adding test to workflow

* adding list parsing logic for krkn hub

* parsing not needed, as input is always []

---------

Signed-off-by: Sahil Shah <sahshah@redhat.com>
2025-09-15 16:52:08 -04:00
Naga Ravi Chaitanya Elluri
11508ce017 Deprecate blog post links in favor of the website
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-09-08 15:04:53 -04:00
Paige Patton
0d78139fb6 increasing krkn lib version (#906)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-08 09:05:53 -04:00
Paige Patton
a3baffe8ee adding vm name option (#904)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-05 12:43:49 -04:00
Tullio Sebastiani
438b08fcd5 [CNCF Incubation] SBOM generation (#900)
fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-09-05 12:43:37 -04:00
Tullio Sebastiani
9b930a02a5 Implemented the new pod monitoring api on kill pod and kill container scenario (#896)
* implemented the new pod monitoring api

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* minor refactoring

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib 5.1.5 update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-09-05 12:42:57 -04:00
Tullio Sebastiani
194e3b87ee fixed test_pod_network_filter flaky test (#905)
syntax

syntax

fix

fix

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-09-05 11:59:30 -04:00
Paige Patton
8c05e44c23 adding ssh install and virtctl version
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-04 13:57:34 -07:00
Paige Patton
88f8cf49f1 fixing kubevirt name not duplicate namespace
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-04 12:45:05 -07:00
Paige Patton
015ba4d90d adding privileged option (#901)
2025-09-03 11:14:57 -04:00
Tullio Sebastiani
26fdbef144 [CNCF Incubation] RELEASE.md - release process description (#899)
* [CNCF Incubation] RELEASE.md - release process description

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

change

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added maintainers link

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added maintainers members and owners duties

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-09-02 16:18:30 +02:00
Paige Patton
d77e6dc79c adding maintainers definitions (#898)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-09-02 15:52:45 +02:00
Paige Patton
2885645e77 adding return pod status object not ints (#897)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-29 09:40:17 -04:00
Paige Patton
84169e2d4e adding no scenario type (#869)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-29 08:55:06 -04:00
Sahil Shah
05bc404d32 Adding IPMI tool to dockerfile
Signed-off-by: Sahil Shah <sahshah@redhat.com>
2025-08-25 12:28:03 -04:00
Paige Patton
e8fd432fc5 adding enable metrics for prometheus coverage (#871)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-21 17:48:58 +02:00
Tullio Sebastiani
ec05675e3a enabling elastic on main test suite (#892)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-08-21 15:47:11 +02:00
Tullio Sebastiani
c91648d35c Fixing functional tests (#890)
* Fixes the service hijacking issue

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixes the rollback folder issue

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixes the test issue

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added config options to the main config

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-08-21 15:09:52 +02:00
LIU ZHE YOU
24aa9036b0 [Rollback Scenarios] Fix cleanup_rollback_version_files error (#889)
* Replace ValueError with warning when directory count is not 1

* Add default config for rollback feature
2025-08-21 12:12:01 +02:00
LIU ZHE YOU
816363d151 [Rollback Scenarios] Perform rollback (#879)
* Add rollback config

* Inject rollback handler to scenario plugin

* Add Serializer

* Add decorator

* Add test with SimpleRollbackScenarioPlugin

* Add logger for verbose debug flow

* Resolve review comment

- remove additional rollback config in config.yaml
- set KUBECONFIG to ~/.kube/config in test_rollback

* Simplify set_rollback_context_decorator

* Fix integration of rollback_handler in __load_plugins

* Refactor rollback.config module

  - make it singleton class with register method to construct
  - RollbackContext ( <timestamp>-<run_uuid> )
  - add get_rollback_versions_directory for moduling the directory
    format

* Adapt new rollback.config

* Refactor serialization

- respect rollback_callable_name
- refactor _parse_rollback_callable_code
- refine VERSION_FILE_TEMPLATE

* Add get_scenario_rollback_versions_directory in RollbackConfig

* Add rollback in ApplicationOutageScenarioPlugin

* Add RollbackCallable and RollbackContent for type annotation

* Refactor rollback_handler with limited arguments

* Refactor the serialization for rollback

- limited arguments: callback and rollback_content just these two!
- always construct lib_openshift and lib_telemetry in version file
- add _parse_rollback_content_definition for retrieving scenario specific
  rollback_content
- remove utils for formatting variadic function

* Refactor application outage scenario

* Fix test_rollback

* Make RollbackContent with static fields

* simplify serialization

  - Remove all unused format dynamic arguments utils
  - Add jinja template for version file
  - Replace set_context for serialization with passing version to serialize_callable

* Add rollback for hogs scenario

* Fix version file full path based on feedback

- {versions_directory}/<timestamp(ns)>-<run_uuid>/{scenario_type}-<timestamp(ns)>-<random_hash>.py

* Fix scenario plugins after rebase

* Add rollback config

* Inject rollback handler to scenario plugin

* Add test with SimpleRollbackScenarioPlugin

* Resolve review comment

- remove additional rollback config in config.yaml
- set KUBECONFIG to ~/.kube/config in test_rollback

* Fix integration of rollback_handler in __load_plugins

* Refactor rollback.config module

  - make it singleton class with register method to construct
  - RollbackContext ( <timestamp>-<run_uuid> )
  - add get_rollback_versions_directory for moduling the directory
    format

* Adapt new rollback.config

* Add rollback in ApplicationOutageScenarioPlugin

* Add RollbackCallable and RollbackContent for type annotation

* Refactor application outage scenario

* Fix test_rollback

* Make RollbackContent with static fields

* simplify serialization

  - Remove all unused format dynamic arguments utils
  - Add jinja template for version file
  - Replace set_context for serialization with passing version to serialize_callable

* Add rollback for hogs scenario

* Fix version file full path based on feedback

- {versions_directory}/<timestamp(ns)>-<run_uuid>/{scenario_type}-<timestamp(ns)>-<random_hash>.py

* Fix scenario plugins after rebase

* Add execute rollback

* Add CLI for list and execute rollback

* Replace subprocess with importlib

* Fix error after rebase

* fixup! Fix docstring

- Add telemetry_ocp in execute_rollback docstring
- Remove rollback_config in create_plugin docstring
- Remove scenario_types in set_rollback_callable docstring

* fixup! Replace os.urandom with krkn_lib.utils.get_random_string

* fixup! Add missing telemetry_ocp for execute_rollback_version_files

* fixup! Remove redundant import

- Remove duplicate TYPE_CHECKING in handler module
- Remove cast in signal module
- Remove RollbackConfig in scenario_plugin_factory

* fixup! Replace sys.exit(1) with return

* fixup! Remove duplicate rollback_network_policy

* fixup! Decouple Serializer initialization

* fixup! Rename callback to rollback_callable

* fixup! Refine comment for constructing AbstractScenarioPlugin with
placeholder value

* fixup! Add version in docstring

* fixup! Remove uv.lock
2025-08-20 16:50:52 +02:00
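The version file layout quoted in the commit, `{versions_directory}/<timestamp(ns)>-<run_uuid>/{scenario_type}-<timestamp(ns)>-<random_hash>.py`, can be sketched as below. The function names are hypothetical, and `secrets.token_hex` is only a stand-in for the `krkn_lib.utils.get_random_string` helper the commit mentions; this is not the repository's serializer.

```python
import secrets
import time
from pathlib import Path
from typing import Optional


def rollback_context(run_uuid: str, timestamp_ns: Optional[int] = None) -> str:
    """Build the per-run directory name: <timestamp(ns)>-<run_uuid>."""
    ts = timestamp_ns if timestamp_ns is not None else time.time_ns()
    return f"{ts}-{run_uuid}"


def version_file_path(
    versions_directory: str,
    context: str,
    scenario_type: str,
    timestamp_ns: Optional[int] = None,
    random_hash: Optional[str] = None,
) -> Path:
    """Build <versions_directory>/<context>/<scenario_type>-<ts>-<hash>.py."""
    ts = timestamp_ns if timestamp_ns is not None else time.time_ns()
    # Assumption: any short random suffix works to avoid name collisions.
    suffix = random_hash if random_hash is not None else secrets.token_hex(4)
    return Path(versions_directory) / context / f"{scenario_type}-{ts}-{suffix}.py"


# Hypothetical values for illustration.
context = rollback_context("abc123")
path = version_file_path("/tmp/kraken-rollback", context, "application_outage")
```

Grouping files under one directory per run keeps every rollback version produced by the same execution together, while the per-file timestamp preserves the order in which they were written.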
Paige Patton
90c52f907f regex to tools pod names (#886)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-15 11:13:42 -04:00
Paige Patton
4f250c9601 adding affected nodes to affectednodestatus (#884)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-13 20:54:13 -04:00
Paige Patton
6480adc00a adding setting own image for network chaos (#883)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-13 17:49:47 -04:00
Paige Patton
5002f210ae removing dashboard installation
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-05 11:27:41 -04:00
Paige Patton
62c5afa9a2 updated done items in roadmap
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-08-01 13:23:23 -04:00
Paige Patton
c109fc0b17 adding elastic installation into krkn tests
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-31 10:41:31 -04:00
Tullio Sebastiani
fff675f3dd added service account to Network Chaos NG workload (#870)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-07-23 10:17:50 +02:00
Naga Ravi Chaitanya Elluri
c125e5acf7 Update network scenario image
This commit updates fedora tools image reference used by the network scenarios
to the one hosted in the krkn-chaos quay org. This also fixes the issues with
RHACS flagging runs when using latest tag by using tools tag instead.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-07-22 14:29:00 -04:00
Naga Ravi Chaitanya Elluri
ca6995a1a1 [Snyk] Fix for 3 vulnerabilities (#859)
* fix: requirements.txt to reduce vulnerabilities


The following vulnerabilities are fixed by pinning transitive dependencies:
- https://snyk.io/vuln/SNYK-PYTHON-PROTOBUF-10364902
- https://snyk.io/vuln/SNYK-PYTHON-URLLIB3-10390193
- https://snyk.io/vuln/SNYK-PYTHON-URLLIB3-10390194

* partial vulnerability fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-07-22 16:50:31 +02:00
Sahil Shah
50cf91ac9e Disable SSL verification for IBM node scenarios and fix node reboot s… (#861)
* Disable SSL verification for IBM node scenarios and fix node reboot scenario

Signed-off-by: Sahil Shah <sahshah@redhat.com>

* adding disable ssl as a scenario parameter for ibmcloud

Signed-off-by: Sahil Shah <sahshah@redhat.com>

---------

Signed-off-by: Sahil Shah <sahshah@redhat.com>
2025-07-16 12:48:45 -04:00
Tullio Sebastiani
11069c6982 added tolerations to node network filter pod deployment (#867)
2025-07-16 17:11:46 +02:00
Charles Uneze
106d9bf1ae A working kind config (#866)
Signed-off-by: Charles Uneze <charlesniklaus@gmail.com>
2025-07-15 10:25:01 -04:00
Abhinav Sharma
17f832637c feat: add optional node-name field to hog scenarios with precedence over node-selector (#831)
Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
2025-07-11 14:10:16 -04:00
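The precedence rule named in the commit title (an optional node name wins over a node selector) can be sketched as a small resolver. The function name and return shape are hypothetical and only illustrate the ordering, not the hog scenario plugin's actual code.

```python
from typing import Optional, Tuple


def resolve_hog_target(
    node_name: Optional[str], node_selector: Optional[str]
) -> Tuple[str, str]:
    """Pick the targeting mode: node name first, then selector, else any node."""
    if node_name:
        return ("node-name", node_name)
    if node_selector:
        return ("node-selector", node_selector)
    return ("any", "")


# Both fields set: the explicit node name takes precedence.
mode, value = resolve_hog_target("worker-0", "node-role.kubernetes.io/worker=")
```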
Paige Patton
0e5c8c55a4 adding details of node for hog failure
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-10 16:49:28 -04:00
Tullio Sebastiani
9d9a6f9b80 added missing parameters to node-network-filter + added default values
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-07-10 13:22:50 -04:00
Anshuman Panda
f8fe2ae5b7 Refactor: use krkn-lib for getting node IP and remove invoke function usage
Signed-off-by: Anshuman Panda <ichuk0078@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-10 13:21:10 -04:00
Paige Patton
77b1dd32c7 adding kubevirt with pod timing
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-10 13:19:37 -04:00
Anshuman Panda
9df727ccf5 Ensure metrics are always saved with improved local fallback
Signed-off-by: Anshuman Panda <ichuk0078@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-10 13:19:07 -04:00
Tullio Sebastiani
70c8fec705 added pod-network-filter funtest (#863)
* added pod-network-filter funtest

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated kind settings

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-07-10 09:35:59 +02:00
Abhinav Sharma
0731144a6b Add support for triggering kubevirt VM outages (#816)
* add requirement for kubevirt_vm_outage

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* add initial init and kubevirt_plugin files

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* add scenario in kubevirt-vm-outage.yaml

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* implement init, get_scenario_types, run and placeholder for inject and recover functions

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* implement init client, execute_scenario, validate environment, inject and get_VMinstance functions

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* implement recover function

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* implement recover function

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* add test for kubevirt_vm_outage feature

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* improve KubeVirt recovery logic and update dependencies, for kubevirt

Signed-off-by: Paige Patton <prubenda@redhat.com>

* refactor(kubevirt): use KrknKubernetes client for KubeVirt operations

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* chore: Add auto-restart disable option and simplify code

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* chore: remove kubevirt external package used.

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>

* adding few changes and scenario in config file

Signed-off-by: Paige Patton <prubenda@redhat.com>

* removing docs

Signed-off-by: Paige Patton <prubenda@redhat.com>

* no affected pods

Signed-off-by: Paige Patton <prubenda@redhat.com>

---------

Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
Co-authored-by: Paige Patton <prubenda@redhat.com>
2025-07-08 14:04:57 -04:00
yogananth subramanian
9337052e7b Fix bm_node_scenarios.py
Fix the logic in disk disruption scenario, which was returning the right set of disks to be off-lined.

Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
2025-07-07 13:49:33 -04:00
yogananth subramanian
dc8d7ad75b Add disk detach/attach scenario to baremetal node actions (#855)
- Implemented methods for detaching and attaching disks to baremetal nodes.
- Added a new scenario `node_disk_detach_attach_scenario` to manage disk operations.
- Updated the YAML configuration to include the new scenario with disk details.

Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
2025-07-03 17:18:57 +02:00
Paige Patton
1cc44e1f18 adding non native version of pod scenarios (#847)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-03 15:46:13 +02:00
Paige Patton
c8190fd1c1 adding pod test (#858)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-07-03 15:00:51 +02:00
Tullio Sebastiani
9078b35e46 updated krkn-lib
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-07-02 17:30:58 +02:00
Tullio Sebastiani
e6b1665aa1 added toleration to schedule pod on master
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-06-30 10:30:47 +02:00
Tullio Sebastiani
c56819365c minor nits fixes
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-06-27 15:12:45 +02:00
Tullio Sebastiani
6a657576cb api refactoring + pod network filter scenario
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-06-26 15:51:35 +02:00
Tullio Sebastiani
f04f1f1101 added workload image as scenario parameter (#854)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m58s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
* added workload image as scenario parameter

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* renamed workload_image to image

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-06-25 17:08:59 +02:00
Naga Ravi Chaitanya Elluri
bddbd42f8c Expose kube_check parameter for baremetal node scenarios
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 10m7s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-06-16 11:43:32 -04:00
dependabot[bot]
630dbd805b Bump requests from 2.32.2 to 2.32.4
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m38s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-06-11 12:54:11 -04:00
Paige Patton
10d26ba50e adding kube check into gcp zone
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-06-11 12:53:47 -04:00
Naga Ravi Chaitanya Elluri
d47286ae21 Expose parallel option in the baremetal node scenarios
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m14s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-06-09 09:48:04 -04:00
Paige Patton
890e3012dd updating krkn-lib req
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m50s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-06-06 10:55:52 -04:00
Yogananth Subramanian
d0dafa872d Fix: network scenario timing issue
Introduce a delay in network scenarios prior to imposing restrictions.
This ensures that chaos test case jobs are scheduled before any restrictions are put in place.
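The ordering described above can be sketched as follows. This is a minimal illustration, not the code from the commit: `run_network_scenario`, its parameters, and the default delay are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, List


def run_network_scenario(
    schedule_jobs: Callable[[], None],
    apply_restrictions: Callable[[], None],
    delay_seconds: float = 10.0,  # assumed default, not taken from the commit
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Schedule the chaos jobs, wait, and only then impose restrictions."""
    schedule_jobs()
    # Give the scheduler time to place the jobs before traffic is restricted.
    sleep(delay_seconds)
    apply_restrictions()


# Usage with stand-in callables that only record the order of events.
events: List[str] = []
run_network_scenario(
    schedule_jobs=lambda: events.append("jobs scheduled"),
    apply_restrictions=lambda: events.append("restrictions applied"),
    delay_seconds=2.0,
    sleep=lambda seconds: events.append(f"waited {seconds}s"),
)
print(events)
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting.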

Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
2025-06-06 10:55:18 -04:00
Paige Patton
149eb8fcd3 adding kube_check as option into node scenarios
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-06-06 10:54:58 -04:00
Paige Patton
4c462a8971 updating health checks
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-06-06 10:54:39 -04:00
Priyansh Saxena
5bdbf622c3 These changes will:
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m18s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
1. Make the CI workflow fail when tests fail

2. Set a proper Git email for automated commits

3. Fix the Prometheus installation by setting the required `maximumStartupDurationSeconds` parameter

Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>

fix: run command twice

Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>

fix: update helm install command to properly include maximumStartupDurationSeconds=300 ensuring all arguments pass correctly

Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
2025-06-03 11:28:12 -04:00
ShAsHi
0dcb901da1 Update README.md
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m56s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
2025-05-28 07:43:14 -04:00
Paige Patton
6e94df9cfc removing all docs
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m55s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-26 13:30:03 -04:00
Paige Patton
87c2b3c8fd adding recovery times to metrics
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m26s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-22 13:49:30 -04:00
Abhinav Sharma
7e4b2aff65 Add RBAC configuration for privileged and non-privileged users.
Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
2025-05-22 13:48:30 -04:00
10sharmashivam
27f0845182 fix: run all node scenarios instead of exiting after the first
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m45s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: 10sharmashivam <10sharmashivam@gmail.com>
2025-05-16 18:44:53 -04:00
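The behaviour named in the title of 27f0845182 (run every node scenario rather than stopping after the first) might look like the sketch below. It assumes the original problem was an early exit inside the loop, which the commit message does not confirm; `run_all_scenarios` and the scenario names are illustrative only.

```python
from typing import Callable, Dict, List


def run_all_scenarios(scenarios: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every scenario and return the names of those that failed."""
    failed: List[str] = []
    for name, run in scenarios.items():
        try:
            run()
        except Exception as error:
            print(f"scenario {name} failed: {error}")
            failed.append(name)
        # No return or exit here: the loop moves on to the next scenario.
    return failed


def broken() -> None:
    raise RuntimeError("node did not stop")


executed: List[str] = []
failed = run_all_scenarios(
    {
        "node_stop_scenario": broken,
        "node_reboot_scenario": lambda: executed.append("node_reboot_scenario"),
    }
)
print(failed, executed)
```

Collecting failures and reporting them at the end lets a later scenario run even when an earlier one raises.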
Tullio Sebastiani
4c9cd5bced added release notes automatic workflow on tag push (#813)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m24s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-05-15 10:14:43 +02:00
Abhinav Sharma
075dbd10c7 Docs: Fix broken contribution link in README
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m0s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Abhinav Sharma <abhinavs1920bpl@gmail.com>
2025-05-13 09:32:37 -04:00
Tullio Sebastiani
e080ad2ee2 removes a bad character that makes the test fail (#807)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m33s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-05-13 11:39:12 +02:00
Emmanuel Ferdman
693520f306 Migrate to modern Python logger API (#806)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m35s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-12 22:21:18 -04:00
Naga Ravi Chaitanya Elluri
bf909a7c18 Add OpenSSF best practices badge
This helps showcase that the krkn project follows best practices.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-05-12 22:02:37 -04:00
Paige Patton
abbcfe09ec azure block node using network security group and setting it to subnet
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m6s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-08 10:38:05 -04:00
Paige Patton
32fb6eec07 enum of true/false variables
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m20s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-07 16:25:18 -04:00
Roshni Pattath
608b7c847f Red Hat added to Adopters
2025-05-07 14:07:32 -04:00
Paige Patton
edd0159251 adding health check global variables (#798)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m12s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-07 15:47:03 +02:00
Naga Ravi Chaitanya Elluri
cf9f7702ed fix: requirements.txt to reduce vulnerabilities (#795)
The following vulnerabilities are fixed by pinning transitive dependencies:
- https://snyk.io/vuln/SNYK-PYTHON-SETUPTOOLS-9964606

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2025-05-07 15:46:16 +02:00
Tullio Sebastiani
cfe624f153 changed get_node_ip to krkn-lib and removed kubectl dependency (#799)
* changed get_node_ip to krkn-lib and removed kubectl dependency

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated krkn-lib to 5.0.1

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-05-07 15:43:27 +02:00
Paige Patton
62f50db195 removing litmus sa (#797)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-05-07 15:41:49 +02:00
yogananth subramanian
aee838d3ac Fix: Add support for taints (#790) (#791)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m28s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
2025-05-06 12:51:59 -04:00
Tullio Sebastiani
3b4d8a13f9 network_chaos_ng_scenarios configuration fixes (#794)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m9s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-05-02 17:53:14 +02:00
Naga Ravi Chaitanya Elluri
a86bb6ab95 Refactor docs to point to krkn-chaos.dev
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 56s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-05-01 09:19:35 -04:00
Paige Patton
7f0110972b updating tuple type for health checks
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m58s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-04-28 08:24:14 -04:00
Paige Patton
126f4ebb35 logging getting into ingress shaping file
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 21s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-04-21 13:36:11 -04:00
Paige Patton
83d99bbb02 two types of zone outage
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Has been cancelled
Functional & Unit Tests / Generate Coverage Badge (push) Has been cancelled
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-04-14 13:13:37 -04:00
Tullio Sebastiani
2624102d65 Node Network Filtering Scenario + Network Chaos NG modular architecture (#766)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Has been cancelled
Functional & Unit Tests / Generate Coverage Badge (push) Has been cancelled
* network chaos NG modular architecture

error handling

* first working version (missing protocols, number of instances, wait duration)

* added instance_count + sleep + methods documentation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-04-10 16:47:29 +02:00
briankwyu2
02587bcbe6 Update ADOPTERS.md
Some checks are pending
Functional & Unit Tests / Functional & Unit Tests (push) Waiting to run
Functional & Unit Tests / Generate Coverage Badge (push) Blocked by required conditions
2025-04-09 12:40:02 -04:00
Sahil Shah
c51bf04f9e Removing Krkn Documentation (#770)
Some checks are pending
Functional & Unit Tests / Functional & Unit Tests (push) Waiting to run
Functional & Unit Tests / Generate Coverage Badge (push) Blocked by required conditions
2025-04-08 18:13:42 -04:00
Naga Ravi Chaitanya Elluri
41195b1a60 Add placeholder for capturing adopters
Some checks are pending
Functional & Unit Tests / Functional & Unit Tests (push) Waiting to run
Functional & Unit Tests / Generate Coverage Badge (push) Blocked by required conditions
This will enable users and organizations to share their Krkn adoption
journey for their chaos engineering use cases.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-04-08 14:03:03 -04:00
Sahil Shah
ab80acbee7 Adding github-workflow to maintain documentation (#775)
* Adding githubworkflow to maintain documentation

* adding hyperlink
2025-04-08 06:43:47 -04:00
Gareth Healy
3573d13ea9 Fixed dead link in README.md
Some checks are pending
Functional & Unit Tests / Functional & Unit Tests (push) Waiting to run
Functional & Unit Tests / Generate Coverage Badge (push) Blocked by required conditions
Signed-off-by: Gareth Healy <garethahealy@gmail.com>
2025-04-07 14:12:38 -04:00
Tullio Sebastiani
9c5251d52f setuptools + golang stdlib (#781)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m54s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
* setuptools + golang stdlib

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* equals

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-03-24 14:41:25 +01:00
Paige Patton
a0bba27edc trimming down metrics
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m8s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-03-24 10:01:50 +00:00
Tullio Sebastiani
0d0143d1e0 added metrics-patch global krknctl flag
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m41s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

indent

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-03-21 14:29:24 +00:00
Naga Ravi Chaitanya Elluri
0004c05f81 Add security policy
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m15s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
This commit adds a policy on how Krkn follows best practices and
addresses security vulnerabilities.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-03-20 17:40:23 +00:00
Tullio Sebastiani
57a747a34a fix funtests on main branch + removed golang vulnerabilities (#777)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 3m30s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
* fix funtests on main branch + removed golang vulnerabilities

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* upgraded go to 1.23.0 + library updates

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-03-20 13:12:19 +01:00
kattameghana
22108ae4e7 fixed the health checks docs (#776)
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
2025-03-20 09:46:34 +00:00
Tullio Sebastiani
cecaa1eda3 removed deprecated ES fields + removed host validator (#774)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 4m6s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
DCO

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2025-03-19 13:10:44 -04:00
Paige Patton
5450ecb914 adding scenario type (#758)
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-03-19 17:38:45 +01:00
Paige Patton
cad6b68f43 adding collecting metrics (#752)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 1m28s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-03-19 17:08:44 +01:00
Paige Patton
0eba329305 moving ibm node to non native
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-03-19 15:02:12 +00:00
Tullio Sebastiani
ce8593f2f0 random network policy name to allow parallel scenario run on the same cluster
fix name
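A randomized policy name of this kind could be generated as in the sketch below. The prefix, suffix length, and function name are assumptions made for illustration; the commit does not show how the name is actually built.

```python
import secrets


def random_policy_name(prefix: str = "krkn-deny", random_bytes: int = 4) -> str:
    """Return a lowercase, DNS-friendly name with a random hex suffix."""
    # token_hex(n) yields 2 * n lowercase hex characters.
    return f"{prefix}-{secrets.token_hex(random_bytes)}"


first = random_policy_name()
second = random_policy_name()
print(first, second)
```

Because each run gets its own suffix, two scenarios running in parallel on the same cluster are very unlikely to create or delete each other's policy.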
2025-03-19 14:28:35 +00:00
Paige Patton
9061ddbb5b adding cluster events into file
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m30s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-03-18 15:28:45 +00:00
kattameghana
dd4d0d0389 Health checks implementation for application endpoints (#761)
* Hog scenario porting from arcaflow to native (#748)

* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Hog scenario porting from arcaflow to native (#748)

* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* adding vsphere updates to non native

Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* adding node id to affected node

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Fixed the spelling mistake

Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* adding v4.0.8 version (#756)

Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Add autodetecting distribution (#753)

Used is_openshift function from krkn lib

Remove distribution from config

Remove distribution from documentation

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes include health check doc and exit_on_failure config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes include health check doc and exit_on_failure config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Added the health check config in functional test config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Modified the health checks documentation

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for debugging the functional test failing

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* changed the code for debugging in run_test.sh

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removed the functional test running line

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removing the health check config in common_test_config for debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Fixing functional test failure

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removing the changes that are added for debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* few modifications

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Renamed timestamp

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changed the start timestamp and end timestamp data type to the datetime

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes include health check doc and exit_on_failure config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Hog scenario porting from arcaflow to native (#748)

* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* adding node id to affected node

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes include health check doc and exit_on_failure config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Added the health check config in functional test config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Modified the health checks documentation

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for debugging the functional test failing

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* changed the code for debugging in run_test.sh

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removed the functional test running line

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removing the health check config in common_test_config for debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Fixing functional test failure

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removing the changes that are added for debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* few modifications

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Renamed timestamp

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Hog scenario porting from arcaflow to native (#748)

* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Hog scenario porting from arcaflow to native (#748)

* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* adding node id to affected node

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes include health check doc and exit_on_failure config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* initial version of health checks

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for appending success response and health check config format

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Update config.yaml

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Added the health check config in functional test config

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changes for debugging the functional test failing

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* changed the code for debugging in run_test.sh

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removed the functional test running line

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removing the health check config in common_test_config for debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Fixing functional test failure

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Removing the changes that are added for debugging

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* few modifications

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Renamed timestamp

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* passing the health check response as HealthCheck object

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Updated the krkn-lib version in requirements.txt

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

* Changed the coverage

Signed-off-by: kattameghana <meghanakatta8@gmail.com>

---------

Signed-off-by: kattameghana <meghanakatta8@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
Co-authored-by: Paige Patton <prubenda@redhat.com>
Co-authored-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
Co-authored-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
2025-03-18 12:08:30 +00:00
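The health-check commit above (dd4d0d0389) mentions recording responses, `datetime` start and end timestamps, and an `exit_on_failure` option. A rough sketch of how those pieces could fit together follows. `HealthCheckRecord` is a hypothetical stand-in, not krkn-lib's `HealthCheck` class, and treating `exit_on_failure` as "stop checking after the first failing endpoint" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List


@dataclass
class HealthCheckRecord:  # hypothetical stand-in for the real HealthCheck object
    url: str
    status_code: int
    start_timestamp: datetime
    end_timestamp: datetime

    @property
    def ok(self) -> bool:
        return 200 <= self.status_code < 300


def run_health_checks(
    urls: Iterable[str],
    fetch_status: Callable[[str], int],
    exit_on_failure: bool = False,
) -> List[HealthCheckRecord]:
    """Probe each endpoint once and record when the probe started and ended."""
    records: List[HealthCheckRecord] = []
    for url in urls:
        start = datetime.now(timezone.utc)
        status = fetch_status(url)
        end = datetime.now(timezone.utc)
        record = HealthCheckRecord(url, status, start, end)
        records.append(record)
        if exit_on_failure and not record.ok:
            break  # assumed semantics: stop at the first failing endpoint
    return records


# Usage with canned status codes instead of real HTTP requests.
fake_statuses = {
    "http://app.example/healthz": 200,
    "http://db.example/healthz": 503,
    "http://cache.example/healthz": 200,
}
all_records = run_health_checks(fake_statuses, fake_statuses.get)
stopped_early = run_health_checks(fake_statuses, fake_statuses.get, exit_on_failure=True)
print(len(all_records), len(stopped_early))
```

The `fetch_status` callable is injected so the sketch runs without network access; a real probe would issue an HTTP request there.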
dependabot[bot]
0cabe5e91d Bump jinja2 from 3.1.5 to 3.1.6 (#768)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 8m45s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-03-06 22:25:05 -05:00
Naga Ravi Chaitanya Elluri
32fe0223ff Add recommendations around Pod Disruption Budgets
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m14s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
This commit adds a recommendation to test and ensure that Pod Disruption
Budgets are set for critical applications to avoid downtime.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-03-06 07:56:02 -05:00
jtydlack
a25736ad08 Add autodetecting distribution (#753)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m12s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Used is_openshift function from krkn lib

Remove distribution from config

Remove distribution from documentation

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
2025-02-13 15:45:08 -05:00
Paige Patton
440890d252 adding v4.0.8 version (#756)
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 3m50s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-02-05 13:46:58 -05:00
Meghana Katta
69bf20fc76 Fixed the spelling mistake
Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
2025-02-05 12:53:30 -05:00
Paige Patton
2a42a2dc31 adding node id to affected node
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 9m9s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
2025-02-03 19:30:52 -05:00
Paige Patton
21ab8d475d adding vsphere updates to non native
Some checks failed
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 10m19s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-01-31 15:21:48 -05:00
Tullio Sebastiani
b024cfde19 Hog scenario porting from arcaflow to native (#748)
* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread

Signed-off-by: Paige Patton <prubenda@redhat.com>
2025-01-31 13:45:59 -05:00
Tullio Sebastiani
c7e068a562 Hog scenario porting from arcaflow to native (#748)
* added new native hog scenario

* removed arcaflow dependency + legacy hog scenarios

* config update

* changed hog configuration structure + added average samples

* fix on cpu count

* removes tripledes warning

* changed selector format

* changed selector syntax

* number of nodes option

* documentation

* functional tests

* exception handling on hog deployment thread
2025-01-31 17:01:26 +01:00
Tullio Sebastiani
64cfd2ca4d fixes krknctl describe bug
2025-01-20 09:43:59 -05:00
Naga Ravi Chaitanya Elluri
9cb701a616 Convert thresholds to float
This is needed to avoid issues due to comparing two different data types:
TypeError: Invalid comparison between dtype=float64 and str. This commit also
removes the default values for the thresholds so that users must define them,
as they play a key role in determining the outliers.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2025-01-13 15:47:33 -05:00
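The type mismatch described in commit 9cb701a616 can be illustrated with a minimal sketch. It assumes thresholds arrive from a config file as strings or numbers; the function name `parse_thresholds` and the key names are hypothetical and are not taken from the krkn code base.

```python
def parse_thresholds(config: dict, required: tuple) -> dict:
    """Return the required thresholds as floats; raise if any is missing."""
    missing = [key for key in required if key not in config]
    if missing:
        # No defaults on purpose: the user must define every threshold.
        raise ValueError(f"missing mandatory thresholds: {', '.join(missing)}")
    # Values read from YAML or the command line may arrive as strings.
    return {key: float(config[key]) for key in required}


# Hypothetical key names, used only for this example.
REQUIRED = ("cpu_threshold", "mem_threshold")

raw = {"cpu_threshold": "0.5", "mem_threshold": 0.8}
thresholds = parse_thresholds(raw, REQUIRED)

observed_cpu = 0.75
# Comparing observed_cpu with the raw string "0.5" would raise TypeError;
# after conversion the comparison is well defined.
is_outlier = observed_cpu > thresholds["cpu_threshold"]
```

Raising on a missing key rather than substituting a default mirrors the intent stated in the commit message: outlier detection should only run with thresholds the user chose explicitly.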
dependabot[bot]
0372013b67 Bump jinja2 from 3.1.4 to 3.1.5 (#745)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.4...3.1.5)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-08 09:54:26 +01:00
Tullio Sebastiani
4fea1a354d added krknctl types to krkn baseimage for global variables (#741)
* added krknctl types to krkn baseimage for global variables

fixed

* fixed dockerfile

* dockerfile compile script

fix
2025-01-07 10:12:37 -05:00
Pablo Méndez Hernández
667798d588 Change API from 'Google API Client' to 'Google Cloud Python Client' (#723)
* Document how to use Google's credentials associated with a user account

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>

* Change API from 'Google API Client' to 'Google Cloud Python Client'

According to the 'Google API Client' GH page:

```
This library is considered complete and is in maintenance mode. This means
that we will address critical bugs and security issues but will not add any
new features.

This library is officially supported by Google. However, the maintainers of
this repository recommend using Cloud Client Libraries for Python, where
possible, for new code development.
```

So change the code accordingly to adapt it to 'Google Cloud Python Client'.

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>

---------

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
2024-12-12 22:34:45 -05:00
jtydlack
0c30d89a1b Add node_disk_detach_attach_scenario for aws under node scenarios
Resolves #678

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>

Add functions for aws detach disk scenario

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>

Add detach disk scenario in node scenario

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>

Add disk_detach_attach_scenario in docs

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
2024-12-10 09:21:05 -05:00
Paige Patton
2ba20fa483 adding code block
2024-12-05 12:37:43 -05:00
Paige Patton
97035a765c adding get node name list changes
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-26 10:34:25 -05:00
Paige Patton
10ba53574e not equal to gcp
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-15 09:31:09 -07:00
Paige Patton
0ecba41082 adding multi label comment
2024-11-12 10:34:09 -07:00
Paige Patton
491f59d152 few small changes
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-12 10:34:09 -07:00
Tullio Sebastiani
2549c9a146 bump werkzeug to 3.0.6 to fix cve on krkn-hub baseimage
2024-11-12 09:42:50 -07:00
Henrick Goldwurm
949f1f09e0 Add support for user-provided default network ACL (#731)
* Add support for user-provided default network ACL

Signed-off-by: henrick <self@thehenrick.com>

* Add logs to notify user when their provided acl is used

Signed-off-by: henrick <self@thehenrick.com>

* Update docs to include optional default_acl_id parameter in zone_outage

Signed-off-by: henrick <self@thehenrick.com>

---------

Signed-off-by: henrick <self@thehenrick.com>
Co-authored-by: henrick <self@thehenrick.com>
2024-11-06 12:58:25 -05:00
Naga Ravi Chaitanya Elluri
959766254d Update status of the relevant work items under roadmap
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-11-04 08:36:11 -05:00
Paige Patton
0e68dedb12 adding ibm shut down scenario (#697)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-01 15:16:07 -04:00
Tullio Sebastiani
34a676a795 block_size parameter for dd (#719)
removed log

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-28 11:45:33 -04:00
Naga Ravi Chaitanya Elluri
e5c5b35db3 Update kube-burner references to krkn
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-10-28 11:03:52 -04:00
Pablo Méndez Hernández
93d2e60386 Fix typo in docs index
Replace "oraganization" with "organization" in table of contents.

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
2024-10-24 15:10:55 -04:00
Naga Ravi Chaitanya Elluri
462c9ac67e Rename test suite name to chaos-krkn
This is needed for the TRT/component readiness integration to improve
dashboard readability and tie results back to chaos.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-10-21 14:38:37 -04:00
Tullio Sebastiani
04e44738d9 updated deprecated upload artifact action (#717)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-11 17:03:24 +02:00
Tullio Sebastiani
f810cadad2 Fixes the Plugin scenario schema error (#718)
* reformatting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* schema refactoring

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* plugin refactoring

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-10 09:59:53 -04:00
Tullio Sebastiani
4b869bad83 added fallback on dd if fallocate is not in the $PATH (#716)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-10 11:15:03 +02:00
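The fallback described in commit 4b869bad83 can be sketched as follows. This is only an illustration under assumptions: the helper `build_fill_command`, its parameters, and the injected `which` argument are hypothetical and do not reflect how krkn actually builds the command.

```python
import shutil


def build_fill_command(path, size_bytes, block_size=4096, which=shutil.which):
    """Build an argv list that creates a file of about size_bytes at path."""
    if which("fallocate"):
        # fallocate reserves the space without writing the data out.
        return ["fallocate", "-l", str(size_bytes), path]
    # Fallback: write zeros with dd, rounding the block count up, so the
    # resulting file may be slightly larger than size_bytes.
    count = -(-size_bytes // block_size)
    return [
        "dd",
        "if=/dev/zero",
        f"of={path}",
        f"bs={block_size}",
        f"count={count}",
    ]
```

Passing `which` as a parameter keeps the lookup testable without touching the real `$PATH`; the returned list can be handed to `subprocess.run` unchanged.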
Matt Leader
a36b0c76b2 OCP Chaos Arcaflow Workflow (#699)
* add workflows

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* update readme

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* rm my kubeconfig path

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* add workflow details to readme

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* mv arcaflow to utils

Signed-off-by: Matthew F Leader <mleader@redhat.com>

---------

Signed-off-by: Matthew F Leader <mleader@redhat.com>
2024-10-09 14:46:08 -04:00
Tullio Sebastiani
a17e16390c cluster events check removed from funtest (deprecated krkn-lib v4.0.0)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-09 10:19:24 -04:00
Paige Patton
f8534d616c v4.0.3
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-10-08 23:30:28 -04:00
Paige Patton
9670ce82f5 adding container updates
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-10-08 14:31:29 -04:00
Paige Patton
95e4b68389 plural pod network
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-10-08 11:14:54 -04:00
Tullio Sebastiani
0aac6119b0 hotfix: krkn-lib update (#709)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-07 08:22:31 -04:00
Tullio Sebastiani
7e5bdfd5cf disabled elastic (#708)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-04 12:42:34 -04:00
Tullio Sebastiani
3c207ab2ea hotfix: krkn-lib update (#706)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-04 11:11:20 -04:00
Tullio Sebastiani
d91172d9b2 Core Refactoring, Krkn Scenario Plugin API (#694)
* relocated shared libraries from `kraken` to `krkn` folder

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* AbstractScenarioPlugin and ScenarioPluginFactory

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* application_outage porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* arcaflow_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* managedcluster_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* network_chaos porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* node_actions porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* plugin_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pvc_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* service_disruption porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* service_hijacking porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* cluster_shut_down_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* syn_flood porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* time_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* zone_outages porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* ScenarioPluginFactory tests

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* unit tests update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pod_scenarios and post actions deprecated

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

scenarios post_actions

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* funtests and config update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* run_krkn.py update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* utils porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* API Documentation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* container_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* funtest fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* document gif update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* Documentation + tests update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed example plugin

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* global renaming

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* config.yaml typos

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typos

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed `plugin_scenarios` from NativScenarioPlugin class

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pod_network_scenarios type added

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* documentation update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-03 20:48:04 +02:00
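The plugin layout introduced by commit d91172d9b2 can be pictured with a simplified sketch. The class names `AbstractScenarioPlugin` and `ScenarioPluginFactory` come from the commit message; every method name and signature below, the `register` decorator, and the `TimeScenarioPlugin` example class are assumptions made for illustration and may differ from the real krkn API.

```python
from abc import ABC, abstractmethod


class AbstractScenarioPlugin(ABC):
    """Simplified stand-in: each plugin declares the scenario types it handles."""

    @abstractmethod
    def get_scenario_types(self) -> list:
        """Return the scenario type names this plugin can run."""

    @abstractmethod
    def run(self, scenario_file: str) -> int:
        """Run one scenario file; return 0 on success, non-zero on failure."""


class ScenarioPluginFactory:
    """Maps scenario type names to plugin classes (hypothetical design)."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin_cls):
        # Instantiate once to ask the plugin which types it serves.
        for scenario_type in plugin_cls().get_scenario_types():
            if scenario_type in self._plugins:
                raise ValueError(f"duplicate scenario type: {scenario_type}")
            self._plugins[scenario_type] = plugin_cls
        return plugin_cls

    def create(self, scenario_type):
        try:
            return self._plugins[scenario_type]()
        except KeyError:
            raise ValueError(f"no plugin for scenario type: {scenario_type}") from None


factory = ScenarioPluginFactory()


@factory.register
class TimeScenarioPlugin(AbstractScenarioPlugin):
    """Placeholder plugin used only to exercise the factory."""

    def get_scenario_types(self):
        return ["time_scenarios"]

    def run(self, scenario_file):
        return 0
```

With this shape, the runner only needs the scenario type string from the config to obtain a plugin instance, which is one plausible reason for porting each scenario to a common interface.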
Tullio Sebastiani
a13fb43d94 krkn-lib updated v3.1.2
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-03 09:44:20 -04:00
Tullio Sebastiani
37ee7177bc krkn-lib update to support VirtualMachine count (#704)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-03 10:38:44 +02:00
Tullio Sebastiani
32142cc159 CVEs fix (#698)
* golang cves fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* arcaflow update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-09-20 08:33:41 -04:00
Paige Patton
34bfc0d3d9 Adding aws bare metal (#695)
* adding aws bare metal

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

* no found reservations

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

---------

Co-authored-by: Auto User <auto@users.noreply.github.com>
2024-09-18 13:55:58 -04:00
Tullio Sebastiani
736c90e937 Namespaced cluster events and logs integration (#690)
* namespaced events integration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* namespaced logs implementation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

namespaced logs plugin scenario

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

namespaced logs integration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* logs collection fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib 3.1.0 update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-09-12 11:54:57 +02:00
Naga Ravi Chaitanya Elluri
5e7938ba4a Update default configuration pointer for the node scenarios (#693)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-09-09 22:10:25 -04:00
Paige Patton
b525f83261 restart kubelet (#688)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>
2024-09-09 21:57:53 -04:00
Paige Patton
26460a0dce Adding elastic set to none (#691)
* adding elastic set to none

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>

* too many ls

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

---------

Signed-off-by: Auto User <auto@users.noreply.github.com>
Co-authored-by: Auto User <auto@users.noreply.github.com>
2024-09-05 16:05:19 -04:00
dependabot[bot]
7968c2a776 Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-09-03 23:03:39 -04:00
Tullio Sebastiani
6186555c15 Elastic search krkn-lib integration (#658)
* Elastic search krkn-lib integration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

removed default urls

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* Fix alerts bug on prometheus

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* fixed prometheus object initialization bug

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated requirements to krkn-lib 2.1.8

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* disabled alerts and metrics by default

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* reverted requirement to elastic branch on krkn-lib

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* numpy downgrade

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* maximum retries added to hijacking funtest

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added elastic settings to funtest config

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib 3.0.0 update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-08-28 10:46:42 -04:00
Tullio Sebastiani
9cd086f59c Adds the startup option to produce prow junit XML output for sippy integration (#684)
* removed legacy kubernetes module

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added sippy junit XML file production options

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

krkn-lib update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-08-13 12:40:34 +02:00
Naga Ravi Chaitanya Elluri
1057917731 Add duration parameter for node scenarios
This option is enabled only for the node_stop_start scenario, where the
user may want to stop the node for a certain duration to understand
the impact before starting the node again. This commit also raises
the timeout for the scenario from 120 seconds to 360 seconds to make
sure there is enough time for the node to reach the Ready state on the
Kubernetes side after the node is started on the infra side.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-08-12 13:40:18 -04:00
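The stop, wait, start, then wait-for-Ready sequence described in commit 1057917731 can be sketched as below. The function `node_stop_start` and its callable parameters are hypothetical placeholders for the cloud and Kubernetes calls; only the 360 second timeout is taken from the commit message.

```python
import time


def node_stop_start(stop, start, is_ready, duration=0, timeout=360,
                    poll_interval=5, sleep=time.sleep, clock=time.monotonic):
    """Stop a node, keep it down for `duration` seconds, start it again,
    then poll until it reports Ready or `timeout` seconds have passed."""
    stop()
    sleep(duration)  # keep the node down so the impact can be observed
    start()
    deadline = clock() + timeout
    while clock() < deadline:
        if is_ready():
            return True
        sleep(poll_interval)
    return False
```

The `sleep` and `clock` parameters exist only so the timing can be faked in tests; in normal use the defaults apply and the caller passes three callables that wrap the real infrastructure and cluster checks.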
Naga Ravi Chaitanya Elluri
5484828b67 Deprecate running krkn as kubernetes app
This commit removes the instructions on running krkn as kubernetes
deployment as it is not supported/maintained and also not recommended.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-08-09 13:44:43 -04:00
Naga Ravi Chaitanya Elluri
d18b6332e5 Improve node-scenario docs
This commit adds sample configuration files for each of the supported
platforms.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-08-07 13:52:15 -04:00
Paige Patton
89a0e166f1 no multiprocess for gcp shutdown (#682)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>
2024-08-03 18:43:52 -04:00
Naga Ravi Chaitanya Elluri
624f50acd1 Output rate of increase for the SLO queries
This commit:
- Switches the severity of the rate queries to critical, as the 5%
  threshold is high for low scale/density clusters and needs to be flagged.
- Adds rate queries to the openshift alerts file

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-08-01 12:29:35 -04:00
Tullio Sebastiani
e02c6d1287 SYN flood scenario (#668)
* scenario config file

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* syn flood plugin

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* run_krkn.py updated

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* requirements.txt + documentation + config.yaml

* set node selector defaults to worker

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-07-29 15:31:37 -04:00
jtydlack
04425a8d8a Add alerts to alert.yaml
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
2024-07-25 10:51:15 -04:00
Naga Ravi Chaitanya Elluri
f3933f0e62 fix: requirements.txt to reduce vulnerabilities (#673)
The following vulnerabilities are fixed by pinning transitive dependencies:
- https://snyk.io/vuln/SNYK-PYTHON-SETUPTOOLS-7448482

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
2024-07-22 10:12:14 -04:00
Naga Ravi Chaitanya Elluri
56ff0a8c72 Deprecate setting release version in the container source file
This commit also deprecates building the container image for ppc64le as it
is not actively maintained. We will add support if users request it
in the future.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-07-18 12:56:08 -04:00
Tullio Sebastiani
9378cd74cd krkn-lib update v2.1.6 to fix pod monitoring time calculations (#674)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-07-16 18:04:24 +02:00
Paige Patton
4d3491da0f adding action token passing (#671)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-07-15 12:50:20 -04:00
Naga Ravi Chaitanya Elluri
d6ce66160b Remove podman-compose dependency
We are not using it in the krkn code base and removing it fixes one
of the license issues reported by FOSSA. This commit also removes
setting up dependencies using docker/podman compose as it is not actively
maintained.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-07-10 17:25:33 -04:00
Paige Rubendall
ef1a55438b taking out need for az cli to be installed
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-07-05 15:18:06 -04:00
Tullio Sebastiani
d8f54b83a2 fixed image push issue
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-07-05 10:32:01 -04:00
Tullio Sebastiani
4870c86515 moves the krkn-hub build from push on main to tag (#660)
* moves the krkn-hub build from push on main to tag + final image enhancement

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixed syntax

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* quotes

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-07-05 16:09:34 +02:00
Naga Ravi Chaitanya Elluri
6ae17cf678 Update dockerfile to install azure-cli using dnf
Avoids architecture issues such as "bash: /usr/bin/az: cannot execute: required file not found"

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-07-03 18:35:45 -04:00
Tullio Sebastiani
ce9f8aa050 Dockerfile update v1.6.2 (#659)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-07-03 16:34:37 +02:00
Paige Patton
05148317c1 taking out one gcloud call (#657)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-07-03 16:14:19 +02:00
Tullio Sebastiani
5f836f294b Kill pod arca plugin update adaptation (#656)
* new kill-pod interface adaptation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* unit test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* requirements update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* fixed duplicate requirement

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added conditional dockerfile build

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

removed useless print

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-07-03 15:50:43 +02:00
snyk-bot
cfa1bb09a0 fix: requirements.txt to reduce vulnerabilities
The following vulnerabilities are fixed by pinning transitive dependencies:
- https://snyk.io/vuln/SNYK-PYTHON-REQUESTS-6928867
2024-06-24 10:23:37 -04:00
Naga Ravi Chaitanya Elluri
5ddfff5a85 Make krkn dir executable
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-06-20 14:32:20 -04:00
Tullio Sebastiani
7d18487228 Dockerfile update
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-12 14:36:38 -04:00
Naga Ravi Chaitanya Elluri
08de42c91a Bump arcaflow version to 0.17.2 (#648)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-06-12 20:29:32 +02:00
dependabot[bot]
dc7d5bb01b Bump azure-identity from 1.15.0 to 1.16.1
Bumps [azure-identity](https://github.com/Azure/azure-sdk-for-python) from 1.15.0 to 1.16.1.
- [Release notes](https://github.com/Azure/azure-sdk-for-python/releases)
- [Changelog](https://github.com/Azure/azure-sdk-for-python/blob/main/doc/esrp_release.md)
- [Commits](https://github.com/Azure/azure-sdk-for-python/compare/azure-identity_1.15.0...azure-identity_1.16.1)

---
updated-dependencies:
- dependency-name: azure-identity
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-06-12 09:17:14 -04:00
Tullio Sebastiani
ea3444d375 added dependencies removed from the hub
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

jsonschema

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-11 12:07:28 -04:00
Tullio Sebastiani
7b660a0878 Fixes system and oc vulnerabilities detected by trivy (#644)
* fixes system and oc vulnerabilities detected by trivy

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated base image to run as krkn user instead of root

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-10 14:26:03 -04:00
Tullio Sebastiani
5fe0655f22 libnghttp2 version update
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-06 08:21:08 -04:00
Tullio Sebastiani
5df343c183 dockerfile update
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-04 14:36:11 -04:00
Tullio Sebastiani
f364e9f283 Arcaflow upgrade to engine v0.17.1 (#639)
* krkn plugin refactoring to match new engine context path management

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* cpu-hog new syntax

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* memory-hog new syntax

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

removed s from duration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* io-hog new syntax

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

cpu-hog input

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* path management refactoring agreed with arca team

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

refactoring

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-04 14:13:33 -04:00
Tullio Sebastiani
86a7427606 Dockerfile refactoring to build oc together with krkn
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

added oc in /usr/local/bin as well

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixed dumb docker build copy

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-06-04 10:41:11 -04:00
Mudit Verma
31266fbc3e support for node limits
2024-05-31 11:22:30 -04:00
Tullio Sebastiani
57de3769e7 ubi 9 base image + quay.io vulnerability fixes
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-31 10:58:52 -04:00
Paige Rubendall
42fc8eea40 adding wait in pvc scenarios and service hijack
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-05-29 16:34:33 -04:00
dependabot[bot]
22d56e2cdc ---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-22 17:12:46 -04:00
Matt Leader
a259b68221 Updates for Arcaflow Plugin Stress-NG 0.6.0 (#625)
* change for cpu hog

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* change for io hog

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* change for memory hog

Signed-off-by: Matthew F Leader <mleader@redhat.com>

---------

Signed-off-by: Matthew F Leader <mleader@redhat.com>
2024-05-20 12:35:51 -04:00
Tullio Sebastiani
052f83e7d9 added reference to webservice source code in the documentation (#630)
2024-05-14 17:58:06 +02:00
Tullio Sebastiani
fb3bbe4e26 replaced log syntax to allow objects to be printed
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-14 11:13:44 -04:00
Naga Ravi Chaitanya Elluri
96ba9be4b8 Add instructions to copy the python package file to docker dir (#616)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-05-13 12:36:37 -04:00
Naga Ravi Chaitanya Elluri
58d5d1d8dc Have a config in the chaos_recommender dir (#615)
This will make it easy for the users to find, configure and run it.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-05-13 12:33:41 -04:00
Tullio Sebastiani
3fe22a0d8f fixing badgecommit fail when coverage doesn't change
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-13 12:30:59 -04:00
Tullio Sebastiani
21b89a32a7 fixing missing import for log_exception
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-13 11:58:13 -04:00
Tullio Sebastiani
dbe3ea9718 Dockerfiles update
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-13 10:56:58 -04:00
Tullio Sebastiani
a142f6e7a4 Service hijacking scenario (#617)
* WIP: service hijacking scenario

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* wip

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* error handling

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

adapted run_kraken.py

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* restored config.yaml

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added funtest

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixed test

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix test

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixed funtest

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

funtest fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

minor nit

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

added explicit curl method

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

push

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

restored all funtests

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

added mime type test

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixed pipeline

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

commented unit

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

utf-8

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test restored

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix test pipeline

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* documentation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib 2.1.3

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added other funtests to main merge to collect coverage

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-13 10:04:06 +02:00
Tullio Sebastiani
2610a7af67 added coverage badge and build badge to krkn
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

nit

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

permission

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

if main

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-10 09:57:10 -04:00
dependabot[bot]
f827f65132 Bump werkzeug from 2.3.8 to 3.0.3 in /utils/chaos_ai/docker (#619)
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.3.8 to 3.0.3.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/2.3.8...3.0.3)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-05-06 16:09:10 -04:00
dependabot[bot]
aa6cbbc11a Bump werkzeug from 3.0.1 to 3.0.3
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.1 to 3.0.3.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/3.0.1...3.0.3)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-06 16:04:27 -04:00
dependabot[bot]
e17354e54d Bump jinja2 from 3.1.3 to 3.1.4
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.3...3.1.4)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-06 15:44:52 -04:00
Tullio Sebastiani
2dfa5cb0cd fixes missing data in telemetry.json
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-05-06 14:16:09 -04:00
dependabot[bot]
0799008cd5 Bump flask from 2.1.0 to 2.2.5 in /utils/chaos_ai/docker (#611)
Bumps [flask](https://github.com/pallets/flask) from 2.1.0 to 2.2.5.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/flask/compare/2.1.0...2.2.5)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-04-25 09:11:50 -04:00
Tullio Sebastiani
2327531e46 Dockerfiles update (#614)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-04-24 11:40:58 -04:00
dependabot[bot]
2c14c48a63 Bump werkzeug from 2.2.2 to 2.3.8 in /utils/chaos_ai/docker (#610)
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.2.2 to 2.3.8.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/2.2.2...2.3.8)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-23 15:26:51 +02:00
Tullio Sebastiani
ab98e416a6 Integration of the new pod recovery monitoring strategy implemented in krkn-lib (#609)
* pod monitoring integration in plugin scenario

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pod monitoring integration in container scenario

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed wait-for-pod step from plugin scenario config files

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* introduced global pod recovery time

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

nit

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* introduced krkn_pod_recovery_time in plugin scenario and removed all the references to wait-for-pods

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* functional test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* main branch functional test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* increased recovery times

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-04-23 10:49:01 +02:00
Sandeep Hans
19ad2d1a3d initial version of Chaos AI (#606)
* init push

Signed-off-by: Sandeep Hans <shans001@in.ibm.com>

* remove litmus + updated readme

Signed-off-by: Sandeep Hans <shans001@in.ibm.com>

* remove redundant files

Signed-off-by: Sandeep Hans <shans001@in.ibm.com>

* removed generated file+unused reference

---------

Signed-off-by: Sandeep Hans <shans001@in.ibm.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-04-16 10:41:31 -04:00
jtydlcak
804d7cbf58 Accept list of namespaces in chaos recommender
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
2024-04-09 23:32:17 -04:00
Paige Rubendall
54af2fc6ff adding v1.5.12 tag
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-03-29 18:45:52 -04:00
Paige Rubendall
b79e526cfd adding app outage not creating file (#605)
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-03-29 14:35:14 -04:00
Naga Ravi Chaitanya Elluri
a5efd7d06c Bump release version to v1.5.11
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-03-22 15:24:04 -04:00
yogananth
a1b81bd382 Fix: Resolve ingress network chaos plugin issue
Added network_chaos to plugin step and job wait time to be based on the test duration and set the default wait_time to 30s

Signed-off-by: yogananth subramanian <ysubrama@redhat.com>
2024-03-22 14:48:17 -04:00
Naga Ravi Chaitanya Elluri
782440c8c4 Copy oc and kubectl clients to additional paths
This will make sure oc and kubectl clients are accessible for users
with both /usr/bin and /usr/local/bin paths set on the host.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-03-21 11:29:50 -04:00
Naga Ravi Chaitanya Elluri
7e2755cbb7 Remove container status badge
Quay is no longer exposing it correctly: https://quay.io/repository/krkn-chaos/krkn/status

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-03-19 15:33:25 -04:00
Naga Ravi Chaitanya Elluri
2babb53d6e Bump cryptography version
This is needed to fix the security vulnerability: https://nvd.nist.gov/vuln/detail/CVE-2024-26130.
Note: Reported by FOSSA.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-03-19 14:44:47 -04:00
Tullio Sebastiani
85f76e9193 do not consider exit code 2 as an error in funtests
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-17 23:07:46 -04:00
Liangquan Li
8bf21392f1 fix doc's nit
Signed-off-by: Liangquan Li <liangli@redhat.com>
2024-03-13 15:21:57 -04:00
Tullio Sebastiani
606fb60811 changed exit codes on post chaos alerts and post_scenario failure (#592)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-07 16:31:55 +01:00
Tullio Sebastiani
fac7c3c6fb lowered arcaflow log level to error (#591)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-07 15:32:53 +01:00
Paige Rubendall
8dd9b30030 updating tag (#589)
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-03-06 13:11:44 -05:00
Naga Ravi Chaitanya Elluri
2d99f17aaf fix: requirements.txt to reduce vulnerabilities (#587)
The following vulnerabilities are fixed by pinning transitive dependencies:
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3172287
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3314966
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3315324
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3315328
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3315331
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3315452
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3315972
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3315975
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3316038
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-3316211
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-5663682
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-5777683
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-5813745
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-5813746
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-5813750
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-5914629
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-6036192
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-6050294
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-6092044
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-6126975
- https://snyk.io/vuln/SNYK-PYTHON-CRYPTOGRAPHY-6210214
- https://snyk.io/vuln/SNYK-PYTHON-SETUPTOOLS-3180412
- https://snyk.io/vuln/SNYK-PYTHON-WHEEL-3180413

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
2024-03-06 12:54:30 -05:00
Tullio Sebastiani
50742a793c updated krkn-lib to 2.1.0 (#588)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-06 11:30:01 -05:00
Naga Ravi Chaitanya Elluri
ba6a844544 Add /usr/local/bin to the path for krkn images
This is needed to ensure oc and kubectl binaries under /usr/local/bin
are accessible.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-03-04 16:03:40 -05:00
Tullio Sebastiani
7e7a917dba dockerfiles update (#585)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-04 15:59:53 +01:00
Tullio Sebastiani
b9c0bb39c7 checking post run alerts properties presence (#584)
added metric check

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-01 18:30:54 +01:00
Tullio Sebastiani
706a886151 checking alert properties presence (#583)
typo fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-01 17:58:21 +01:00
Tullio Sebastiani
a1cf9e2c00 fixed typo on funtests (#582)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-01 17:09:19 +01:00
Tullio Sebastiani
0f5dfcb823 fixed the telemetry funtest according to the new telemetry API
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-03-01 09:48:56 -05:00
Tullio Sebastiani
1e1015e6e7 added new WS configuration to funtests
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-29 11:35:00 -05:00
Tullio Sebastiani
c71ce31779 integrated new telemetry library for WS 2.0
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

updated krkn-lib version

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-28 22:58:54 -05:00
Tullio Sebastiani
1298f220a6 Critical alerts collection and upload (#577)
* added prometheus client method for critical alerts

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* adapted run_kraken to the new plugin method for critical_alerts collection + telemetry upload

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* requirements.txt pointing temporarily to git

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* fixed severity level

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added functional tests

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* exit on post chaos critical alerts

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

log moved

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed noisy log

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fixed log

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated requirements.txt to krkn-lib 1.4.13

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn lib

* added check on variable that makes kraken return 1 when post critical alerts are > 0

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-28 09:48:29 -05:00
jtydlcak
24059fb731 Add json output file option for recommender (#511)
Output in terminal changed to use json structure.

The json output file names are in format
recommender_namespace_YYYY-MM-DD_HH-MM-SS.

The path to the json file can be specified. Default path is in
kraken/utils/chaos_recommender/recommender_output.

Signed-off-by: jtydlcak <139967002+jtydlack@users.noreply.github.com>
2024-02-27 11:09:00 -05:00
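
The file naming scheme described in the commit above (`recommender_namespace_YYYY-MM-DD_HH-MM-SS`, written under `kraken/utils/chaos_recommender/recommender_output` by default) could be sketched roughly as follows. This is an illustration, not the recommender's actual code: the helper name `recommender_output_path`, its signature, the `.json` suffix, and the `robot-shop` namespace are all assumptions.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Default directory as stated in the commit message.
DEFAULT_OUTPUT_DIR = "kraken/utils/chaos_recommender/recommender_output"


def recommender_output_path(namespace, now, output_dir=DEFAULT_OUTPUT_DIR):
    """Build <output_dir>/recommender_<namespace>_YYYY-MM-DD_HH-MM-SS.json.

    Hypothetical helper: the name, signature, and ".json" suffix are
    assumptions made for illustration only.
    """
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    file_name = f"recommender_{namespace}_{timestamp}.json"
    return str(PurePosixPath(output_dir) / file_name)


# "robot-shop" is an example namespace, not one taken from the commit.
print(recommender_output_path("robot-shop", datetime(2024, 2, 27, 11, 9, 0)))
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the naming logic easy to test.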
Naga Ravi Chaitanya Elluri
ab951adb78 Expose thresholds config options (#574)
This commit allows users to edit the thresholds in the chaos-recommender
config to be able to identify outliers based on their use case.

Fixes https://github.com/krkn-chaos/krkn/issues/509
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-02-26 09:43:34 -05:00
Paige Rubendall
a9a7fb7e51 updating release version in dockerfiles (#578)
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-02-21 10:17:02 -05:00
Naga Ravi Chaitanya Elluri
5a8d5b0fe1 Allow critical alerts check when enable_alerts is disabled
This covers use case where user wants to just check for critical alerts
post chaos without having to enable the alerts evaluation feature which
evaluates prom queries specified in an alerts file.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-02-19 23:15:47 -05:00
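
The decoupling described in the commit above can be pictured as two independent switches: one for evaluating the queries in an alerts file, and one for the post-chaos critical alerts check. The sketch below only illustrates that idea. `enable_alerts` comes from the commit title, while the `check_critical_alerts` key, the `planned_checks` helper, and the returned labels are assumptions rather than krkn's actual implementation.

```python
def planned_checks(performance_monitoring):
    """Return the post-chaos checks that would run for a given config section.

    Hypothetical helper: each switch is read independently, so the critical
    alerts check does not require the alerts-file evaluation to be enabled.
    """
    checks = []
    if performance_monitoring.get("enable_alerts", False):
        # Evaluates the Prometheus queries listed in the alerts file.
        checks.append("evaluate_alert_queries")
    if performance_monitoring.get("check_critical_alerts", False):
        # Only looks for critical alerts firing after the chaos run.
        checks.append("check_critical_alerts")
    return checks


config = {"enable_alerts": False, "check_critical_alerts": True}
print(planned_checks(config))
```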
Paige Rubendall
c440dc4b51 Taking out start and end time for critical alerts (#572)
* taking out start and end time

Signed-off-by: Paige Rubendall <prubenda@redhat.com>

* adding only break when alert fires

Signed-off-by: Paige Rubendall <prubenda@redhat.com>

* fail at end if alert had fired

Signed-off-by: Paige Rubendall <prubenda@redhat.com>

* adding new krkn-lib function with no range

Signed-off-by: Paige Rubendall <prubenda@redhat.com>

* updating requirements to new krkn-lib

Signed-off-by: Paige Rubendall <prubenda@redhat.com>

---------

Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-02-19 09:28:13 -05:00
Paige Rubendall
b174c51ee0 adding check if connection was properly set
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-02-15 17:28:20 -05:00
Paige Rubendall
fec0434ce1 adding upload to elastic search
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-02-13 12:01:40 -05:00
Tullio Sebastiani
1067d5ec8d changed telemetry endpoint for funtests (#571)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-13 17:06:20 +01:00
Tullio Sebastiani
85ea1ef7e1 Dockerfiles update (#570)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-09 17:20:06 +01:00
Tullio Sebastiani
2e38b8b033 Kubernetes prometheus telemetry + functional tests (#566)
added comment on the node selector input.yaml

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-09 16:38:12 +01:00
Tullio Sebastiani
c7ea366756 frozen package versions (#569)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-02-09 16:10:25 +01:00
Paige Rubendall
67d4ee9fa2 updating comment to match query (#568)
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-02-08 22:09:37 -05:00
Paige Rubendall
fa59834bae updating release version (#565)
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-01-25 11:12:00 -05:00
Paige Rubendall
f154bcb692 adding krkn report location
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
2024-01-25 10:45:01 -05:00
Naga Ravi Chaitanya Elluri
60ece4b1b8 Use 0.38.0 wheel version to fix security vulnerability
Reported by https://snyk.io/

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-25 09:51:19 -05:00
Naga Ravi Chaitanya Elluri
d660542a40 Add CNCF trademark guidelines and update community members (#560)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-24 14:13:53 -05:00
Naga Ravi Chaitanya Elluri
2e651798fa Update redhat-chaos references with krkn-chaos
The tools are now hosted under https://github.com/krkn-chaos

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-24 13:40:39 -05:00
Tullio Sebastiani
f801dfce54 functional tests pointing to real scenario config files
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

app_outage fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-18 12:54:39 -05:00
Tullio Sebastiani
8b95458444 Dockerfile v1.5.5 (#558)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-17 17:06:51 +01:00
Naga Ravi Chaitanya Elluri
ce1ae78f1f Update new references in the docs
This commit also updates the support matrix docs for the time scenarios.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-17 10:47:49 -05:00
Tullio Sebastiani
967753489b arcaflow hog scenarios + app outage functional tests
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-17 10:40:33 -05:00
Tullio Sebastiani
aa16cb1bf2 fixed io-hog scenario (#555)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-17 16:05:35 +01:00
Tullio Sebastiani
ac47e215d8 Functional Tests porting to kubernetes (#553)
* Functional Tests porting to kubernetes

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-17 09:48:43 +01:00
Tullio Sebastiani
4f7c58106d Dockerfile v1.5.4 (#552)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-15 19:22:52 +01:00
Tullio Sebastiani
a7e5ae6c80 Replaced oc debug command execution on node with a native version (#547)
* native time skew feature

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* fixed podname conflict issue

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated krkn-lib to v1.4.6

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* fixed pod conflict issue

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-15 12:15:38 -05:00
Tullio Sebastiani
aa030a21d3 Fixes the critical alerts exception with the start_time > end_time
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-15 11:11:45 -05:00
Paige Rubendall
631f12bdff Adding push to both red hat and krkn chaos quay (#550)
* adding push to both red hat and krkn chaos quay

* tag redhat chaos from krkn-chaos image

* login to both quays
2024-01-12 13:58:50 -05:00
Naga Ravi Chaitanya Elluri
2525982c55 Rename repo name and update workflow
This commit also removes OpenShift references and updates source
in the dockerfile.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-12 13:21:37 -05:00
dependabot[bot]
9760d7d97d Bump jinja2 from 3.0.3 to 3.1.3
Bumps [jinja2](https://github.com/pallets/jinja) from 3.0.3 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.0.3...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-11 15:40:09 -05:00
Naga Ravi Chaitanya Elluri
720488c159 Add new blogs to the useful resources list (#546)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-10 15:45:36 -05:00
Naga Ravi Chaitanya Elluri
487a9f464c Deprecate long term metrics collection
This will be added back soon via native prometheus integration.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-10 15:08:58 -05:00
Tullio Sebastiani
d9e137e85a fixes prometheus url check on Kubernetes
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-10 11:23:02 -05:00
Tullio Sebastiani
d6c8054275 changed docker files (#543)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-10 12:22:42 +01:00
Paige Rubendall
462f93ad87 updating scenarios to have deployers (#537)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-10 12:06:15 +01:00
Mark McLoughlin
c200f0774f Fix some links in README.md (#542)
* Fix github.io link in README.md

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* Fix krknChaos-hub link in README.md

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* Fix kube-burner link in README.md

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

---------

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2024-01-09 11:49:52 -05:00
Tullio Sebastiani
f2d7f88cb8 Krkn lib prometheus client + kube_burner references removed
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-09 10:43:32 -05:00
Naga Ravi Chaitanya Elluri
93f1f19411 Focus on Kubernetes in the chaos testing guide
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-08 20:09:12 -05:00
Naga Ravi Chaitanya Elluri
83c6058816 Use CNCF code of conduct
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-03 10:53:47 -05:00
Naga Ravi Chaitanya Elluri
ee34d08f41 Rename Krkn to KrknChaos (#536)
This change will help reflect the use case of the tool more evidently.
2023-12-18 16:40:34 -08:00
Tullio Sebastiani
41f9573563 Fixes cluster shutdown issue with single entry in scenario config (#535)
* fixed cluster shutdown issue

* fixed config file list parsing
2023-12-15 14:22:25 -05:00
Tullio Sebastiani
c00328cc2b v1.4.5 (#534)
2023-12-15 11:00:41 +01:00
Tullio Sebastiani
c2431d548f functional tests adapted to newer version of crc-cloud + OCP 4.14.1 (#532)
2023-12-11 12:48:42 -05:00
Paige Rubendall
b03511850b taking out more litmus references
2023-12-03 13:10:52 +05:30
Sahil Shah
82db2fca75 Removing Litmus Scenario
2023-11-16 09:50:04 -05:00
Naga Ravi Chaitanya Elluri
afe8d817a9 Print telemetry data location to stdout
This commit also deprecates litmus integration.
2023-11-13 10:01:17 -05:00
Tullio Sebastiani
dbf02a6c22 updated krkn-lib to fix log filtering in prow (#527)
2023-11-09 17:47:00 +01:00
Naga Ravi Chaitanya Elluri
94bec8dc9b Add missing import to get values from yaml (#526)
* Add missing import to get values from yaml

* Update Dockerfile

* Update Dockerfile-ppc64le

---------

Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
2023-11-07 11:07:17 +01:00
yogananth-subramanian
2111bab9a4 Pod ingress network shaping Chaos scenario
The scenario introduces network latency, packet loss, and bandwidth restriction in the Pod's network interface. The purpose of this scenario is to observe faults caused by random variations in the network.

The example config below applies ingress traffic shaping to the OpenShift console.
````
- id: pod_ingress_shaping
  config:
    namespace: openshift-console   # Required - Namespace of the pod to which the filter needs to be applied.
    label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
    network_params:
        latency: 500ms             # Add 500ms latency to ingress traffic from the pod.
````
2023-11-06 23:34:17 -05:00
Kamesh Akella
b734f1dd05 Updating the chaos recommender README to point to accurate python version
2023-11-03 11:23:43 -04:00
Tullio Sebastiani
7a966a71d0 krkn integration of telemetry events collection (#523)
* function package refactoring in krkn-lib

* cluster events collection flag

* krkn-lib version bump

requirements

* dockerfile bump
2023-10-31 14:31:33 -04:00
Naga Ravi Chaitanya Elluri
43d891afd3 Bump telemetry archive default size to 500MB
This commit also removes litmus configs as they are not maintained.
2023-10-30 12:50:04 -04:00
Tullio Sebastiani
27fabfd4af OCP/K8S functionalities and packages splitting in krkn-lib (#507)
* krkn-lib ocp/k8s split adaptation

* library reference updated

* requirements update

* rebase with main + fix
2023-10-30 17:31:48 +01:00
Tullio Sebastiani
724068a978 Chaos recommender refactoring (#516)
* basic structure working

* config and options refactoring

nits and changes

* removed unused function with typo + fixed duration

* removed unused arguments

* minor fixes
2023-10-30 15:51:09 +01:00
Tullio Sebastiani
c9778474f1 arcaflow version bump (#520)
arcaflow version bump

stressng version typo
2023-10-27 18:09:46 +02:00
dependabot[bot]
6efdb2eb84 Bump werkzeug from 2.2.3 to 3.0.1
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.2.3 to 3.0.1.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/2.2.3...3.0.1)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-26 11:09:29 -04:00
Naga Ravi Chaitanya Elluri
0e852da7d4 Deprecate kubernetes method of deploying Krkn
This will ensure users will use the recommended methods (standalone or containerized)
of installing and running Krkn.
2023-10-25 12:32:46 -04:00
jtydlack
86d1fda325 Fix container scenario to accept only signal number (#350) (#485)
2023-10-24 16:51:48 -04:00
Naga Ravi Chaitanya Elluri
fc6344176b Add pointer to the CNCF sandbox discussion (#517)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2023-10-24 16:07:40 -04:00
jtydlack
ff469579e9 Use function get_yaml_item_value
Enables using default even though the value was loaded as None.
2023-10-24 14:55:49 -04:00
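
The behaviour this commit relies on, falling back to a default even when the key exists but was loaded as `None`, can be shown with a small stand-in. The function below is not `get_yaml_item_value` itself. `yaml_item_or_default` and the scenario keys are hypothetical names used only to show why a plain `dict.get` is not enough.

```python
def yaml_item_or_default(config, key, default):
    """Return config[key], or default when the key is missing OR its value is None.

    Hypothetical stand-in written for illustration only.
    """
    value = config.get(key)
    return default if value is None else value


# A YAML line such as "wait_duration:" (key present, no value) loads as None.
scenario = {"wait_duration": None, "iterations": 0}

print(scenario.get("wait_duration", 60))                    # None: dict.get only covers a missing key
print(yaml_item_or_default(scenario, "wait_duration", 60))  # 60
print(yaml_item_or_default(scenario, "iterations", 1))      # 0: a falsy but non-None value is kept
```

Comparing against `None` explicitly, rather than using `value or default`, avoids overriding legitimate values such as `0`, `False`, or an empty string.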
Naga Ravi Chaitanya Elluri
8cbd1c5e7f Add docs for installing chaos-recommender dependencies
This commit also updates roadmap around chaos-recommender.
2023-10-18 08:56:33 -04:00
Mudit Verma
5953e53b46 chaos recommendation entry in README (#510)
2023-10-16 11:26:32 -04:00
Mudit Verma
23f1fc044b Chaos Recommendation Utility (#508)
* application profiling based chaos recommendation

* deleted unused dir

* Update requirements.txt

Signed-off-by: Mudit Verma <mudiverm@in.ibm.com>

* Update config.ini

Signed-off-by: Mudit Verma <mudiverm@in.ibm.com>

* Update Makefile

Signed-off-by: Mudit Verma <mudiverm@in.ibm.com>

* Update Dockerfile

Signed-off-by: Mudit Verma <mudiverm@in.ibm.com>

* Update README.md

Signed-off-by: Mudit Verma <mudiverm@in.ibm.com>

---------

Signed-off-by: Mudit Verma <mudiverm@in.ibm.com>
2023-10-16 10:06:02 -04:00
Naga Ravi Chaitanya Elluri
69e386db53 Update roadmap with upcoming integrations and enhancements
2023-10-11 09:24:34 -04:00
392 changed files with 31652 additions and 14673 deletions

4
.coveragerc Normal file

@@ -0,0 +1,4 @@
[run]
omit =
tests/*
krkn/tests/**

1
.github/CODEOWNERS vendored Normal file

@@ -0,0 +1 @@
* @paigerube14 @tsebastiani @chaitanyaenr

43
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file

@@ -0,0 +1,43 @@
---
name: Bug report
about: Create a report about an issue
title: "[BUG]"
labels: bug
---
# Bug Description
## **Describe the bug**
A clear and concise description of what the bug is.
## **To Reproduce**
Any specific steps used to reproduce the behavior
### Scenario File
Scenario file(s) that were specified in your config file (can be starred (*) with confidential information )
```yaml
<config>
```
### Config File
Config file you used when error was seen (the default used is config/config.yaml)
```yaml
<config>
```
## **Expected behavior**
A clear and concise description of what you expected to happen.
## **Krkn Output**
Krkn output to help show your problem
## **Additional context**
Add any other context about the problem

16
.github/ISSUE_TEMPLATE/feature.md vendored Normal file

@@ -0,0 +1,16 @@
---
name: New Feature Request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: ''
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to see added/changed. Ex. new parameter in [xxx] scenario, new scenario that does [xxx]
**Additional context**
Add any other context about the feature request here.

47
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@@ -0,0 +1,47 @@
# Type of change
- [ ] Refactor
- [ ] New feature
- [ ] Bug fix
- [ ] Optimization
# Description
<-- Provide a brief description of the changes made in this PR. -->
## Related Tickets & Documents
If no related issue, please create one and start the conversation on wants of
- Related Issue #:
- Closes #:
# Documentation
- [ ] **Is documentation needed for this update?**
If checked, a documentation PR must be created and merged in the [website repository](https://github.com/krkn-chaos/website/).
## Related Documentation PR (if applicable)
<-- Add the link to the corresponding documentation PR in the website repository -->
# Checklist before requesting a review
- [ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See [contributing guidelines](https://krkn-chaos.dev/docs/contribution-guidelines/)
See [testing your changes](https://krkn-chaos.dev/docs/developers-guide/testing-changes/) and run on any Kubernetes or OpenShift cluster to validate your changes
- [ ] I have performed a self-review of my code by running krkn and specific scenario
- [ ] If it is a core feature, I have added thorough unit tests with above 80% coverage
*REQUIRED*:
Description of combination of tests performed and output of run
```bash
python run_kraken.py
...
<---insert test results output--->
```
OR
```bash
python -m coverage run -a -m unittest discover -s tests -v
...
<---insert test results output--->
```

7
.github/release-template.md vendored Normal file

@@ -0,0 +1,7 @@
## Release {VERSION}
### Download Artifacts
- 📦 Krkn sources (noarch): [krkn-{VERSION}-src.tar.gz](https://krkn-chaos.gateway.scarf.sh/krkn-src-{VERSION}.tar.gz)
### Changes
{CHANGES}

View File

@@ -1,51 +0,0 @@
name: Build Krkn
on:
pull_request:
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v3
- name: Create multi-node KinD cluster
uses: redhat-chaos/actions/kind@main
- name: Install Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
architecture: 'x64'
- name: Install environment
run: |
sudo apt-get install build-essential python3-dev
pip install --upgrade pip
pip install -r requirements.txt
- name: Run unit tests
run: python -m coverage run -a -m unittest discover -s tests -v
- name: Run CI
run: |
./CI/run.sh
cat ./CI/results.markdown >> $GITHUB_STEP_SUMMARY
echo >> $GITHUB_STEP_SUMMARY
- name: Upload CI logs
uses: actions/upload-artifact@v3
with:
name: ci-logs
path: CI/out
if-no-files-found: error
- name: Collect coverage report
run: |
python -m coverage html
- name: Publish coverage report to job summary
run: |
pip install html2text
html2text --ignore-images --ignore-links -b 0 htmlcov/index.html >> $GITHUB_STEP_SUMMARY
- name: Upload coverage data
uses: actions/upload-artifact@v3
with:
name: coverage
path: htmlcov
if-no-files-found: error
- name: Check CI results
run: grep Fail CI/results.markdown && false || true

View File

@@ -1,8 +1,7 @@
name: Docker Image CI
on:
push:
branches:
- main
tags: ['v[0-9].[0-9]+.[0-9]+']
pull_request:
jobs:
@@ -12,19 +11,45 @@ jobs:
- name: Check out code
uses: actions/checkout@v3
- name: Build the Docker images
run: docker build --no-cache -t quay.io/redhat-chaos/krkn containers/
if: startsWith(github.ref, 'refs/tags')
run: |
./containers/compile_dockerfile.sh
docker build --no-cache -t quay.io/krkn-chaos/krkn containers/ --build-arg TAG=${GITHUB_REF#refs/tags/}
docker tag quay.io/krkn-chaos/krkn quay.io/redhat-chaos/krkn
docker tag quay.io/krkn-chaos/krkn quay.io/krkn-chaos/krkn:${GITHUB_REF#refs/tags/}
docker tag quay.io/krkn-chaos/krkn quay.io/redhat-chaos/krkn:${GITHUB_REF#refs/tags/}
- name: Test Build the Docker images
if: ${{ github.event_name == 'pull_request' }}
run: |
./containers/compile_dockerfile.sh
docker build --no-cache -t quay.io/krkn-chaos/krkn containers/ --build-arg PR_NUMBER=${{ github.event.pull_request.number }}
- name: Login in quay
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
if: startsWith(github.ref, 'refs/tags')
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
env:
QUAY_USER: ${{ secrets.QUAY_USERNAME }}
QUAY_TOKEN: ${{ secrets.QUAY_PASSWORD }}
- name: Push the KrknChaos Docker images
if: startsWith(github.ref, 'refs/tags')
run: |
docker push quay.io/krkn-chaos/krkn
docker push quay.io/krkn-chaos/krkn:${GITHUB_REF#refs/tags/}
- name: Login in to redhat-chaos quay
if: startsWith(github.ref, 'refs/tags/v')
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
env:
QUAY_USER: ${{ secrets.QUAY_USER_1 }}
QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}
- name: Push the Docker images
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
run: docker push quay.io/redhat-chaos/krkn
- name: Push the RedHat Chaos Docker images
if: startsWith(github.ref, 'refs/tags')
run: |
docker push quay.io/redhat-chaos/krkn
docker push quay.io/redhat-chaos/krkn:${GITHUB_REF#refs/tags/}
- name: Rebuild krkn-hub
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
if: startsWith(github.ref, 'refs/tags')
uses: redhat-chaos/actions/krkn-hub@main
with:
QUAY_USER: ${{ secrets.QUAY_USER_1 }}
QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}
QUAY_USER: ${{ secrets.QUAY_USERNAME }}
QUAY_TOKEN: ${{ secrets.QUAY_PASSWORD }}
AUTOPUSH: ${{ secrets.AUTOPUSH }}

View File

@@ -1,111 +0,0 @@
on: issue_comment
jobs:
check_user:
# This job only runs for pull request comments
name: Check User Authorization
env:
USERS: ${{vars.USERS}}
if: contains(github.event.comment.body, '/funtest') && contains(github.event.comment.html_url, '/pull/')
runs-on: ubuntu-latest
steps:
- name: Check User
run: |
for name in `echo $USERS`
do
name="${name//$'\r'/}"
name="${name//$'\n'/}"
if [ $name == "${{github.event.sender.login}}" ]
then
echo "user ${{github.event.sender.login}} authorized, action started..."
exit 0
fi
done
echo "user ${{github.event.sender.login}} is not allowed to run functional tests Action"
exit 1
pr_commented:
# This job only runs for pull request comments containing /funtest
name: Functional Tests
if: contains(github.event.comment.body, '/funtest') && contains(github.event.comment.html_url, '/pull/')
runs-on: ubuntu-latest
needs:
- check_user
steps:
- name: Check out Kraken
uses: actions/checkout@v3
- name: Checkout Pull Request
run: gh pr checkout ${{ github.event.issue.number }}
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Install OC CLI
uses: redhat-actions/oc-installer@v1
with:
oc_version: latest
- name: Install python 3.9
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Setup kraken dependencies
run: pip install -r requirements.txt
- name: Create Workdir & export the path
run: |
mkdir workdir
echo "WORKDIR_PATH=`pwd`/workdir" >> $GITHUB_ENV
- name: Teardown CRC (Post Action)
uses: webiny/action-post-run@3.0.0
id: post-run-command
with:
# currently using the image from the tsebastiani quay.io repo
# until a fix is merged in the upstream one
# post action run cannot (apparently) be properly indented
run: docker run -v "${{ env.WORKDIR_PATH }}:/workdir" -e WORKING_MODE=T -e AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} -e AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} -e AWS_DEFAULT_REGION=us-west-2 -e TEARDOWN_RUN_ID=crc quay.io/tsebastiani/crc-cloud
- name: Run CRC
# currently using the image from the tsebastiani quay.io repo
# until a fix is merged in the upstream one
run: |
docker run -v "${{ env.WORKDIR_PATH }}:/workdir" \
-e WORKING_MODE=C \
-e PULL_SECRET="${{ secrets.PULL_SECRET }}" \
-e AWS_ACCESS_KEY_ID="${{ secrets.AWS_ACCESS_KEY_ID }}" \
-e AWS_SECRET_ACCESS_KEY="${{ secrets.AWS_SECRET_ACCESS_KEY }}" \
-e AWS_DEFAULT_REGION=us-west-2 \
-e CREATE_RUN_ID=crc \
-e PASS_KUBEADMIN="${{ secrets.KUBEADMIN_PWD }}" \
-e PASS_REDHAT="${{ secrets.REDHAT_PWD }}" \
-e PASS_DEVELOPER="${{ secrets.DEVELOPER_PWD }}" \
quay.io/tsebastiani/crc-cloud
- name: OpenShift login and example deployment, GitHub Action env init
env:
NAMESPACE: test-namespace
DEPLOYMENT_NAME: test-nginx
KUBEADMIN_PWD: '${{ secrets.KUBEADMIN_PWD }}'
run: ./CI/CRC/init_github_action.sh
- name: Setup test suite
run: |
yq -i '.kraken.port="8081"' CI/config/common_test_config.yaml
yq -i '.kraken.signal_address="0.0.0.0"' CI/config/common_test_config.yaml
echo "test_app_outages_gh" > ./CI/tests/my_tests
echo "test_container" >> ./CI/tests/my_tests
echo "test_namespace" >> ./CI/tests/my_tests
echo "test_net_chaos" >> ./CI/tests/my_tests
echo "test_time" >> ./CI/tests/my_tests
- name: Print affected config files
run: |
echo -e "## CI/config/common_test_config.yaml\n\n"
cat CI/config/common_test_config.yaml
- name: Running test suite
run: |
./CI/run.sh
- name: Print test output
run: cat CI/out/*
- name: Create coverage report
run: |
echo "# Test results" > $GITHUB_STEP_SUMMARY
cat CI/results.markdown >> $GITHUB_STEP_SUMMARY
echo "# Test coverage" >> $GITHUB_STEP_SUMMARY
python -m coverage report --format=markdown >> $GITHUB_STEP_SUMMARY
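
The `check_user` job in this removed workflow authorizes a commenter by looping over a whitespace-separated allow-list and stripping stray carriage returns before comparing. A minimal sketch of that check as a reusable function follows; `is_authorized` and the sample logins are hypothetical names used only for illustration, not part of the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: is_authorized is a hypothetical helper, not a krkn script.
# Succeeds when $1 (a login) appears in $2 (a whitespace-separated list).
is_authorized() {
  local login=$1 users=$2 name
  # Unquoted on purpose: rely on word splitting to iterate over the list.
  for name in $users; do
    # Values pasted into repository variables may carry Windows line endings.
    name=${name//$'\r'/}
    if [[ $name == "$login" ]]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical allow-list with CRLF line endings.
USERS=$'alice\r\nbob'
if is_authorized "bob" "$USERS"; then
  echo "authorized"
else
  echo "denied"
fi
```

Quoting the right-hand side of `==` keeps the comparison literal, so a login can never be interpreted as a glob pattern.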

60
.github/workflows/release.yml vendored Normal file

@@ -0,0 +1,60 @@
name: Create Release
on:
push:
tags:
- 'v*'
jobs:
release:
permissions:
contents: write
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: calculate previous tag
run: |
git fetch --tags origin
PREVIOUS_TAG=$(git tag --sort=-creatordate | sed -n '2 p')
echo $PREVIOUS_TAG
echo "PREVIOUS_TAG=$PREVIOUS_TAG" >> "$GITHUB_ENV"
- name: generate release notes from template
id: release-notes
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
NOTES=$(gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
/repos/krkn-chaos/krkn/releases/generate-notes \
-f "tag_name=${{ github.ref_name }}" -f "target_commitish=main" -f "previous_tag_name=${{ env.PREVIOUS_TAG }}" | jq -r .body)
echo "NOTES<<EOF" >> $GITHUB_ENV
echo "$NOTES" >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
- name: replace placeholders in template
run: |
echo "${{ env.NOTES }}"
TEMPLATE=$(cat .github/release-template.md)
VERSION=${{ github.ref_name }}
NOTES="${{ env.NOTES }}"
OUTPUT=${TEMPLATE//\{VERSION\}/$VERSION}
OUTPUT=${OUTPUT//\{CHANGES\}/$NOTES}
echo "$OUTPUT" > release-notes.md
- name: create release
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
gh release create ${{ github.ref_name }} --title "${{ github.ref_name }}" -F release-notes.md
- name: Install Syft
run: |
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sudo sh -s -- -b /usr/local/bin
- name: Generate SBOM
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
syft . --scope all-layers --output cyclonedx-json > sbom.json
echo "SBOM generated successfully!"
gh release upload ${{ github.ref_name }} sbom.json
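
The "replace placeholders in template" step relies on bash pattern substitution to fill `{VERSION}` and `{CHANGES}`. The sketch below isolates that technique in a function; `render_release_notes` and the sample strings are hypothetical. The replacement is quoted here on the assumption that generated notes may contain `&`, which bash 5.2 and later can treat as "the matched text" in an unquoted replacement.

```shell
#!/usr/bin/env bash
# Sketch only: render_release_notes is a hypothetical helper.
# $1 = template text, $2 = version, $3 = change notes.
render_release_notes() {
  local template=$1 version=$2 notes=$3 output
  # ${var//pattern/replacement} replaces every match of the pattern.
  # The braces are escaped so they match literally.
  output=${template//\{VERSION\}/"$version"}
  output=${output//\{CHANGES\}/"$notes"}
  printf '%s\n' "$output"
}

render_release_notes 'Release {VERSION}: {CHANGES}' 'v1.2.3' 'fixes & cleanups'
```

Passing the template and notes as arguments, rather than interpolating them into the script text, also keeps quotes or backticks in the notes from being interpreted by the shell.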

45
.github/workflows/require-docs.yml vendored Normal file

@@ -0,0 +1,45 @@
name: Require Documentation Update
on:
pull_request:
types: [opened, edited, synchronize]
branches:
- main
jobs:
check-docs:
name: Check Documentation Update
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Check if Documentation is Required
id: check_docs
run: |
echo "Checking PR body for documentation checkbox..."
# Read the PR body from the GitHub event payload
if echo "${{ github.event.pull_request.body }}" | grep -qi '\[x\].*documentation needed'; then
echo "Documentation required detected."
echo "docs_required=true" >> $GITHUB_OUTPUT
else
echo "Documentation not required."
echo "docs_required=false" >> $GITHUB_OUTPUT
fi
- name: Enforce Documentation Update (if required)
if: steps.check_docs.outputs.docs_required == 'true'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Retrieve feature branch and repository owner from the GitHub context
FEATURE_BRANCH="${{ github.head_ref }}"
REPO_OWNER="${{ github.repository_owner }}"
WEBSITE_REPO="website"
echo "Searching for a merged documentation PR for feature branch: $FEATURE_BRANCH in $REPO_OWNER/$WEBSITE_REPO..."
MERGED_PR=$(gh pr list --repo "$REPO_OWNER/$WEBSITE_REPO" --state merged --json headRefName,title,url | jq -r \
--arg FEATURE_BRANCH "$FEATURE_BRANCH" '.[] | select(.title | contains($FEATURE_BRANCH)) | .url')
if [[ -z "$MERGED_PR" ]]; then
echo ":x: Documentation PR for branch '$FEATURE_BRANCH' is required and has not been merged."
exit 1
else
echo ":white_check_mark: Found merged documentation PR: $MERGED_PR"
fi
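
The checkbox detection in this workflow boils down to a case-insensitive `grep` over the PR body. A minimal sketch of that test, assuming the body is supplied through an environment variable (here the hypothetical `PR_BODY`) instead of being interpolated into the script text, which is the pattern GitHub's hardening guidance recommends for untrusted input:

```shell
#!/usr/bin/env bash
# Sketch only: docs_required and PR_BODY are hypothetical names.
# Reads a PR body on stdin and succeeds when a checked box is followed
# by "documentation needed" on the same line (case-insensitive).
docs_required() {
  grep -qi '\[x\].*documentation needed'
}

# In a workflow this could be set via `env:` from the pull request body.
PR_BODY='- [x] Documentation needed'

if printf '%s\n' "$PR_BODY" | docs_required; then
  echo "docs_required=true"
else
  echo "docs_required=false"
fi
```

Because `-i` is set, both `[x]` and `[X]` count as checked, while an empty `[ ]` box does not match.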

52
.github/workflows/stale.yml vendored Normal file

@@ -0,0 +1,52 @@
name: Manage Stale Issues and Pull Requests
on:
schedule:
# Run daily at 1:00 AM UTC
- cron: '0 1 * * *'
workflow_dispatch:
permissions:
issues: write
pull-requests: write
jobs:
stale:
name: Mark and Close Stale Issues and PRs
runs-on: ubuntu-latest
steps:
- name: Mark and close stale issues and PRs
uses: actions/stale@v9
with:
days-before-issue-stale: 60
days-before-issue-close: 14
stale-issue-label: 'stale'
stale-issue-message: |
This issue has been automatically marked as stale because it has not had any activity in the last 60 days.
It will be closed in 14 days if no further activity occurs.
If this issue is still relevant, please leave a comment or remove the stale label.
Thank you for your contributions to krkn!
close-issue-message: |
This issue has been automatically closed due to inactivity.
If you believe this issue is still relevant, please feel free to reopen it or create a new issue with updated information.
Thank you for your understanding!
close-issue-reason: 'not_planned'
days-before-pr-stale: 90
days-before-pr-close: 14
stale-pr-label: 'stale'
stale-pr-message: |
This pull request has been automatically marked as stale because it has not had any activity in the last 90 days.
It will be closed in 14 days if no further activity occurs.
If this PR is still relevant, please rebase it, address any pending reviews, or leave a comment.
Thank you for your contributions to krkn!
close-pr-message: |
This pull request has been automatically closed due to inactivity.
If you believe this PR is still relevant, please feel free to reopen it or create a new pull request with updated changes.
Thank you for your understanding!
# Exempt labels
exempt-issue-labels: 'bug,enhancement,good first issue'
exempt-pr-labels: 'pending discussions,hold'
remove-stale-when-updated: true

206
.github/workflows/tests.yml vendored Normal file

@@ -0,0 +1,206 @@
name: Functional & Unit Tests
on:
pull_request:
push:
branches:
- main
jobs:
tests:
# Common steps
name: Functional & Unit Tests
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v3
- name: Create multi-node KinD cluster
uses: redhat-chaos/actions/kind@main
- name: Deploy prometheus & Port Forwarding
uses: redhat-chaos/actions/prometheus@main
- name: Deploy Elasticsearch
with:
ELASTIC_PORT: ${{ env.ELASTIC_PORT }}
RUN_ID: ${{ github.run_id }}
uses: redhat-chaos/actions/elastic@main
- name: Download elastic password
uses: actions/download-artifact@v4
with:
name: elastic_password_${{ github.run_id }}
- name: Set elastic password on env
run: |
ELASTIC_PASSWORD=$(cat elastic_password.txt)
echo "ELASTIC_PASSWORD=$ELASTIC_PASSWORD" >> "$GITHUB_ENV"
- name: Install Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
architecture: 'x64'
- name: Install environment
run: |
sudo apt-get install build-essential python3-dev
pip install --upgrade pip
pip install -r requirements.txt
pip install coverage
- name: Deploy test workloads
run: |
es_pod_name=$(kubectl get pods -l "app=elasticsearch-master" -o name)
echo "POD_NAME: $es_pod_name"
kubectl --namespace default port-forward $es_pod_name 9200 &
prom_name=$(kubectl get pods -n monitoring -l "app.kubernetes.io/name=prometheus" -o name)
kubectl --namespace monitoring port-forward $prom_name 9090 &
# Wait for Elasticsearch to be ready
echo "Waiting for Elasticsearch to be ready..."
for i in {1..30}; do
if curl -k -s -u elastic:$ELASTIC_PASSWORD https://localhost:9200/_cluster/health > /dev/null 2>&1; then
echo "Elasticsearch is ready!"
break
fi
echo "Attempt $i: Elasticsearch not ready yet, waiting..."
sleep 2
done
kubectl apply -f CI/templates/outage_pod.yaml
kubectl wait --for=condition=ready pod -l scenario=outage --timeout=300s
kubectl apply -f CI/templates/container_scenario_pod.yaml
kubectl wait --for=condition=ready pod -l scenario=container --timeout=300s
kubectl create namespace namespace-scenario
kubectl apply -f CI/templates/time_pod.yaml
kubectl wait --for=condition=ready pod -l scenario=time-skew --timeout=300s
kubectl apply -f CI/templates/service_hijacking.yaml
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=proxy" --timeout=300s
kubectl apply -f CI/legacy/scenarios/volume_scenario.yaml
kubectl wait --for=condition=ready pod kraken-test-pod -n kraken --timeout=300s
- name: Get Kind nodes
run: |
kubectl get nodes --show-labels=true
# Pull request only steps
- name: Run unit tests
run: python -m coverage run -a -m unittest discover -s tests -v
- name: Setup Functional Tests
run: |
yq -i '.kraken.performance_monitoring="localhost:9090"' CI/config/common_test_config.yaml
yq -i '.elastic.elastic_port=9200' CI/config/common_test_config.yaml
yq -i '.elastic.elastic_url="https://localhost"' CI/config/common_test_config.yaml
yq -i '.elastic.enable_elastic=False' CI/config/common_test_config.yaml
yq -i '.elastic.password="${{env.ELASTIC_PASSWORD}}"' CI/config/common_test_config.yaml
yq -i '.performance_monitoring.prometheus_url="http://localhost:9090"' CI/config/common_test_config.yaml
echo "test_app_outages" >> ./CI/tests/functional_tests
echo "test_container" >> ./CI/tests/functional_tests
echo "test_cpu_hog" >> ./CI/tests/functional_tests
echo "test_customapp_pod" >> ./CI/tests/functional_tests
echo "test_io_hog" >> ./CI/tests/functional_tests
echo "test_memory_hog" >> ./CI/tests/functional_tests
echo "test_namespace" >> ./CI/tests/functional_tests
echo "test_net_chaos" >> ./CI/tests/functional_tests
echo "test_node" >> ./CI/tests/functional_tests
echo "test_pod" >> ./CI/tests/functional_tests
echo "test_pod_error" >> ./CI/tests/functional_tests
echo "test_service_hijacking" >> ./CI/tests/functional_tests
echo "test_pod_network_filter" >> ./CI/tests/functional_tests
echo "test_pod_server" >> ./CI/tests/functional_tests
echo "test_time" >> ./CI/tests/functional_tests
# echo "test_pvc" >> ./CI/tests/functional_tests
# Push on main only steps + all other functional to collect coverage
# for the badge
- name: Configure AWS Credentials
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region : ${{ secrets.AWS_REGION }}
- name: Setup Post Merge Request Functional Tests
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
run: |
yq -i '.telemetry.username="${{secrets.TELEMETRY_USERNAME}}"' CI/config/common_test_config.yaml
yq -i '.telemetry.password="${{secrets.TELEMETRY_PASSWORD}}"' CI/config/common_test_config.yaml
echo "test_telemetry" >> ./CI/tests/functional_tests
# Final common steps
- name: Run Functional tests
env:
AWS_BUCKET: ${{ secrets.AWS_BUCKET }}
run: |
./CI/run.sh
cat ./CI/results.markdown >> $GITHUB_STEP_SUMMARY
echo >> $GITHUB_STEP_SUMMARY
- name: Upload CI logs
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ci-logs
path: CI/out
if-no-files-found: error
- name: Collect coverage report
if: ${{ always() }}
run: |
python -m coverage html
python -m coverage json
- name: Publish coverage report to job summary
if: ${{ always() }}
run: |
pip install html2text
html2text --ignore-images --ignore-links -b 0 htmlcov/index.html >> $GITHUB_STEP_SUMMARY
- name: Upload coverage data
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: coverage
path: htmlcov
if-no-files-found: error
- name: Upload json coverage
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: coverage.json
path: coverage.json
if-no-files-found: error
- name: Check CI results
if: ${{ always() }}
run: "! grep Fail CI/results.markdown"
badge:
permissions:
contents: write
name: Generate Coverage Badge
runs-on: ubuntu-latest
needs:
- tests
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
steps:
- name: Check out doc repo
uses: actions/checkout@master
with:
repository: krkn-chaos/krkn-lib-docs
path: krkn-lib-docs
ssh-key: ${{ secrets.KRKN_LIB_DOCS_PRIV_KEY }}
- name: Download json coverage
uses: actions/download-artifact@v4
with:
name: coverage.json
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Copy badge on GitHub Page Repo
env:
COLOR: yellow
run: |
# generate coverage badge on previously calculated total coverage
# and copy in the docs page
export TOTAL=$(python -c "import json;print(json.load(open('coverage.json'))['totals']['percent_covered_display'])")
[[ ${TOTAL%.*} -gt 40 ]] && COLOR=green
echo "TOTAL: $TOTAL"
echo "COLOR: $COLOR"
curl "https://img.shields.io/badge/coverage-$TOTAL%25-$COLOR" > ./krkn-lib-docs/coverage_badge_krkn.svg
- name: Push updated Coverage Badge
run: |
cd krkn-lib-docs
git add .
git config user.name "krkn-chaos"
git config user.email "krkn-actions@users.noreply.github.com"
git commit -m "[KRKN] Coverage Badge ${GITHUB_REF##*/}" || echo "no changes to commit"
git push
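
The badge job chooses a colour depending on whether total coverage exceeds 40. The sketch below shows that threshold check as a standalone function using integer arithmetic; `badge_color` is a hypothetical name, and the default threshold of 40 is taken from the workflow. Note that `>` inside `[[ ]]` compares strings, so a numeric test is the safer fit for percentages.

```shell
#!/usr/bin/env bash
# Sketch only: badge_color is a hypothetical helper.
# $1 = coverage total (e.g. "57" or "57.3"), $2 = optional threshold.
badge_color() {
  local total=$1 threshold=${2:-40}
  # Drop any fractional part so integer arithmetic applies.
  local whole=${total%%.*}
  # 10# forces base 10, avoiding octal parsing of values such as "08".
  if (( 10#$whole > threshold )); then
    echo green
  else
    echo yellow
  fi
}

badge_color 100   # green; the string test [[ 100 > 40 ]] would be false
badge_color 5     # yellow; the string test [[ 5 > 40 ]] would be true
```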

3
.gitignore vendored

@@ -16,6 +16,7 @@ __pycache__/*
*.out
kube-burner*
kube_burner*
recommender_*.json
# Project files
.ropeproject
@@ -61,7 +62,7 @@ inspect.local.*
!CI/config/common_test_config.yaml
CI/out/*
CI/ci_results
CI/scenarios/*node.yaml
CI/legacy/*node.yaml
CI/results.markdown
#env

9
ADOPTERS.md Normal file

@@ -0,0 +1,9 @@
# Krkn Adopters
This is a list of organizations that have publicly acknowledged usage of Krkn and shared details of how they are leveraging it in their environment for chaos engineering use cases. Do you want to add yourself to this list? Please fork the repository and open a PR with the required change.
| Organization | Since | Website | Use-Case |
|:-|:-|:-|:-|
| MarketAxess | 2024 | https://www.marketaxess.com/ | Kraken enables us to achieve our goal of increasing the reliability of our cloud products on Kubernetes. The tool allows us to automatically run various chaos scenarios, identify resilience and performance bottlenecks, and seamlessly restore the system to its original state once scenarios finish. These chaos scenarios include pod disruptions, node (EC2) outages, simulating availability zone (AZ) outages, and filling up storage spaces like EBS and EFS. The community is highly responsive to requests and works on expanding the tool's capabilities. MarketAxess actively contributes to the project, adding features such as the ability to leverage existing network ACLs and proposing several feature improvements to enhance test coverage. |
| Red Hat OpenShift | 2020 | https://www.redhat.com/ | Kraken is a highly reliable chaos testing tool used to ensure the quality and resiliency of Red Hat OpenShift. The engineering team runs all the test scenarios under Kraken on different cloud platforms on both self-managed and cloud services environments prior to the release of a new version of the product. The team also contributes to the Kraken project consistently, which helps the test scenarios keep up with the new features introduced to the product. Inclusion of this test coverage has contributed to gaining the trust of new and existing customers of the product. |
| IBM | 2023 | https://www.ibm.com/ | While working on AI for Chaos Testing at IBM Research, we closely collaborated with the Kraken (Krkn) team to advance intelligent chaos engineering. Our contributions included developing AI-enabled chaos injection strategies and integrating reinforcement learning (RL)-based fault search techniques into the Krkn tool, enabling it to identify and explore system vulnerabilities more efficiently. Kraken stands out as one of the most user-friendly and effective tools for chaos engineering, and the Kraken team's deep technical involvement played a crucial role in the success of this collaboration, helping bridge cutting-edge AI research with practical, real-world system reliability testing. |


@@ -1,44 +0,0 @@
apiVersion: v1
kind: Namespace
metadata:
name: $NAMESPACE
---
apiVersion: v1
kind: Service
metadata:
name: $DEPLOYMENT_NAME-service
namespace: $NAMESPACE
spec:
selector:
app: $DEPLOYMENT_NAME
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: $NAMESPACE
name: $DEPLOYMENT_NAME-deployment
spec:
replicas: 3
selector:
matchLabels:
app: $DEPLOYMENT_NAME
template:
metadata:
labels:
app: $DEPLOYMENT_NAME
spec:
containers:
- name: $DEPLOYMENT_NAME
image: nginxinc/nginx-unprivileged:stable-alpine
ports:
- name: http
containerPort: 8080


@@ -1,72 +0,0 @@
#!/bin/bash
SCRIPT_PATH=./CI/CRC
DEPLOYMENT_PATH=$SCRIPT_PATH/deployment.yaml
CLUSTER_INFO=cluster_infos.json
[[ -z $WORKDIR_PATH ]] && echo "[ERROR] please set \$WORKDIR_PATH environment variable" && exit 1
CLUSTER_INFO_PATH=$WORKDIR_PATH/crc/$CLUSTER_INFO
[[ ! -f $DEPLOYMENT_PATH ]] && echo "[ERROR] please run $0 from GitHub action root directory" && exit 1
[[ -z $KUBEADMIN_PWD ]] && echo "[ERROR] kubeadmin password not set, please check the repository secrets" && exit 1
[[ -z $DEPLOYMENT_NAME ]] && echo "[ERROR] please set \$DEPLOYMENT_NAME environment variable" && exit 1
[[ -z $NAMESPACE ]] && echo "[ERROR] please set \$NAMESPACE environment variable" && exit 1
[[ ! -f $CLUSTER_INFO_PATH ]] && echo "[ERROR] $CLUSTER_INFO not found in $CLUSTER_INFO_PATH" && exit 1
OPENSSL=`which openssl 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: openssl missing, please install it and try again" && exit 1
OC=`which oc 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: oc missing, please install it and try again" && exit 1
SED=`which sed 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: sed missing, please install it and try again" && exit 1
JQ=`which jq 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: jq missing, please install it and try again" && exit 1
ENVSUBST=`which envsubst 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: envsubst missing, please install it and try again" && exit 1
API_ADDRESS="$($JQ -r '.api.address' $CLUSTER_INFO_PATH)"
API_PORT="$($JQ -r '.api.port' $CLUSTER_INFO_PATH)"
BASE_HOST=`$JQ -r '.api.address' $CLUSTER_INFO_PATH | sed -r 's#https:\/\/api\.(.+\.nip\.io)#\1#'`
FQN=$DEPLOYMENT_NAME.apps.$BASE_HOST
echo "[INF] logging on $API_ADDRESS:$API_PORT"
COUNTER=1
until `$OC login --insecure-skip-tls-verify -u kubeadmin -p $KUBEADMIN_PWD $API_ADDRESS:$API_PORT > /dev/null 2>&1`
do
echo "[INF] login attempt $COUNTER"
[[ $COUNTER == 20 ]] && echo "[ERR] maximum login attempts exceeded, failing" && exit 1
((COUNTER++))
sleep 10
done
echo "[INF] deploying example deployment: $DEPLOYMENT_NAME in namespace: $NAMESPACE"
$ENVSUBST < $DEPLOYMENT_PATH | $OC apply -f - > /dev/null 2>&1
echo "[INF] creating SSL self-signed certificates for route https://$FQN"
$OPENSSL genrsa -out servercakey.pem > /dev/null 2>&1
$OPENSSL req -new -x509 -key servercakey.pem -out serverca.crt -subj "/CN=$FQN/O=Red Hat Inc./C=US" > /dev/null 2>&1
$OPENSSL genrsa -out server.key > /dev/null 2>&1
$OPENSSL req -new -key server.key -out server_reqout.txt -subj "/CN=$FQN/O=Red Hat Inc./C=US" > /dev/null 2>&1
$OPENSSL x509 -req -in server_reqout.txt -days 3650 -sha256 -CAcreateserial -CA serverca.crt -CAkey servercakey.pem -out server.crt > /dev/null 2>&1
echo "[INF] creating deployment: $DEPLOYMENT_NAME public route: https://$FQN"
$OC create route --namespace $NAMESPACE edge --service=$DEPLOYMENT_NAME-service --cert=server.crt --key=server.key --ca-cert=serverca.crt --hostname="$FQN" > /dev/null 2>&1
echo "[INF] setting github action environment variables"
NODE_NAME="`$OC get nodes -o json | $JQ -r '.items[0].metadata.name'`"
COVERAGE_FILE="`pwd`/coverage.md"
echo "DEPLOYMENT_NAME=$DEPLOYMENT_NAME" >> $GITHUB_ENV
echo "DEPLOYMENT_FQN=$FQN" >> $GITHUB_ENV
echo "API_ADDRESS=$API_ADDRESS" >> $GITHUB_ENV
echo "API_PORT=$API_PORT" >> $GITHUB_ENV
echo "NODE_NAME=$NODE_NAME" >> $GITHUB_ENV
echo "NAMESPACE=$NAMESPACE" >> $GITHUB_ENV
echo "COVERAGE_FILE=$COVERAGE_FILE" >> $GITHUB_ENV
echo "[INF] deployment name will be available in \${{ env.DEPLOYMENT_NAME }} with value $DEPLOYMENT_NAME"
echo "[INF] deployment fully qualified name will be available in \${{ env.DEPLOYMENT_FQN }} with value $FQN"
echo "[INF] OCP API address will be available in \${{ env.API_ADDRESS }} with value $API_ADDRESS"
echo "[INF] OCP API port will be available in \${{ env.API_PORT }} with value $API_PORT"
echo "[INF] OCP node name will be available in \${{ env.NODE_NAME }} with value $NODE_NAME"
echo "[INF] coverage file will be available in \${{ env.COVERAGE_FILE }} with value $COVERAGE_FILE"


@@ -1,7 +1,7 @@
## CI Tests
### First steps
Edit [my_tests](tests/my_tests) with tests you want to run
Edit [functional_tests](tests/functional_tests) with tests you want to run
### How to run
```./CI/run.sh```
@@ -11,7 +11,7 @@ This will run kraken using python, make sure python3 is set up and configured pr
### Adding a test case
1. Add in simple scenario yaml file to execute under [../CI/scenarios/](scenarios)
1. Add in simple scenario yaml file to execute under [../CI/scenarios/](legacy)
2. Copy [test_app_outages.sh](tests/test_app_outages.sh) as an example of how to get started
@@ -27,7 +27,7 @@ This will run kraken using python, make sure python3 is set up and configured pr
e. 15: Make sure name of config in line 14 matches what you pass on this line
4. Add test name to [my_tests](../CI/tests/my_tests) file
4. Add test name to [functional_tests](../CI/tests/functional_tests) file
a. This will be the name of the file without ".sh"


@@ -1,29 +1,31 @@
kraken:
distribution: openshift # Distribution can be kubernetes or openshift.
distribution: kubernetes # Distribution can be kubernetes or openshift.
kubeconfig_path: ~/.kube/config # Path to kubeconfig.
exit_on_failure: False # Exit when a post action scenario fails.
litmus_version: v1.13.6 # Litmus version to install.
litmus_uninstall: False # If you want to uninstall litmus if failure.
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
port: 8081 # Signal port
auto_rollback: True # Enable auto rollback for scenarios.
rollback_versions_directory: /tmp/kraken-rollback # Directory to store rollback version files.
chaos_scenarios: # List of policies/chaos scenarios to load.
- $scenario_type: # List of chaos pod scenarios to load.
- $scenario_file
$post_config
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed.
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal.
performance_monitoring:
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift.
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
kube_burner_binary_url: "https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz"
capture_metrics: False
config_path: config/kube_burner.yaml # Define the Elasticsearch url and index name in this config.
metrics_profile_path: config/metrics-aggregated.yaml
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set.
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error.
alert_profile: config/alerts # Path to alert profile with the prometheus queries.
enable_alerts: True # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
enable_metrics: True
alert_profile: config/alerts.yaml # Path or URL to alert profile with the prometheus queries
metrics_profile: config/metrics-report.yaml
check_critical_alerts: True # When enabled, checks prometheus for critical alerts firing after chaos.
tunings:
wait_duration: 6 # Duration to wait between each chaos scenario.
@@ -31,13 +33,42 @@ tunings:
daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever.
telemetry:
enabled: False # enable/disables the telemetry collection feature
api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production #telemetry service endpoint
username: username # telemetry service username
password: password # telemetry service password
api_url: https://yvnn4rfoi7.execute-api.us-west-2.amazonaws.com/test #telemetry service endpoint
username: $TELEMETRY_USERNAME # telemetry service username
password: $TELEMETRY_PASSWORD # telemetry service password
prometheus_namespace: 'monitoring' # prometheus namespace
prometheus_pod_name: 'prometheus-kind-prometheus-kube-prome-prometheus-0' # prometheus pod_name
prometheus_container_name: 'prometheus'
prometheus_backup: True # enables/disables prometheus data collection
full_prometheus_backup: False # if set to False, only the /prometheus/wal folder will be downloaded.
backup_threads: 5 # number of telemetry download/upload threads
archive_path: /tmp # local path where the archive files will be temporarly stored
archive_path: /tmp # local path where the archive files will be temporarily stored
max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
archive_size: 10000 # the size of the prometheus data archive size in KB. The lower the size of archive is
logs_backup: True
logs_filter_patterns:
- "(\\w{3}\\s\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2}\\.\\d+).+" # Sep 9 11:20:36.123425532
- "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+" # kinit 2023/09/15 11:20:36 log
- "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+" # 2023-09-15T11:20:36.123425532Z log
oc_cli_path: /usr/bin/oc # optional, if not specified it will be searched for in $PATH
events_backup: True # enables/disables cluster events collection
telemetry_group: "funtests"
elastic:
enable_elastic: False
verify_certs: False
elastic_url: "https://192.168.39.196" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
elastic_port: 32766
username: "elastic"
password: "test"
metrics_index: "krkn-metrics"
alerts_index: "krkn-alerts"
telemetry_index: "krkn-telemetry"
health_checks: # Utilizing health check endpoints to observe application behavior during chaos injection.
interval: # Interval in seconds to perform health checks, default value is 2 seconds
config: # Provide list of health check configurations for applications
- url: # Provide application endpoint
bearer_token: # Bearer token for authentication if any
auth: # Provide authentication credentials (username , password) in tuple format if any, ex:("admin","secretpassword")
exit_on_failure: # If value is True exits when health check failed for application, values can be True/False


@@ -45,15 +45,45 @@ metadata:
name: kraken-test-pod
namespace: kraken
spec:
securityContext:
fsGroup: 1001
# initContainer to fix permissions on the mounted volume
initContainers:
- name: fix-permissions
image: 'quay.io/centos7/httpd-24-centos7:centos7'
command:
- sh
- -c
- |
echo "Setting up permissions for /home/kraken..."
# Create the directory if it doesn't exist
mkdir -p /home/kraken
# Set ownership to user 1001 and group 1001
chown -R 1001:1001 /home/kraken
# Set permissions to allow read/write
chmod -R 755 /home/kraken
rm -rf /home/kraken/*
echo "Permissions fixed. Current state:"
ls -la /home/kraken
volumeMounts:
- mountPath: "/home/kraken"
name: kraken-test-pv
securityContext:
runAsUser: 0 # Run as root to fix permissions
volumes:
- name: kraken-test-pv
persistentVolumeClaim:
claimName: kraken-test-pvc
containers:
- name: kraken-test-container
image: 'quay.io/centos7/httpd-24-centos7:latest'
volumeMounts:
- mountPath: "/home/krake-dir/"
name: kraken-test-pv
image: 'quay.io/centos7/httpd-24-centos7:centos7'
securityContext:
privileged: true
runAsUser: 1001
runAsNonRoot: true
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
volumeMounts:
- mountPath: "/home/kraken"
name: kraken-test-pv


@@ -1,15 +1,14 @@
#!/bin/bash
set -x
MAX_RETRIES=60
OC=`which oc 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: oc missing, please install it and try again" && exit 1
KUBECTL=`which kubectl 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: kubectl missing, please install it and try again" && exit 1
wait_cluster_become_ready() {
COUNT=1
until `$OC get namespace > /dev/null 2>&1`
until `$KUBECTL get namespace > /dev/null 2>&1`
do
echo "[INF] waiting OpenShift to become ready, after $COUNT check"
echo "[INF] waiting Kubernetes to become ready, after $COUNT check"
sleep 3
[[ $COUNT == $MAX_RETRIES ]] && echo "[ERR] max retries exceeded, failing" && exit 1
((COUNT++))
@@ -18,9 +17,9 @@ wait_cluster_become_ready() {
ci_tests_loc="CI/tests/my_tests"
ci_tests_loc="CI/tests/functional_tests"
echo "running test suit consisting of ${ci_tests}"
echo -e "********* Running Functional Tests Suite *********\n\n"
rm -rf CI/out
@@ -37,9 +36,32 @@ echo 'Test | Result | Duration' >> $results
echo '-----------------------|--------|---------' >> $results
# Run each test
for test_name in `cat CI/tests/my_tests`
failed_tests=()
for test_name in `cat CI/tests/functional_tests`
do
wait_cluster_become_ready
./CI/run_test.sh $test_name $results
#wait_cluster_become_ready
return_value=`./CI/run_test.sh $test_name $results`
if [[ $return_value == 1 ]]
then
echo "Failed"
failed_tests+=("$test_name")
fi
wait_cluster_become_ready
done
if (( ${#failed_tests[@]}>0 ))
then
echo -e "\n\n======================================================================"
echo -e "\n FUNCTIONAL TESTS FAILED ${failed_tests[*]} ABORTING"
echo -e "\n======================================================================\n\n"
for test in "${failed_tests[@]}"
do
echo -e "\n********** $test KRKN RUN OUTPUT **********\n"
cat "CI/out/$test.out"
echo -e "\n********************************************\n\n\n\n"
done
exit 1
fi

View File

@@ -1,5 +1,4 @@
#!/bin/bash
set -x
readonly SECONDS_PER_HOUR=3600
readonly SECONDS_PER_MINUTE=60
function get_time_format() {
@@ -14,9 +13,7 @@ ci_test=`echo $1`
results_file=$2
echo -e "\n======================================================================"
echo -e " CI test for ${ci_test} "
echo -e "======================================================================\n"
echo -e "test: ${ci_test}" >&2
ci_results="CI/out/$ci_test.out"
# Test ci
@@ -28,13 +25,16 @@ then
# if the test passes update the results and complete
duration=$SECONDS
duration=$(get_time_format $duration)
echo "$ci_test: Successful"
echo -e "> $ci_test: Successful\n" >&2
echo "$ci_test | Pass | $duration" >> $results_file
count=$retries
# return value for run.sh
echo 0
else
duration=$SECONDS
duration=$(get_time_format $duration)
echo "$ci_test: Failed"
echo -e "> $ci_test: Failed\n" >&2
echo "$ci_test | Fail | $duration" >> $results_file
echo "Logs for "$ci_test
# return value for run.sh
echo 1
fi
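
Note on the hunk above: the script now writes its result code to stdout (`echo 0` / `echo 1`) and sends progress messages to stderr, so a caller can capture the result with command substitution. A minimal sketch of that convention follows, assuming a hypothetical `run_one` function in place of the real script, with `true` and `false` standing in for a real test command.

```shell
# Hypothetical stand-in for CI/run_test.sh: human-readable progress goes to
# stderr, and the only thing written to stdout is the result code.
run_one() {
    name=$1
    shift
    echo "test: $name" >&2
    if "$@" >/dev/null 2>&1; then
        echo "> $name: Successful" >&2
        echo 0
    else
        echo "> $name: Failed" >&2
        echo 1
    fi
}

# The caller captures stdout only; stderr still reaches the terminal.
passing=$(run_one demo_pass true)
failing=$(run_one demo_fail false)
echo "demo_pass -> $passing, demo_fail -> $failing"
```

Any stray `echo` to stdout inside the callee would corrupt the captured value, which is presumably why the messages in the hunk were redirected with `>&2`.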

View File

@@ -1,5 +0,0 @@
application_outage: # Scenario to create an outage of an application by blocking traffic
duration: 10 # Duration in seconds after which the routes will be accessible
namespace: openshift-monitoring # Namespace to target - all application routes will go inaccessible if pod selector is empty
pod_selector: {} # Pods to target
block: [Ingress, Egress] # It can be Ingress or Egress or Ingress, Egress

View File

@@ -1,8 +0,0 @@
scenarios:
- name: "kill machine config container"
namespace: "openshift-machine-config-operator"
label_selector: "k8s-app=machine-config-server"
container_name: "hello-openshift"
action: "kill 1"
count: 1
retry_wait: 60

View File

@@ -1,6 +0,0 @@
network_chaos: # Scenario to create an outage by simulating random variations in the network.
duration: 10 # seconds
instance_count: 1
execution: serial
egress:
bandwidth: 100mbit

View File

@@ -1,7 +0,0 @@
scenarios:
- action: delete
namespace: "^$openshift-network-diagnostics$"
label_selector:
runs: 1
sleep: 15
wait_time: 30

View File

@@ -1,34 +0,0 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: litmus
spec:
# It can be true/false
annotationCheck: 'false'
# It can be active/stop
engineState: 'active'
chaosServiceAccount: litmus-sa
monitoring: false
# It can be delete/retain
jobCleanUpPolicy: 'delete'
experiments:
- name: node-cpu-hog
spec:
components:
env:
# set chaos duration (in sec) as desired
- name: TOTAL_CHAOS_DURATION
value: '10'
# Number of cores of node CPU to be consumed
- name: NODE_CPU_CORE
value: '1'
# percentage of total nodes to target
- name: NODES_AFFECTED_PERC
value: '30'
# ENTER THE COMMA SEPARATED TARGET NODES NAME
- name: TARGET_NODES
value: $WORKER_NODE

View File

@@ -1,34 +0,0 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: litmus
spec:
# It can be true/false
annotationCheck: 'false'
# It can be active/stop
engineState: 'active'
chaosServiceAccount: litmus-sa
monitoring: false
# It can be delete/retain
jobCleanUpPolicy: 'delete'
experiments:
- name: node-cpu-hog
spec:
components:
env:
# set chaos duration (in sec) as desired
- name: TOTAL_CHAOS_DURATION
value: '10'
# Number of cores of node CPU to be consumed
- name: NODE_CPU_CORE
value: '1'
# percentage of total nodes to target
- name: NODES_AFFECTED_PERC
value: '30'
# ENTER THE COMMA SEPARATED TARGET NODES NAME
- name: TARGET_NODES
value:

View File

@@ -1,35 +0,0 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: litmus
spec:
# It can be delete/retain
jobCleanUpPolicy: 'retain'
# It can be active/stop
engineState: 'active'
chaosServiceAccount: litmus-sa
experiments:
- name: node-io-stress
spec:
components:
env:
# set chaos duration (in sec) as desired
- name: TOTAL_CHAOS_DURATION
value: '10'
## specify the size as percentage of free space on the file system
- name: FILESYSTEM_UTILIZATION_PERCENTAGE
value: '100'
## Number of core of CPU
- name: CPU
value: '1'
## Total number of workers default value is 4
- name: NUMBER_OF_WORKERS
value: '3'
## enter the comma separated target nodes name
- name: TARGET_NODES
value: $WORKER_NODE

View File

@@ -1,35 +0,0 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: litmus
spec:
# It can be delete/retain
jobCleanUpPolicy: 'retain'
# It can be active/stop
engineState: 'active'
chaosServiceAccount: litmus-sa
experiments:
- name: node-io-stress
spec:
components:
env:
# set chaos duration (in sec) as desired
- name: TOTAL_CHAOS_DURATION
value: '10'
## specify the size as percentage of free space on the file system
- name: FILESYSTEM_UTILIZATION_PERCENTAGE
value: '100'
## Number of core of CPU
- name: CPU
value: '1'
## Total number of workers default value is 4
- name: NUMBER_OF_WORKERS
value: '3'
## enter the comma separated target nodes name
- name: TARGET_NODES
value:

View File

@@ -1,28 +0,0 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: litmus
spec:
# It can be delete/retain
jobCleanUpPolicy: 'retain'
# It can be active/stop
engineState: 'active'
chaosServiceAccount: litmus-sa
experiments:
- name: node-memory-hog
spec:
components:
env:
# set chaos duration (in sec) as desired
- name: TOTAL_CHAOS_DURATION
value: '10'
## Specify the size as percent of total node capacity Ex: '30'
## Note: For consuming memory in mebibytes change the variable to MEMORY_CONSUMPTION_MEBIBYTES
- name: MEMORY_CONSUMPTION_PERCENTAGE
value: '30'
# ENTER THE COMMA SEPARATED TARGET NODES NAME
- name: TARGET_NODES
value: $WORKER_NODE

View File

@@ -1,28 +0,0 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: litmus
spec:
# It can be delete/retain
jobCleanUpPolicy: 'retain'
# It can be active/stop
engineState: 'active'
chaosServiceAccount: litmus-sa
experiments:
- name: node-memory-hog
spec:
components:
env:
# set chaos duration (in sec) as desired
- name: TOTAL_CHAOS_DURATION
value: '10'
## Specify the size as percent of total node capacity Ex: '30'
## Note: For consuming memory in mebibytes change the variable to MEMORY_CONSUMPTION_MEBIBYTES
- name: MEMORY_CONSUMPTION_PERCENTAGE
value: '30'
# ENTER THE COMMA SEPARATED TARGET NODES NAME
- name: TARGET_NODES
value:

View File

@@ -1,5 +0,0 @@
time_scenarios:
- action: skew_time
object_type: pod
label_selector: k8s-app=etcd
container_name: ""

View File

@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
name: container
labels:
scenario: container
spec:
hostNetwork: true
containers:
- name: fedtools
image: docker.io/fedora/tools
command:
- /bin/sh
- -c
- |
sleep infinity

View File

@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
name: outage
labels:
scenario: outage
spec:
hostNetwork: true
containers:
- name: fedtools
image: quay.io/krkn-chaos/krkn:tools
command:
- /bin/sh
- -c
- |
sleep infinity

View File

@@ -0,0 +1,29 @@
apiVersion: v1
kind: Pod
metadata:
name: pod-network-filter-test
labels:
app.kubernetes.io/name: pod-network-filter
spec:
containers:
- name: nginx
image: quay.io/krkn-chaos/krkn-funtests:pod-network-filter
ports:
- containerPort: 5000
name: pod-network-prt
---
apiVersion: v1
kind: Service
metadata:
name: pod-network-filter-service
spec:
selector:
app.kubernetes.io/name: pod-network-filter
type: NodePort
ports:
- name: pod-network-filter-svc
protocol: TCP
port: 80
targetPort: pod-network-prt
nodePort: 30037

View File

@@ -0,0 +1,29 @@
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
app.kubernetes.io/name: proxy
spec:
containers:
- name: nginx
image: nginx:stable
ports:
- containerPort: 80
name: http-web-svc
---
apiVersion: v1
kind: Service
metadata:
name: nginx-service
spec:
selector:
app.kubernetes.io/name: proxy
type: NodePort
ports:
- name: name-of-service-port
protocol: TCP
port: 80
targetPort: http-web-svc
nodePort: 30036

View File

@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
name: time-skew
labels:
scenario: time-skew
spec:
hostNetwork: true
containers:
- name: fedtools
image: quay.io/krkn-chaos/krkn:tools
command:
- /bin/sh
- -c
- |
sleep infinity

View File

@@ -1,18 +1,26 @@
ERRORED=false
function finish {
if [ $? -eq 1 ] && [ $ERRORED != "true" ]
if [ $? != 0 ] && [ $ERRORED != "true" ]
then
error
fi
}
function error {
echo "Error caught."
ERRORED=true
exit_code=$?
if [ $exit_code == 1 ]
then
echo "Error caught."
ERRORED=true
elif [ $exit_code == 2 ]
then
echo "Run with exit code 2 detected, it is expected, wrapping the exit code with 0 to avoid pipeline failure"
exit 0
fi
}
function get_node {
worker_node=$(oc get nodes --no-headers | grep worker | head -n 1)
worker_node=$(kubectl get nodes --no-headers | grep worker | head -n 1)
export WORKER_NODE=$worker_node
}
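
Note on the hunk above: the new `error` function treats exit code 1 as a real failure and exit code 2 as expected, wrapping it to 0 so the pipeline does not fail. The sketch below shows the same idea in isolation. It is a simplification, not the repository's implementation: `run_wrapped` is a hypothetical helper, and it uses a single `EXIT` trap inside a subshell instead of the `ERR` and `EXIT` traps installed by the test scripts.

```shell
# Hypothetical helper: run a command in a subshell and rewrite exit code 2
# to 0, leaving every other exit code unchanged.
run_wrapped() {
    (
        trap 'code=$?; [ "$code" -eq 2 ] && exit 0; exit "$code"' EXIT
        "$@"
        exit $?
    )
}

run_wrapped sh -c 'exit 2'
echo "exit code 2 was wrapped as: $?"

run_wrapped sh -c 'exit 1' || echo "exit code 1 was preserved as: $?"
```

Passing other exit codes through unchanged is an assumption of this sketch; the hunk above only spells out the branches for 1 and 2.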

View File

@@ -0,0 +1 @@

View File

@@ -1 +0,0 @@
test_net_chaos

View File

@@ -7,11 +7,19 @@ trap finish EXIT
function functional_test_app_outage {
export scenario_type="application_outages"
export scenario_file="CI/scenarios/app_outage.yaml"
yq -i '.application_outage.duration=10' scenarios/openshift/app_outage.yaml
yq -i '.application_outage.pod_selector={"scenario":"outage"}' scenarios/openshift/app_outage.yaml
yq -i '.application_outage.namespace="default"' scenarios/openshift/app_outage.yaml
export scenario_type="application_outages_scenarios"
export scenario_file="scenarios/openshift/app_outage.yaml"
export post_config=""
kubectl get services -A
kubectl get pods
envsubst < CI/config/common_test_config.yaml > CI/config/app_outage.yaml
cat $scenario_file
cat CI/config/app_outage.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/app_outage.yaml
echo "App outage scenario test: Success"
}

View File

@@ -1,21 +0,0 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_app_outage {
[ -z $DEPLOYMENT_NAME ] && echo "[ERR] DEPLOYMENT_NAME variable not set, failing." && exit 1
yq -i '.application_outage.pod_selector={"app":"'$DEPLOYMENT_NAME'"}' CI/scenarios/app_outage.yaml
yq -i '.application_outage.namespace="'$NAMESPACE'"' CI/scenarios/app_outage.yaml
export scenario_type="application_outages"
export scenario_file="CI/scenarios/app_outage.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/app_outage.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/app_outage.yaml
echo "App outage scenario test: Success"
}
functional_test_app_outage

View File

@@ -8,14 +8,18 @@ trap finish EXIT
pod_file="CI/scenarios/hello_pod.yaml"
function functional_test_container_crash {
yq -i '.scenarios[0].namespace="default"' scenarios/openshift/container_etcd.yml
yq -i '.scenarios[0].label_selector="scenario=container"' scenarios/openshift/container_etcd.yml
yq -i '.scenarios[0].container_name="fedtools"' scenarios/openshift/container_etcd.yml
export scenario_type="container_scenarios"
export scenario_file="- CI/scenarios/container_scenario.yml"
export scenario_file="scenarios/openshift/container_etcd.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/container_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/container_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/container_config.yaml -d True
echo "Container scenario test: Success"
kubectl get pods -n kube-system -l component=etcd
}
functional_test_container_crash

18
CI/tests/test_cpu_hog.sh Executable file → Normal file
View File

@@ -6,15 +6,15 @@ trap error ERR
trap finish EXIT
function functional_test_litmus_cpu {
function functional_test_cpu_hog {
yq -i '."node-selector"="kubernetes.io/hostname=kind-worker2"' scenarios/kube/cpu-hog.yml
export scenario_type="litmus_scenarios"
export scenario_file="- scenarios/templates/litmus-rbac.yaml"
export post_config="- CI/scenarios/node_cpu_hog_engine_node.yaml"
envsubst < CI/config/common_test_config.yaml > CI/config/litmus_config.yaml
envsubst < CI/scenarios/node_cpu_hog_engine.yaml > CI/scenarios/node_cpu_hog_engine_node.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/cpu-hog.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/cpu_hog.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/cpu_hog.yaml
echo "CPU Hog: Success"
}
functional_test_litmus_cpu
functional_test_cpu_hog

View File

@@ -1,20 +0,0 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_cpu {
[ -z $NODE_NAME ] && echo "[ERR] NODE_NAME variable not set, failing." && exit 1
yq -i ' .spec.experiments = [{"name": "node-cpu-hog", "spec":{"components":{"env":[{"name":"TOTAL_CHAOS_DURATION","value":"10"},{"name":"NODE_CPU_CORE","value":"1"},{"name":"NODES_AFFECTED_PERC","value":"30"},{"name":"TARGET_NODES","value":"'$NODE_NAME'"}]}}}]' CI/scenarios/node_cpu_hog_engine_node.yaml
cp CI/config/common_test_config.yaml CI/config/litmus_config.yaml
yq '.kraken.chaos_scenarios = [{"litmus_scenarios":[["scenarios/openshift/templates/litmus-rbac.yaml","CI/scenarios/node_cpu_hog_engine_node.yaml"]]}]' -i CI/config/litmus_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
}
functional_test_litmus_cpu

18
CI/tests/test_customapp_pod.sh Executable file
View File

@@ -0,0 +1,18 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_customapp_pod_node_selector {
export scenario_type="pod_disruption_scenarios"
export scenario_file="scenarios/openshift/customapp_pod.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/customapp_pod_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/customapp_pod_config.yaml -d True
echo "Pod disruption with node_label_selector test: Success"
}
functional_test_customapp_pod_node_selector

20
CI/tests/test_io_hog.sh Executable file → Normal file
View File

@@ -5,16 +5,16 @@ source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_io_hog {
yq -i '."node-selector"="kubernetes.io/hostname=kind-worker2"' scenarios/kube/io-hog.yml
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/io-hog.yml"
export post_config=""
function functional_test_litmus_io {
export scenario_type="litmus_scenarios"
export scenario_file="- scenarios/templates/litmus-rbac.yaml"
export post_config="- CI/scenarios/node_io_engine_node.yaml"
envsubst < CI/config/common_test_config.yaml > CI/config/litmus_config.yaml
envsubst < CI/scenarios/node_io_engine.yaml > CI/scenarios/node_io_engine_node.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
cat $scenario_file
envsubst < CI/config/common_test_config.yaml > CI/config/io_hog.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/io_hog.yaml
echo "IO Hog: Success"
}
functional_test_litmus_io
functional_test_io_hog

View File

@@ -1,19 +0,0 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_io {
[ -z $NODE_NAME ] && echo "[ERR] NODE_NAME variable not set, failing." && exit 1
yq -i ' .spec.experiments = [{"name": "node-io-stress", "spec":{"components":{"env":[{"name":"TOTAL_CHAOS_DURATION","value":"10"},{"name":"FILESYSTEM_UTILIZATION_PERCENTAGE","value":"100"},{"name":"CPU","value":"1"},{"name":"NUMBER_OF_WORKERS","value":"3"},{"name":"TARGET_NODES","value":"'$NODE_NAME'"}]}}}]' CI/scenarios/node_io_engine_node.yaml
cp CI/config/common_test_config.yaml CI/config/litmus_config.yaml
yq '.kraken.chaos_scenarios = [{"litmus_scenarios":[["scenarios/openshift/templates/litmus-rbac.yaml","CI/scenarios/node_io_engine_node.yaml"]]}]' -i CI/config/litmus_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
}
functional_test_litmus_io

View File

@@ -1,20 +0,0 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_mem {
export scenario_type="litmus_scenarios"
export scenario_file="- scenarios/templates/litmus-rbac.yaml"
export post_config="- CI/scenarios/node_mem_engine_node.yaml"
envsubst < CI/config/common_test_config.yaml > CI/config/litmus_config.yaml
envsubst < CI/scenarios/node_mem_engine.yaml > CI/scenarios/node_mem_engine_node.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario $1 test: Success"
}
functional_test_litmus_mem "- CI/scenarios/node_mem_engine.yaml"

View File

@@ -1,19 +0,0 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_mem {
[ -z $NODE_NAME ] && echo "[ERR] NODE_NAME variable not set, failing." && exit 1
yq -i ' .spec.experiments = [{"name": "node-io-stress", "spec":{"components":{"env":[{"name":"TOTAL_CHAOS_DURATION","value":"10"},{"name":"CPU","value":"1"},{"name":"TARGET_NODES","value":"'$NODE_NAME'"}]}}}]' CI/scenarios/node_mem_engine_node.yaml
cp CI/config/common_test_config.yaml CI/config/litmus_config.yaml
yq '.kraken.chaos_scenarios = [{"litmus_scenarios":[["scenarios/openshift/templates/litmus-rbac.yaml","CI/scenarios/node_mem_engine_node.yaml"]]}]' -i CI/config/litmus_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
}
functional_test_litmus_mem

View File

@@ -0,0 +1,19 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_memory_hog {
yq -i '."node-selector"="kubernetes.io/hostname=kind-worker2"' scenarios/kube/memory-hog.yml
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/memory-hog.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/memory_hog.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/memory_hog.yaml
echo "Memory Hog: Success"
}
functional_test_memory_hog

View File

@@ -6,13 +6,14 @@ trap error ERR
trap finish EXIT
function funtional_test_namespace_deletion {
export scenario_type="namespace_scenarios"
export scenario_file="- CI/scenarios/network_diagnostics_namespace.yaml"
export scenario_type="service_disruption_scenarios"
export scenario_file="scenarios/openshift/ingress_namespace.yaml"
export post_config=""
yq '.scenarios.[0].namespace="^openshift-network-diagnostics$"' -i CI/scenarios/network_diagnostics_namespace.yaml
yq '.scenarios[0].namespace="^namespace-scenario$"' -i scenarios/openshift/ingress_namespace.yaml
yq '.scenarios[0].wait_time=30' -i scenarios/openshift/ingress_namespace.yaml
yq '.scenarios[0].action="delete"' -i scenarios/openshift/ingress_namespace.yaml
envsubst < CI/config/common_test_config.yaml > CI/config/namespace_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/namespace_config.yaml
echo $?
echo "Namespace scenario test: Success"
}

View File

@@ -7,9 +7,16 @@ trap finish EXIT
function functional_test_network_chaos {
yq -i '.network_chaos.duration=10' scenarios/openshift/network_chaos.yaml
yq -i '.network_chaos.node_name="kind-worker2"' scenarios/openshift/network_chaos.yaml
yq -i '.network_chaos.egress.bandwidth="100mbit"' scenarios/openshift/network_chaos.yaml
yq -i 'del(.network_chaos.interfaces)' scenarios/openshift/network_chaos.yaml
yq -i 'del(.network_chaos.label_selector)' scenarios/openshift/network_chaos.yaml
yq -i 'del(.network_chaos.egress.latency)' scenarios/openshift/network_chaos.yaml
yq -i 'del(.network_chaos.egress.loss)' scenarios/openshift/network_chaos.yaml
export scenario_type="network_chaos"
export scenario_file="CI/scenarios/network_chaos.yaml"
export scenario_type="network_chaos_scenarios"
export scenario_file="scenarios/openshift/network_chaos.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/network_chaos.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/network_chaos.yaml

18
CI/tests/test_node.sh Executable file
View File

@@ -0,0 +1,18 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_node_stop_start {
export scenario_type="node_scenarios"
export scenario_file="scenarios/kind/node_scenarios_example.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/node_config.yaml
cat CI/config/node_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/node_config.yaml
echo "Node Stop/Start scenario test: Success"
}
functional_test_node_stop_start

20
CI/tests/test_pod.sh Executable file
View File

@@ -0,0 +1,20 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_pod_crash {
export scenario_type="pod_disruption_scenarios"
export scenario_file="scenarios/kind/pod_etcd.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml
echo "Pod disruption scenario test: Success"
date
kubectl get pods -n kube-system -l component=etcd -o yaml
}
functional_test_pod_crash

28
CI/tests/test_pod_error.sh Executable file
View File

@@ -0,0 +1,28 @@
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_pod_error {
export scenario_type="pod_disruption_scenarios"
export scenario_file="scenarios/kind/pod_etcd.yml"
export post_config=""
yq -i '.[0].config.kill=5' scenarios/kind/pod_etcd.yml
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
cat CI/config/pod_config.yaml
cat scenarios/kind/pod_etcd.yml
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml
ret=$?
echo -e "\n\nret $ret"
if [[ $ret -ge 1 ]]; then
echo "Pod disruption error scenario test: Success"
else
echo "Pod disruption error scenario test: Failure"
exit 1
fi
}
functional_test_pod_error
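
Note on the file above: this is a negative test, so it passes only when the run exits with a non-zero code. Unlike the other test scripts it does not start with `set -xeEo pipefail`, so the shell keeps going after the failing command and `ret=$?` can be inspected. A minimal sketch of the same check as a reusable function follows, assuming a hypothetical helper named `expect_failure`. The `|| ret=$?` form also works when `set -e` is active.

```shell
# Hypothetical helper: succeed only if the wrapped command fails.
expect_failure() {
    ret=0
    "$@" || ret=$?
    if [ "$ret" -ge 1 ]; then
        echo "expected failure observed (exit code $ret)"
        return 0
    fi
    echo "command unexpectedly succeeded"
    return 1
}

expect_failure sh -c 'exit 3'
expect_failure true || echo "negative test failed as intended"
```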

View File

@@ -0,0 +1,62 @@
function functional_pod_network_filter {
export SERVICE_URL="http://localhost:8889"
export scenario_type="network_chaos_ng_scenarios"
export scenario_file="scenarios/kube/pod-network-filter.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/pod_network_filter.yaml
yq -i '.[0].test_duration=10' scenarios/kube/pod-network-filter.yml
yq -i '.[0].label_selector=""' scenarios/kube/pod-network-filter.yml
yq -i '.[0].ingress=false' scenarios/kube/pod-network-filter.yml
yq -i '.[0].egress=true' scenarios/kube/pod-network-filter.yml
yq -i '.[0].target="pod-network-filter-test"' scenarios/kube/pod-network-filter.yml
yq -i '.[0].protocols=["tcp"]' scenarios/kube/pod-network-filter.yml
yq -i '.[0].ports=[443]' scenarios/kube/pod-network-filter.yml
yq -i '.performance_monitoring.check_critical_alerts=False' CI/config/pod_network_filter.yaml
## Test webservice deployment
kubectl apply -f ./CI/templates/pod_network_filter.yaml
COUNTER=0
while true
do
curl $SERVICE_URL
EXITSTATUS=$?
if [ "$EXITSTATUS" -eq "0" ]
then
break
fi
sleep 1
COUNTER=$((COUNTER+1))
[ $COUNTER -eq "100" ] && echo "maximum number of retries reached, test failed" && exit 1
done
cat scenarios/kube/pod-network-filter.yml
python3 -m coverage run -a run_kraken.py -c CI/config/pod_network_filter.yaml > krkn_pod_network.out 2>&1 &
PID=$!
# wait until DNS resolution starts failing and the service returns 404
DNS_FAILURE_STATUS=0
while true
do
OUT_STATUS_CODE=$(curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL)
if [ "$OUT_STATUS_CODE" -eq "404" ]
then
DNS_FAILURE_STATUS=404
fi
if [ "$DNS_FAILURE_STATUS" -eq "404" ] && [ "$OUT_STATUS_CODE" -eq "200" ]
then
echo "service restored"
break
fi
COUNTER=$((COUNTER+1))
[ $COUNTER -eq "100" ] && echo "maximum number of retries reached, test failed" && exit 1
sleep 2
done
wait $PID
}
functional_pod_network_filter
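
Note on the file above: both `while true` loops poll the service and give up after a fixed number of attempts. The sketch below factors that bounded polling into one function. `retry_until` and `succeed_on_third_call` are hypothetical names, and the demo uses a counter instead of `curl` so that it runs without a cluster.

```shell
# Hypothetical helper: run a command until it succeeds, or give up after
# max_attempts tries, sleeping `delay` seconds between tries.
retry_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "maximum number of retries reached" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Demo probe: fails twice, then succeeds on the third call.
calls=0
succeed_on_third_call() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

retry_until 5 0 succeed_on_third_call && echo "ready after $calls calls"
```

Against a real service the probe could be something like `retry_until 100 1 curl -s -o /dev/null "$SERVICE_URL"`, which matches the limit of 100 attempts used in the test.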

35
CI/tests/test_pod_server.sh Executable file
View File

@@ -0,0 +1,35 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_pod_server {
export scenario_type="pod_disruption_scenarios"
export scenario_file="scenarios/kind/pod_etcd.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
yq -i '.[0].config.kill=1' scenarios/kind/pod_etcd.yml
yq -i '.tunings.daemon_mode=True' CI/config/pod_config.yaml
cat CI/config/pod_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml &
sleep 15
curl -X POST http://0.0.0.0:8081/STOP
wait
yq -i '.kraken.signal_state="PAUSE"' CI/config/pod_config.yaml
yq -i '.tunings.daemon_mode=False' CI/config/pod_config.yaml
cat CI/config/pod_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml &
sleep 5
curl -X POST http://0.0.0.0:8081/RUN
wait
echo "Pod disruption with server scenario test: Success"
}
functional_test_pod_server

18
CI/tests/test_pvc.sh Executable file
View File

@@ -0,0 +1,18 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_pvc_fill {
export scenario_type="pvc_scenarios"
export scenario_file="scenarios/kind/pvc_scenario.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/pvc_config.yaml
cat CI/config/pvc_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/pvc_config.yaml --debug True
echo "PVC Fill scenario test: Success"
}
functional_test_pvc_fill

View File

@@ -0,0 +1,119 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
# port mapping has been configured in kind-config.yml
SERVICE_URL=http://localhost:8888
PAYLOAD_GET_1="{ \
\"status\":\"internal server error\" \
}"
STATUS_CODE_GET_1=500
PAYLOAD_PATCH_1="resource patched"
STATUS_CODE_PATCH_1=201
PAYLOAD_POST_1="{ \
\"status\": \"unauthorized\" \
}"
STATUS_CODE_POST_1=401
PAYLOAD_GET_2="{ \
\"status\":\"resource created\" \
}"
STATUS_CODE_GET_2=201
PAYLOAD_PATCH_2="bad request"
STATUS_CODE_PATCH_2=400
PAYLOAD_POST_2="not found"
STATUS_CODE_POST_2=404
JSON_MIME="application/json"
TEXT_MIME="text/plain; charset=utf-8"
function functional_test_service_hijacking {
export scenario_type="service_hijacking_scenarios"
export scenario_file="scenarios/kube/service_hijacking.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/service_hijacking.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/service_hijacking.yaml > /tmp/krkn.log 2>&1 &
PID=$!
# Waiting for the hijacking to take effect
COUNTER=0
while [ `curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php` == 404 ]
do
echo "waiting scenario to kick in."
sleep 1
COUNTER=$((COUNTER+1))
[ $COUNTER -eq "100" ] && echo "maximum number of retries reached, test failed" && exit 1
done
#Checking Step 1 GET on /list/index.php
OUT_GET="`curl -X GET -s $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X GET -s -o /dev/null -I -w "%{content_type}" $SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php`
[ "${PAYLOAD_GET_1//[$'\t\r\n ']}" == "${OUT_GET//[$'\t\r\n ']}" ] && echo "Step 1 GET Payload OK" || (echo "Payload did not match. Test failed." && exit 1)
[ "$OUT_STATUS_CODE" == "$STATUS_CODE_GET_1" ] && echo "Step 1 GET Status Code OK" || (echo " Step 1 GET status code did not match. Test failed." && exit 1)
[ "$OUT_CONTENT" == "$JSON_MIME" ] && echo "Step 1 GET MIME OK" || (echo " Step 1 GET MIME did not match. Test failed." && exit 1)
#Checking Step 1 POST on /list/index.php
OUT_POST="`curl -s -X POST $SERVICE_URL/list/index.php`"
OUT_STATUS_CODE=`curl -X POST -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php`
OUT_CONTENT=`curl -X POST -s -o /dev/null -I -w "%{content_type}" $SERVICE_URL/list/index.php`
[ "${PAYLOAD_POST_1//[$'\t\r\n ']}" == "${OUT_POST//[$'\t\r\n ']}" ] && echo "Step 1 POST Payload OK" || (echo "Payload did not match. Test failed." && exit 1)
[ "$OUT_STATUS_CODE" == "$STATUS_CODE_POST_1" ] && echo "Step 1 POST Status Code OK" || (echo "Step 1 POST status code did not match. Test failed." && exit 1)
[ "$OUT_CONTENT" == "$JSON_MIME" ] && echo "Step 1 POST MIME OK" || (echo " Step 1 POST MIME did not match. Test failed." && exit 1)
#Checking Step 1 PATCH on /patch
OUT_PATCH="`curl -s -X PATCH $SERVICE_URL/patch`"
OUT_STATUS_CODE=`curl -X PATCH -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/patch`
OUT_CONTENT=`curl -X PATCH -s -o /dev/null -I -w "%{content_type}" $SERVICE_URL/patch`
[ "${PAYLOAD_PATCH_1//[$'\t\r\n ']}" == "${OUT_PATCH//[$'\t\r\n ']}" ] && echo "Step 1 PATCH Payload OK" || (echo "Payload did not match. Test failed." && exit 1)
[ "$OUT_STATUS_CODE" == "$STATUS_CODE_PATCH_1" ] && echo "Step 1 PATCH Status Code OK" || (echo "Step 1 PATCH status code did not match. Test failed." && exit 1)
[ "$OUT_CONTENT" == "$TEXT_MIME" ] && echo "Step 1 PATCH MIME OK" || (echo " Step 1 PATCH MIME did not match. Test failed." && exit 1)
# wait for the next step
sleep 16
#Checking Step 2 GET on /list/index.php
OUT_GET="`curl -X GET -s $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X GET -s -o /dev/null -I -w "%{content_type}" $SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php`
[ "${PAYLOAD_GET_2//[$'\t\r\n ']}" == "${OUT_GET//[$'\t\r\n ']}" ] && echo "Step 2 GET Payload OK" || (echo "Step 2 GET Payload did not match. Test failed." && exit 1)
[ "$OUT_STATUS_CODE" == "$STATUS_CODE_GET_2" ] && echo "Step 2 GET Status Code OK" || (echo "Step 2 GET status code did not match. Test failed." && exit 1)
[ "$OUT_CONTENT" == "$JSON_MIME" ] && echo "Step 2 GET MIME OK" || (echo " Step 2 GET MIME did not match. Test failed." && exit 1)
#Checking Step 2 POST on /list/index.php
OUT_POST="`curl -s -X POST $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X POST -s -o /dev/null -I -w "%{content_type}" $SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X POST -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php`
[ "${PAYLOAD_POST_2//[$'\t\r\n ']}" == "${OUT_POST//[$'\t\r\n ']}" ] && echo "Step 2 POST Payload OK" || (echo "Step 2 POST Payload did not match. Test failed." && exit 1)
[ "$OUT_STATUS_CODE" == "$STATUS_CODE_POST_2" ] && echo "Step 2 POST Status Code OK" || (echo "Step 2 POST status code did not match. Test failed." && exit 1)
[ "$OUT_CONTENT" == "$TEXT_MIME" ] && echo "Step 2 POST MIME OK" || (echo " Step 2 POST MIME did not match. Test failed." && exit 1)
#Checking Step 2 PATCH on /patch
OUT_PATCH="`curl -s -X PATCH $SERVICE_URL/patch`"
OUT_CONTENT=`curl -X PATCH -s -o /dev/null -I -w "%{content_type}" $SERVICE_URL/patch`
OUT_STATUS_CODE=`curl -X PATCH -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/patch`
[ "${PAYLOAD_PATCH_2//[$'\t\r\n ']}" == "${OUT_PATCH//[$'\t\r\n ']}" ] && echo "Step 2 PATCH Payload OK" || (echo "Step 2 PATCH Payload did not match. Test failed." && exit 1)
[ "$OUT_STATUS_CODE" == "$STATUS_CODE_PATCH_2" ] && echo "Step 2 PATCH Status Code OK" || (echo "Step 2 PATCH status code did not match. Test failed." && exit 1)
[ "$OUT_CONTENT" == "$TEXT_MIME" ] && echo "Step 2 PATCH MIME OK" || (echo " Step 2 PATCH MIME did not match. Test failed." && exit 1)
wait $PID
cat /tmp/krkn.log
# Now check that the service has been restored and nginx responds correctly
curl -s $SERVICE_URL | grep nginx! && echo "BODY: Service restored!" || (echo "BODY: failed to restore service" && exit 1)
OUT_STATUS_CODE=`curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL`
[ "$OUT_STATUS_CODE" == "200" ] && echo "STATUS_CODE: Service restored!" || (echo "STATUS_CODE: failed to restore service" && exit 1)
echo "Service Hijacking Chaos test: Success"
}
functional_test_service_hijacking

@@ -0,0 +1,37 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_telemetry {
AWS_CLI=`which aws`
[ -z "$AWS_CLI" ] && echo "AWS cli not found in path" && exit 1
[ -z "$AWS_BUCKET" ] && echo "AWS bucket not set in environment" && exit 1
export RUN_TAG="funtest-telemetry"
yq -i '.telemetry.enabled=True' CI/config/common_test_config.yaml
yq -i '.telemetry.full_prometheus_backup=True' CI/config/common_test_config.yaml
yq -i '.performance_monitoring.check_critical_alerts=True' CI/config/common_test_config.yaml
yq -i '.performance_monitoring.prometheus_url="http://localhost:9090"' CI/config/common_test_config.yaml
yq -i '.telemetry.run_tag=env(RUN_TAG)' CI/config/common_test_config.yaml
export scenario_type="pod_disruption_scenarios"
export scenario_file="scenarios/kind/pod_etcd.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/telemetry.yaml
retval=$(python3 -m coverage run -a run_kraken.py -c CI/config/telemetry.yaml)
RUN_FOLDER=`cat CI/out/test_telemetry.out | grep amazonaws.com | sed -rn "s#.*https:\/\/.*\/files/(.*)#\1#p"`
$AWS_CLI s3 ls "s3://$AWS_BUCKET/$RUN_FOLDER/" | awk '{ print $4 }' > s3_remote_files
echo "checking if telemetry files are uploaded on s3"
cat s3_remote_files | grep critical-alerts-00.log || ( echo "FAILED: critical-alerts-00.log not uploaded" && exit 1 )
cat s3_remote_files | grep prometheus-00.tar || ( echo "FAILED: prometheus backup not uploaded" && exit 1 )
cat s3_remote_files | grep telemetry.json || ( echo "FAILED: telemetry.json not uploaded" && exit 1 )
echo "all files uploaded!"
echo "Telemetry Collection: Success"
}
functional_test_telemetry

@@ -7,8 +7,12 @@ trap finish EXIT
function functional_test_time_scenario {
yq -i '.time_scenarios[0].label_selector="scenario=time-skew"' scenarios/openshift/time_scenarios_example.yml
yq -i '.time_scenarios[0].container_name=""' scenarios/openshift/time_scenarios_example.yml
yq -i '.time_scenarios[0].namespace="default"' scenarios/openshift/time_scenarios_example.yml
yq -i '.time_scenarios[1].label_selector="kubernetes.io/hostname=kind-worker2"' scenarios/openshift/time_scenarios_example.yml
export scenario_type="time_scenarios"
export scenario_file="CI/scenarios/time_scenarios.yml"
export scenario_file="scenarios/openshift/time_scenarios_example.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/time_config.yaml

273
CLAUDE.md Normal file
@@ -0,0 +1,273 @@
# CLAUDE.md - Krkn Chaos Engineering Framework
## Project Overview
Krkn (Kraken) is a chaos engineering tool for Kubernetes/OpenShift clusters. It injects deliberate failures to validate cluster resilience. Plugin-based architecture with multi-cloud support (AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack).
## Repository Structure
```
krkn/
├── krkn/
│ ├── scenario_plugins/ # Chaos scenario plugins (pod, node, network, hogs, etc.)
│ ├── utils/ # Utility functions
│ ├── rollback/ # Rollback management
│ ├── prometheus/ # Prometheus integration
│ └── cerberus/ # Health monitoring
├── tests/ # Unit tests (unittest framework)
├── scenarios/ # Example scenario configs (openshift/, kube/, kind/)
├── config/ # Configuration files
└── CI/ # CI/CD test scripts
```
## Quick Start
```bash
# Setup (ALWAYS use virtual environment)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Run Krkn
python run_kraken.py --config config/config.yaml
# Note: Scenarios are specified in config.yaml under kraken.chaos_scenarios
# There is no --scenario flag; edit config/config.yaml to select scenarios
# Run tests
python -m unittest discover -s tests -v
python -m coverage run -a -m unittest discover -s tests -v
```
## Critical Requirements
### Python Environment
- **Python 3.9+** required
- **NEVER install packages globally** - always use virtual environment
- **CRITICAL**: `docker` must be <7.0 and `requests` must be <2.32 (Unix socket compatibility)
### Key Dependencies
- **krkn-lib** (5.1.13): Core library for Kubernetes/OpenShift operations
- **kubernetes** (34.1.0): Kubernetes Python client
- **docker** (<7.0), **requests** (<2.32): DO NOT upgrade without verifying compatibility
- Cloud SDKs: boto3 (AWS), azure-mgmt-* (Azure), google-cloud-compute (GCP), ibm_vpc (IBM), pyVmomi (VMware)
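The two upper bounds listed above can be checked mechanically before an upgrade. The following is a minimal sketch and not part of Krkn: `parse_version`, `UPPER_BOUNDS`, and `violates_constraint` are hypothetical names, and the parser assumes plain dotted numeric versions such as `6.1.3`.

```python
# Hypothetical helper, not part of Krkn: flags versions that break the
# "docker <7.0" and "requests <2.32" constraints stated above.

def parse_version(text):
    """Turn '2.31.0' into (2, 31, 0); anything after the digits is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# Exclusive upper bounds copied from the constraints above.
UPPER_BOUNDS = {"docker": (7, 0), "requests": (2, 32)}


def violates_constraint(package, installed_version):
    """Return True when the version is at or above the package's upper bound."""
    bound = UPPER_BOUNDS.get(package)
    if bound is None:
        return False  # no constraint recorded for this package
    version = parse_version(installed_version)
    # Pad short versions so that "7" is compared as (7, 0).
    version += (0,) * (len(bound) - len(version))
    return version >= bound
```

A call such as `violates_constraint("requests", "2.31.0")` would be expected to return `False`, while any `docker` version starting at `7.0` would be flagged.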
## Plugin Architecture (CRITICAL)
**Strictly enforced naming conventions:**
### Naming Rules
- **Module files**: Must end with `_scenario_plugin.py` and use snake_case
  - Example: `pod_disruption_scenario_plugin.py`
- **Class names**: Must be CamelCase and end with `ScenarioPlugin`
  - Example: `PodDisruptionScenarioPlugin`
  - Must match module filename (snake_case ↔ CamelCase)
- **Directory structure**: Plugin dirs CANNOT contain "scenario" or "plugin"
  - Location: `krkn/scenario_plugins/<plugin_name>/`
### Plugin Implementation
Every plugin MUST:
1. Extend `AbstractScenarioPlugin`
2. Implement `run()` method
3. Implement `get_scenario_types()` method
```python
from krkn.scenario_plugins import AbstractScenarioPlugin

class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
    def run(self, config, scenarios_list, kubeconfig_path, wait_duration):
        pass

    def get_scenario_types(self):
        return ["pod_scenarios", "pod_outage"]
```
### Creating a New Plugin
1. Create directory: `krkn/scenario_plugins/<plugin_name>/`
2. Create module: `<plugin_name>_scenario_plugin.py`
3. Create class: `<PluginName>ScenarioPlugin` extending `AbstractScenarioPlugin`
4. Implement `run()` and `get_scenario_types()`
5. Create unit test: `tests/test_<plugin_name>_scenario_plugin.py`
6. Add example scenario: `scenarios/<platform>/<scenario>.yaml`
**DO NOT**: Violate naming conventions (factory will reject), include "scenario"/"plugin" in directory names, create plugins without tests.
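The naming rules above amount to a mechanical mapping from module filename to class name. The sketch below illustrates that mapping under stated assumptions: `expected_class_name` and `is_valid_plugin_dir` are hypothetical helpers written for this example, not the plugin factory's actual validation code.

```python
import re

# Hypothetical helpers illustrating the naming rules; the real factory
# may validate names differently.
MODULE_SUFFIX = "_scenario_plugin"


def expected_class_name(module_filename):
    """Map 'pod_disruption_scenario_plugin.py' to 'PodDisruptionScenarioPlugin'."""
    stem = module_filename[:-3] if module_filename.endswith(".py") else module_filename
    if not stem.endswith(MODULE_SUFFIX):
        raise ValueError(f"module name must end with {MODULE_SUFFIX}.py: {module_filename}")
    if not re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)*", stem):
        raise ValueError(f"module name must be snake_case: {module_filename}")
    # Capitalize each snake_case word and join them into CamelCase.
    return "".join(word.capitalize() for word in stem.split("_"))


def is_valid_plugin_dir(dir_name):
    """Plugin directories may not contain 'scenario' or 'plugin'."""
    lowered = dir_name.lower()
    return "scenario" not in lowered and "plugin" not in lowered
```

Running such a check in a unit test can catch a mismatched filename or class name before the plugin is loaded.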
## Testing
### Unit Tests
```bash
# Run all tests
python -m unittest discover -s tests -v
# Specific test
python -m unittest tests.test_pod_disruption_scenario_plugin
# With coverage
python -m coverage run -a -m unittest discover -s tests -v
python -m coverage html
```
**Test requirements:**
- Naming: `test_<module>_scenario_plugin.py`
- Mock external dependencies (Kubernetes API, cloud providers)
- Test success, failure, and edge cases
- Keep tests isolated and independent
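The test requirements above can be sketched with `unittest` and `unittest.mock`. This example is self-contained and rests on assumptions: `ExamplePodScenarioPlugin` and the client methods `list_pods` and `delete_pod` are hypothetical stand-ins, not Krkn or krkn-lib APIs. It covers one success case, one failure case, and one edge case, each with a fresh mock.

```python
import unittest
from unittest.mock import MagicMock


class ExamplePodScenarioPlugin:
    """Hypothetical stand-in for a real plugin; not part of Krkn."""

    def __init__(self, kube_client):
        self.kube_client = kube_client

    def run(self, namespace):
        try:
            pods = self.kube_client.list_pods(namespace)
        except Exception:
            return 1  # API error counts as a scenario failure
        if not pods:
            return 1  # nothing to disrupt
        self.kube_client.delete_pod(pods[0], namespace)
        return 0


class TestExamplePodScenarioPlugin(unittest.TestCase):
    def setUp(self):
        # A fresh mock per test keeps the tests isolated and independent.
        self.client = MagicMock()
        self.plugin = ExamplePodScenarioPlugin(self.client)

    def test_run_success(self):
        self.client.list_pods.return_value = ["pod-a", "pod-b"]
        self.assertEqual(self.plugin.run("default"), 0)
        self.client.delete_pod.assert_called_once_with("pod-a", "default")

    def test_run_api_failure(self):
        self.client.list_pods.side_effect = RuntimeError("API unavailable")
        self.assertEqual(self.plugin.run("default"), 1)
        self.client.delete_pod.assert_not_called()

    def test_run_no_pods(self):
        self.client.list_pods.return_value = []
        self.assertEqual(self.plugin.run("default"), 1)
```

Saved under `tests/`, a file like this would be picked up by `python -m unittest discover -s tests -v`.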
### Functional Tests
Located in `CI/tests/`. Can be run locally on a kind cluster with Prometheus and Elasticsearch set up.
**Setup for local testing:**
1. Deploy Prometheus and Elasticsearch on your kind cluster:
- Prometheus setup: https://krkn-chaos.dev/docs/developers-guide/testing-changes/#prometheus
- Elasticsearch setup: https://krkn-chaos.dev/docs/developers-guide/testing-changes/#elasticsearch
2. Or disable monitoring features in `config/config.yaml`:
```yaml
performance_monitoring:
  enable_alerts: False
  enable_metrics: False
  check_critical_alerts: False
```
**Note:** Functional tests run automatically in CI with full monitoring enabled.
## Cloud Provider Implementations
Node chaos scenarios are cloud-specific. Each in `krkn/scenario_plugins/node_actions/<provider>_node_scenarios.py`:
- AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack, Bare Metal
Implement: stop, start, reboot, terminate instances.
**When modifying**: Maintain consistency with other providers, handle API errors, add logging, update tests.
### Adding Cloud Provider Support
1. Create: `krkn/scenario_plugins/node_actions/<provider>_node_scenarios.py`
2. Extend: `abstract_node_scenarios.AbstractNodeScenarios`
3. Implement: `stop_instances`, `start_instances`, `reboot_instances`, `terminate_instances`
4. Add SDK to `requirements.txt`
5. Create unit test with mocked SDK
6. Add example scenario: `scenarios/openshift/<provider>_node_scenarios.yml`
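The steps above can be sketched as follows. This is an illustration under assumptions rather than Krkn's real code: the `AbstractNodeScenarios` class here is a simplified local stand-in for the one in `abstract_node_scenarios`, and `ExampleCloudNodeScenarios` with its `sdk_client` (exposing `stop`, `start`, `reboot`, `terminate`) is hypothetical. The four method names are taken from the list above.

```python
import logging
from abc import ABC, abstractmethod


class AbstractNodeScenarios(ABC):
    """Simplified local stand-in for the real abstract base class."""

    @abstractmethod
    def stop_instances(self, instance_ids): ...

    @abstractmethod
    def start_instances(self, instance_ids): ...

    @abstractmethod
    def reboot_instances(self, instance_ids): ...

    @abstractmethod
    def terminate_instances(self, instance_ids): ...


class ExampleCloudNodeScenarios(AbstractNodeScenarios):
    """Hypothetical provider; sdk_client stands in for a cloud SDK client."""

    def __init__(self, sdk_client):
        self.sdk = sdk_client

    def _call(self, action, instance_ids):
        # Shared wrapper: handle API errors and log every action.
        try:
            getattr(self.sdk, action)(instance_ids)
        except Exception as exc:
            logging.error("%s failed for %s: %s", action, instance_ids, exc)
            return 1
        logging.info("%s succeeded for %s", action, instance_ids)
        return 0

    def stop_instances(self, instance_ids):
        return self._call("stop", instance_ids)

    def start_instances(self, instance_ids):
        return self._call("start", instance_ids)

    def reboot_instances(self, instance_ids):
        return self._call("reboot", instance_ids)

    def terminate_instances(self, instance_ids):
        return self._call("terminate", instance_ids)
```

Routing all four actions through one wrapper is one way to keep error handling and logging consistent across providers; a unit test can pass a mocked SDK client in place of the real one.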
## Configuration
**Main config**: `config/config.yaml`
- `kraken`: Core settings
- `cerberus`: Health monitoring
- `performance_monitoring`: Prometheus
- `elastic`: Elasticsearch telemetry
**Scenario configs**: `scenarios/` directory
```yaml
- config:
    scenario_type: <type> # Must match plugin's get_scenario_types()
```
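The link between `scenario_type` and `get_scenario_types()` can be pictured as a lookup table. The sketch below is an assumption about how such matching could work, not the loader Krkn actually uses: `PodPlugin`, `TimePlugin`, `build_registry`, and `find_plugin` are hypothetical names.

```python
# Hypothetical plugins exposing only the method relevant to the lookup.
class PodPlugin:
    def get_scenario_types(self):
        return ["pod_scenarios", "pod_outage"]


class TimePlugin:
    def get_scenario_types(self):
        return ["time_scenarios"]


def build_registry(plugins):
    """Map every declared scenario type to the plugin that handles it."""
    registry = {}
    for plugin in plugins:
        for scenario_type in plugin.get_scenario_types():
            if scenario_type in registry:
                raise ValueError(f"duplicate scenario type: {scenario_type}")
            registry[scenario_type] = plugin
    return registry


def find_plugin(registry, scenario_type):
    """Return the plugin for a configured scenario_type or fail loudly."""
    try:
        return registry[scenario_type]
    except KeyError:
        raise ValueError(f"no plugin handles scenario_type: {scenario_type}") from None
```

Under this model, a `scenario_type` that no plugin declares is rejected at lookup time instead of being silently skipped.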
## Code Style
- **Import order**: Standard library, third-party, local imports
- **Naming**: snake_case (functions/variables), CamelCase (classes)
- **Logging**: Use Python's `logging` module
- **Error handling**: Return appropriate exit codes
- **Docstrings**: Required for public functions/classes
## Exit Codes
Krkn uses specific exit codes to communicate execution status:
- `0`: Success - all scenarios passed, no critical alerts
- `1`: Scenario failure - one or more scenarios failed
- `2`: Critical alerts fired during execution
- `3+`: Health check failure (Cerberus monitoring detected issues)
**When implementing scenarios:**
- Return `0` on success
- Return `1` on scenario-specific failures
- Propagate health check failures appropriately
- Log exit code reasons clearly
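A wrapper script or CI job can translate these codes into readable messages. The following is a minimal sketch that follows the table above; `describe_exit_code` is a hypothetical helper and is not provided by Krkn.

```python
def describe_exit_code(code):
    """Translate a Krkn exit code into the meaning listed above."""
    if code == 0:
        return "success"
    if code == 1:
        return "scenario failure"
    if code == 2:
        return "critical alerts fired"
    if code >= 3:
        return "health check failure"
    raise ValueError(f"unexpected exit code: {code}")
```

A caller could, for example, run `run_kraken.py` through `subprocess.run` and pass the resulting `returncode` to this function when logging the outcome.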
## Container Support
Krkn can run inside a container. See `containers/` directory.
**Building custom image:**
```bash
cd containers
./compile_dockerfile.sh # Generates Dockerfile from template
docker build -t krkn:latest .
```
**Running containerized:**
```bash
docker run -v ~/.kube:/root/.kube:Z \
-v $(pwd)/config:/config:Z \
-v $(pwd)/scenarios:/scenarios:Z \
krkn:latest
```
## Git Workflow
- **NEVER commit directly to main**
- **NEVER use `--force` without approval**
- **ALWAYS create feature branches**: `git checkout -b feature/description`
- **ALWAYS run tests before pushing**
**Conventional commits**: `feat:`, `fix:`, `test:`, `docs:`, `refactor:`
```bash
git checkout main && git pull origin main
git checkout -b feature/your-feature-name
# Make changes, write tests
python -m unittest discover -s tests -v
git add <specific-files>
git commit -m "feat: description"
git push -u origin feature/your-feature-name
```
## Environment Variables
- `KUBECONFIG`: Path to kubeconfig
- `AWS_*`, `AZURE_*`, `GOOGLE_APPLICATION_CREDENTIALS`: Cloud credentials
- `PROMETHEUS_URL`, `ELASTIC_URL`, `ELASTIC_PASSWORD`: Monitoring config
**NEVER commit credentials or API keys.**
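Reading these settings from the environment at run time keeps them out of the repository. The sketch below shows one way to fail early when something is unset; `missing_variables` is a hypothetical helper, and which variables count as required is an assumption that depends on the scenario being run.

```python
import os


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

For instance, `missing_variables(["KUBECONFIG", "PROMETHEUS_URL"])` could be called before a run and its result logged, so that a missing value is reported by name and never printed.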
## Common Pitfalls
1. Missing virtual environment - always activate venv
2. Running functional tests without cluster setup
3. Ignoring exit codes
4. Modifying krkn-lib directly (it's a separate package)
5. Upgrading docker/requests beyond version constraints
## Before Writing Code
1. Check for existing implementations
2. Review existing plugins as examples
3. Maintain consistency with cloud provider patterns
4. Plan rollback logic
5. Write tests alongside code
6. Update documentation
## When Adding Dependencies
1. Check if functionality exists in krkn-lib or current dependencies
2. Verify compatibility with existing versions
3. Pin specific versions in `requirements.txt`
4. Check for security vulnerabilities
5. Test thoroughly for conflicts
## Common Development Tasks
### Modifying Existing Plugin
1. Read plugin code and corresponding test
2. Make changes
3. Update/add unit tests
4. Run: `python -m unittest tests.test_<plugin>_scenario_plugin`
### Writing Unit Tests
1. Create: `tests/test_<module>_scenario_plugin.py`
2. Import `unittest` and plugin class
3. Mock external dependencies
4. Test success, failure, and edge cases
5. Run: `python -m unittest tests.test_<module>_scenario_plugin`

@@ -1,21 +1,50 @@
# Contributor Covenant Code of Conduct
## CNCF Community Code of Conduct v1.3
## Our Pledge
Other languages available:
- [Arabic/العربية](code-of-conduct-languages/ar.md)
- [Bulgarian/Български](code-of-conduct-languages/bg.md)
- [Chinese/中文](code-of-conduct-languages/zh.md)
- [Czech/Česky](code-of-conduct-languages/cs.md)
- [Farsi/فارسی](code-of-conduct-languages/fa.md)
- [French/Français](code-of-conduct-languages/fr.md)
- [German/Deutsch](code-of-conduct-languages/de.md)
- [Hindi/हिन्दी](code-of-conduct-languages/hi.md)
- [Indonesian/Bahasa Indonesia](code-of-conduct-languages/id.md)
- [Italian/Italiano](code-of-conduct-languages/it.md)
- [Japanese/日本語](code-of-conduct-languages/jp.md)
- [Korean/한국어](code-of-conduct-languages/ko.md)
- [Polish/Polski](code-of-conduct-languages/pl.md)
- [Portuguese/Português](code-of-conduct-languages/pt.md)
- [Russian/Русский](code-of-conduct-languages/ru.md)
- [Spanish/Español](code-of-conduct-languages/es.md)
- [Turkish/Türkçe](code-of-conduct-languages/tr.md)
- [Ukrainian/Українська](code-of-conduct-languages/uk.md)
- [Vietnamese/Tiếng Việt](code-of-conduct-languages/vi.md)
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
### Community Code of Conduct
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
As contributors, maintainers, and participants in the CNCF community, and in the interest of fostering
an open and welcoming community, we pledge to respect all people who participate or contribute
through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, attending conferences or events, or engaging in other community or project activities.
We are committed to making participation in the CNCF community a harassment-free experience for everyone, regardless of age, body size, caste, disability, ethnicity, level of experience, family status, gender, gender identity and expression, marital status, military or veteran status, nationality, personal appearance, race, religion, sexual orientation, socioeconomic status, tribe, or any other dimension of diversity.
## Scope
This code of conduct applies:
* within project and community spaces,
* in other spaces when an individual CNCF community participant's words or actions are directed at or are about a CNCF project, the CNCF community, or another CNCF community participant.
### CNCF Events
CNCF events that are produced by the Linux Foundation with professional events staff are governed by the Linux Foundation [Events Code of Conduct](https://events.linuxfoundation.org/code-of-conduct/) available on the event page. This is designed to be used in conjunction with the CNCF Code of Conduct.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
The CNCF Community is open, inclusive and respectful. Every member of our community has the right to have their identity respected.
Examples of behavior that contributes to a positive environment include but are not limited to:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
@@ -24,104 +53,52 @@ community include:
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
* Using welcoming and inclusive language
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
Examples of unacceptable behavior include but are not limited to:
* The use of sexualized language or imagery
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Public or private harassment in any form
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Violence, threatening violence, or encouraging others to engage in violent behavior
* Stalking or following someone without their consent
* Unwelcome physical contact
* Unwelcome sexual or romantic attention or advances
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
The following behaviors are also prohibited:
* Providing knowingly false or misleading information in connection with a Code of Conduct investigation or otherwise intentionally tampering with an investigation.
* Retaliating against a person because they reported an incident or provided information about an incident as a witness.
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct.
By adopting this Code of Conduct, project maintainers commit themselves to fairly and consistently applying these principles to every aspect
of managing a CNCF project.
Project maintainers who do not follow or enforce the Code of Conduct may be temporarily or permanently removed from the project team.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Reporting
## Scope
For incidents occurring in the Kubernetes community, contact the [Kubernetes Code of Conduct Committee](https://git.k8s.io/community/committee-code-of-conduct) via <conduct@kubernetes.io>. You can expect a response within three business days.
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
For other projects, or for incidents that are project-agnostic or impact multiple CNCF projects, please contact the [CNCF Code of Conduct Committee](https://www.cncf.io/conduct/committee/) via <conduct@cncf.io>. Alternatively, you can contact any of the individual members of the [CNCF Code of Conduct Committee](https://www.cncf.io/conduct/committee/) to submit your report. For more detailed instructions on how to submit a report, including how to submit a report anonymously, please see our [Incident Resolution Procedures](https://github.com/cncf/foundation/blob/main/code-of-conduct/coc-incident-resolution-procedures.md). You can expect a response within three business days.
For incidents occurring at a CNCF event that is produced by the Linux Foundation, please contact <eventconduct@cncf.io>.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement.
All complaints will be reviewed and investigated promptly and fairly.
Upon review and investigation of a reported incident, the CoC response team that has jurisdiction will determine what action is appropriate based on this Code of Conduct and its related documentation.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
For information about which Code of Conduct incidents are handled by project leadership, which incidents are handled by the CNCF Code of Conduct Committee, and which incidents are handled by the Linux Foundation (including its events team), see our [Jurisdiction Policy](https://github.com/cncf/foundation/blob/main/code-of-conduct/coc-committee-jurisdiction-policy.md).
## Enforcement Guidelines
## Amendments
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
Consistent with the CNCF Charter, any substantive changes to this Code of Conduct must be approved by the Technical Oversight Committee.
### 1. Correction
## Acknowledgements
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
This Code of Conduct is adapted from the Contributor Covenant
(http://contributor-covenant.org), version 2.0 available at
http://contributor-covenant.org/version/2/0/code_of_conduct/

83
GOVERNANCE.md Normal file
@@ -0,0 +1,83 @@
The governance model adopted here is heavily influenced by a set of CNCF projects and draws in particular on
[Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md).
*Where structures are similar, some wording is borrowed from Kubernetes governance to preserve
the original meaning.*
## Principles
- **Open**: Krkn is an open source community.
- **Welcoming and respectful**: See [Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
- **Transparent and accessible**: Work and collaboration should be done in public.
Changes to the Krkn organization, Krkn code repositories, and CNCF-related activities (e.g.,
level and involvement) are made in public.
- **Merit**: Ideas and contributions are accepted according to their technical merit
and alignment with project objectives, scope and design principles.
## Code of Conduct
Krkn follows the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
Here is an excerpt:
> As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
## Maintainer Levels
### Contributor
Contributors contribute to the community. Anyone can become a contributor by participating in discussions, reporting bugs, or contributing code or documentation.
#### Responsibilities:
- Be active in the community and adhere to the Code of Conduct.
- Report bugs and suggest new features.
- Contribute high-quality code and documentation.
### Member
Members are active contributors to the community. Members have demonstrated a strong understanding of the project's codebase and conventions.
#### Responsibilities:
- Review pull requests for correctness, quality, and adherence to project standards.
- Provide constructive and timely feedback to contributors.
- Ensure that all contributions are well-tested and documented.
- Work with maintainers to ensure a smooth and efficient release process.
### Maintainer
Maintainers are responsible for the overall health and direction of the project. They are long-standing contributors who have shown a deep commitment to the project's success.
#### Responsibilities:
- Set the technical direction and vision for the project.
- Manage releases and ensure the stability of the main branch.
- Make decisions on feature inclusion and project priorities.
- Mentor other contributors and help grow the community.
- Resolve disputes and make final decisions when consensus cannot be reached.
### Owner
Owners have administrative access to the project and are the final decision-makers.
#### Responsibilities:
- Manage the core team of maintainers and approvers.
- Set the overall vision and strategy for the project.
- Handle administrative tasks, such as managing the project's repository and other resources.
- Represent the project in the broader open-source community.
# Credits
Sections of this document have been borrowed from [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md)

@@ -1,12 +1,34 @@
## Overview
This document contains a list of maintainers in this repo.
This file lists the maintainers and committers of the Krkn project.
In short, maintainers are people who are in charge of the maintenance of the Krkn project. Committers are active community members who have shown that they are committed to the continuous development of the project through ongoing engagement with the community.
For a detailed description of the roles, see the [Governance](./GOVERNANCE.md) page.
## Current Maintainers
| Maintainer | GitHub ID | Email |
|---------------------| --------------------------------------------------------- | ----------------------- |
| Ravi Elluri | [chaitanyaenr](https://github.com/chaitanyaenr) | nelluri@redhat.com |
| Pradeep Surisetty | [psuriset](https://github.com/psuriset) | psuriset@redhat.com |
| Paige Rubendall | [paigerube14](https://github.com/paigerube14) | prubenda@redhat.com |
| Tullio Sebastiani | [tsebastiani](https://github.com/tsebastiani) | tsebasti@redhat.com |
| Maintainer | GitHub ID | Email | Contribution Level |
|---------------------| --------------------------------------------------------- | ----------------------- | ---------------------- |
| Ravi Elluri | [chaitanyaenr](https://github.com/chaitanyaenr) | nelluri@redhat.com | Owner |
| Pradeep Surisetty | [psuriset](https://github.com/psuriset) | psuriset@redhat.com | Owner |
| Paige Patton | [paigerube14](https://github.com/paigerube14) | prubenda@redhat.com | Maintainer |
| Tullio Sebastiani | [tsebastiani](https://github.com/tsebastiani) | tsebasti@redhat.com | Maintainer |
| Yogananth Subramanian | [yogananth-subramanian](https://github.com/yogananth-subramanian) | ysubrama@redhat.com | Maintainer |
| Sahil Shah | [shahsahil264](https://github.com/shahsahil264) | sahshah@redhat.com | Member |
Note: All Krkn community members must follow our [Code of Conduct](./CODE_OF_CONDUCT.md).
## Contributor Ladder
This project follows a contributor ladder model, where contributors can take on more responsibilities as they gain experience and demonstrate their commitment to the project.
The roles are:
* Contributor: Anyone who contributes to the community, whether through code, docs, or issues.
* Member: A contributor who is active in the community and reviews pull requests.
* Maintainer: A contributor who is responsible for the overall health and direction of the project.
* Owner: A contributor who has administrative ownership of the project.

124
README.md
@@ -1,117 +1,29 @@
# Krkn aka Kraken
[![Docker Repository on Quay](https://quay.io/repository/redhat-chaos/krkn/status "Docker Repository on Quay")](https://quay.io/repository/redhat-chaos/krkn?tab=tags&tag=latest)
![Workflow-Status](https://github.com/redhat-chaos/krkn/actions/workflows/docker-image.yml/badge.svg)
![Workflow-Status](https://github.com/krkn-chaos/krkn/actions/workflows/docker-image.yml/badge.svg)
![coverage](https://krkn-chaos.github.io/krkn-lib-docs/coverage_badge_krkn.svg)
![action](https://github.com/krkn-chaos/krkn/actions/workflows/tests.yml/badge.svg)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/10548/badge)](https://www.bestpractices.dev/projects/10548)
![Krkn logo](media/logo.png)
Chaos and resiliency testing tool for Kubernetes and OpenShift.
Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check if it is resilient to turbulent conditions.
Chaos and resiliency testing tool for Kubernetes.
Kraken injects deliberate failures into Kubernetes clusters to check whether they are resilient to turbulent conditions.
### Workflow
![Kraken workflow](media/kraken-workflow.png)
### Demo
[![Kraken demo](media/KrakenStarting.png)](https://youtu.be/LN-fZywp_mo "Kraken Demo - Click to Watch!")
![Kraken workflow](media/kraken-workflow.png)
### Chaos Testing Guide
[Guide](docs/index.md) encapsulates:
- Test methodology that needs to be embraced.
- Best practices that an OpenShift cluster, platform and applications running on top of it should take into account for best user experience, performance, resilience and reliability.
- Tooling.
- Scenarios supported.
- Test environment recommendations as to how and where to run chaos tests.
- Chaos testing in practice.
The guide is hosted at https://redhat-chaos.github.io/krkn.
<!-- ### Demo
[![Kraken demo](media/KrakenStarting.png)](https://youtu.be/LN-fZywp_mo "Kraken Demo - Click to Watch!") -->
### How to Get Started
Instructions on how to setup, configure and run Kraken can be found at [Installation](docs/installation.md).
See the [getting started doc](docs/getting_started.md) on support on how to get started with your own custom scenario or editing current scenarios for your specific usage.
After installation, refer back to the below sections for supported scenarios and how to tweak the kraken config to load them on your cluster.
Instructions on how to set up, configure, and run Kraken can be found in the [documentation](https://krkn-chaos.dev/docs/).
#### Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to [Kraken-hub](https://github.com/redhat-chaos/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
### Setting up infrastructure dependencies
Kraken indexes the metrics specified in the profile into Elasticsearch in addition to leveraging Cerberus for understanding the health of the Kubernetes/OpenShift cluster under test. More information on the features is documented below. The infrastructure pieces can be easily installed and uninstalled by running:
```
$ cd kraken
$ podman-compose up or $ docker-compose up # Spins up the containers specified in the docker-compose.yml file present in the run directory.
$ podman-compose down or $ docker-compose down # Delete the containers installed.
```
This will manage the Cerberus and Elasticsearch containers on the host on which you are running Kraken.
**NOTE**: Make sure you have enough resources (memory and disk) on the machine on which the containers are running, as Elasticsearch is resource intensive. Cerberus monitors the system components by default; the [config](config/cerberus.yaml) can be tweaked to add application namespaces, routes and other components to monitor as well. The command will keep running until killed since detached mode is not supported as of now.
### Config
Instructions on how to set up the config and the options supported can be found at [Config](docs/config.md).
### Kubernetes/OpenShift chaos scenarios supported
Scenario type | Kubernetes | OpenShift
--------------------------- | ------------- |--------------------|
[Pod Scenarios](docs/pod_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Pod Network Scenarios](docs/pod_network_scenarios.md) | :x: | :heavy_check_mark: |
[Container Scenarios](docs/container_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Node Scenarios](docs/node_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Time Scenarios](docs/time_scenarios.md) | :x: | :heavy_check_mark: |
[Hog Scenarios: CPU, Memory](docs/arcaflow_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Cluster Shut Down Scenarios](docs/cluster_shut_down_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Service Disruption Scenarios](docs/service_disruption_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Zone Outage Scenarios](docs/zone_outage.md) | :heavy_check_mark: | :heavy_check_mark: |
[Application_outages](docs/application_outages.md) | :heavy_check_mark: | :heavy_check_mark: |
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: | :heavy_check_mark: |
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: | :heavy_check_mark: |
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: | :question: |
### Kraken scenario pass/fail criteria and report
It is important to check whether the targeted component recovered from the chaos injection and whether the Kubernetes/OpenShift cluster is healthy, as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation); alternatively, Cerberus can be installed from Kraken using these [instructions](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set `cerberus_enabled` to True and `cerberus_url` to the URL where Cerberus publishes the go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fail the run if it encounters downtime, since that indicates potential downtime in a customer's or user's environment as well. This is especially important during control plane chaos scenarios involving the API server, Etcd, Ingress etc. Route monitoring can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
- Leveraging the [kube-burner](docs/alerts.md) alerting feature to fail runs in case of critical alerts.
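The go/no-go check described above can be sketched roughly as follows. This is a minimal illustration, not Kraken's actual implementation: it assumes the Cerberus endpoint returns a plain-text body of `True` or `False`, and the function names `parse_go_no_go` and `cluster_is_healthy` are hypothetical.

```python
from urllib.request import urlopen


def parse_go_no_go(body: str) -> bool:
    """Interpret the response body; assumes Cerberus answers "True" or "False"."""
    return body.strip().lower() == "true"


def cluster_is_healthy(cerberus_url: str, timeout: float = 3.0) -> bool:
    """Fetch the signal from cerberus_url; any fetch error is treated as no-go."""
    try:
        with urlopen(cerberus_url, timeout=timeout) as response:
            return parse_go_no_go(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Unreachable endpoint or malformed URL: fail safe and report no-go.
        return False
```

Treating an unreachable endpoint as no-go is a design choice made for this sketch; a run that cannot confirm cluster health is then reported as failed rather than silently passed.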
### Signaling
In CI runs or any external job, it is useful to stop Kraken once a certain test or state is reached. Kraken can be signaled to pause the chaos or stop it completely using a signal posted to a port of your choice.
For example, if a test run is loading the cluster while Kraken runs separately, we want to be able to start or stop the Kraken run based on when the test run completes or reaches a certain loaded state.
More detailed information on enabling and leveraging this feature can be found [here](docs/signal.md).
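As a rough sketch of what an external job might do, the snippet below posts a signal to the configured address and port. It assumes the signal name is sent as the URL path (for example `/STOP`) and that `RUN`, `PAUSE` and `STOP` are the accepted values; both are assumptions to verify against docs/signal.md, and the helper names are hypothetical.

```python
from urllib.request import Request, urlopen

# Assumed set of signals; check docs/signal.md for the authoritative list.
VALID_SIGNALS = {"RUN", "PAUSE", "STOP"}


def build_signal_url(address: str, port: int, signal: str) -> str:
    """Build the URL for a signal, assuming the signal name is the URL path."""
    name = signal.upper()
    if name not in VALID_SIGNALS:
        raise ValueError(f"unsupported signal: {signal!r}")
    return f"http://{address}:{port}/{name}"


def send_signal(address: str, port: int, signal: str, timeout: float = 5.0) -> int:
    """POST the signal with an empty body and return the HTTP status code."""
    request = Request(build_signal_url(address, port, signal), data=b"", method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A CI job could call `send_signal("0.0.0.0", 8081, "PAUSE")` once the load test reaches the desired state, where the address and port match `signal_address` and `port` in the Kraken config.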
### Performance monitoring
Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to finding bottlenecks, as it is important to make sure the cluster is healthy in terms of both recovery and performance during and after the failure injection. Instructions on enabling it can be found [here](docs/performance_dashboards.md).
### Scraping and storing metrics long term
Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes them into Elasticsearch to be able to store and evaluate the state of the runs long term. The indexed metrics can be visualized with the help of Grafana. It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query Prometheus (installed by default in OpenShift) with the start and end timestamp of the run. Information on enabling and leveraging this feature can be found [here](docs/metrics.md).
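To make the "start and end timestamp" idea concrete, here is a small sketch that builds a range query URL against the standard Prometheus HTTP API (`/api/v1/query_range`). It only illustrates the kind of query involved; it is not how Kraken or Kube-burner issue the request, and the function name, host and default step are hypothetical.

```python
from urllib.parse import urlencode


def build_range_query_url(prometheus_url: str, expr: str, start: int, end: int,
                          step_seconds: int = 30) -> str:
    """Build a query_range URL for one expression between two Unix timestamps."""
    params = urlencode({
        "query": expr,         # PromQL expression taken from a metrics profile
        "start": start,        # run start, Unix seconds
        "end": end,            # run end, Unix seconds
        "step": step_seconds,  # resolution of the returned series
    })
    return f"{prometheus_url.rstrip('/')}/api/v1/query_range?{params}"
```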
### SLOs validation during and post chaos
- In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with the Prometheus expressions to validate, then alerts or exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics.
- Kraken also provides the ability to check whether any critical alerts are firing in the cluster post chaos and to pass or fail the run accordingly.
Information on enabling and leveraging this feature can be found [here](docs/SLOs_validation.md).
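The severity-driven pass/fail decision can be pictured with the sketch below. Which severities actually fail a run is an assumption here (`error` and `critical`), exposed as a parameter; the function name and the alert dictionaries are hypothetical and do not mirror Kraken's internals.

```python
def exit_code_for(fired_alerts, failing_severities=("error", "critical")):
    """Return 1 if any fired alert has a failing severity, otherwise 0."""
    for alert in fired_alerts:
        if alert["severity"] in failing_severities:
            return 1
    return 0


# Hypothetical example: one warning and one critical alert fired during a run.
fired = [
    {"description": "etcd leader changes observed", "severity": "warning"},
    {"description": "etcd cluster database is running full.", "severity": "critical"},
]
return_code = exit_code_for(fired)
```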
### OCM / ACM integration
Kraken supports injecting faults into [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management) managed clusters through [ManagedCluster Scenarios](docs/managedcluster_scenarios.md).
### Blogs and other useful resources
- Blog post on introduction to Kraken: https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes
- Discussion and demo on how Kraken can be leveraged to ensure OpenShift is reliable, performant and scalable: https://www.youtube.com/watch?v=s1PvupI5sD0&ab_channel=OpenShift
- Blog post emphasizing the importance of making Chaos part of Performance and Scale runs to mimic the production environments: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests
- Blog post on findings from Chaos test runs: https://cloud.redhat.com/blog/openshift/kubernetes-chaos-stories
### Blogs, podcasts and interviews
Additional resources, including blog posts, podcasts, and community interviews, can be found on the [website](https://krkn-chaos.dev/blog).
### Roadmap
@@ -121,13 +33,11 @@ Enhancements being planned can be found in the [roadmap](ROADMAP.md).
### Contributions
We are always looking for enhancements and fixes to make Kraken better; any contributions are most welcome. Feel free to report or work on the issues filed on GitHub.
[More information on how to Contribute](docs/contribute.md)
If adding a new scenario or tweaking the main config, be sure to update the CI accordingly so that it stays up to date.
Please read [this file](CI/README.md#adding-a-test-case) for more information on updates.
[More information on how to Contribute](https://krkn-chaos.dev/docs/contribution-guidelines/)
### Community
Key Members(slack_usernames/full name): paigerube14/Paige Rubendall, mffiedler/Mike Fiedler, ravielluri/Naga Ravi Chaitanya Elluri.
* [**#krkn on Kubernetes Slack**](https://kubernetes.slack.com)
* [**#forum-chaos on CoreOS Slack internal to Red Hat**](https://coreos.slack.com)
Key Members(slack_usernames/full name): paigerube14/Paige Rubendall, mffiedler/Mike Fiedler, tsebasti/Tullio Sebastiani, yogi/Yogananth Subramanian, sahil/Sahil Shah, pradeep/Pradeep Surisetty and ravielluri/Naga Ravi Chaitanya Elluri.
* [**#krkn on Kubernetes Slack**](https://kubernetes.slack.com/messages/C05SFMHRWK1)
The Linux Foundation® (TLF) has registered trademarks and uses trademarks. For a list of TLF trademarks, see [Trademark Usage](https://www.linuxfoundation.org/legal/trademark-usage).

55
RELEASE.md Normal file

@@ -0,0 +1,55 @@
### Release Protocol: The Community-First Cycle
This document outlines the project's release protocol, a methodology designed to ensure a responsive and transparent development process that is closely aligned with the needs of our users and contributors. This protocol is tailored for projects in their early stages, prioritizing agility and community feedback over a rigid, time-boxed schedule.
#### 1. Key Principles
* **Community as the Compass:** The primary driver for all development is feedback from our user and contributor community.
* **Prioritization by Impact:** Tasks are prioritized based on their impact on user experience, the urgency of bug fixes, and the value of community-contributed features.
* **Event-Driven Releases:** Releases are not bound by a fixed calendar. New versions are published when a significant body of work is complete, a critical issue is resolved, or a new feature is ready for adoption.
* **Transparency and Communication:** All development decisions, progress, and plans are communicated openly through our issue tracker, pull requests, and community channels.
#### 2. The Release Lifecycle
The release cycle is a continuous flow of activities rather than a series of sequential phases.
**2.1. Discovery & Prioritization**
* New features and bug fixes are identified through user feedback on our issue tracker, community discussions, and direct contributions.
* The core maintainers, in collaboration with the community, continuously evaluate and tag issues to create an open and dynamic backlog.
**2.2. Development & Code Review**
* Work is initiated based on the highest-priority items in the backlog.
* All code contributions are made via pull requests (PRs).
* PRs are reviewed by maintainers and other contributors to ensure code quality, adherence to project standards, and overall stability.
**2.3. Release Readiness**
A new release is considered ready when one of the following conditions is met:
* A major new feature has been completed and thoroughly tested.
* A critical security vulnerability or bug has been addressed.
* A sufficient number of smaller improvements and fixes have been merged, providing meaningful value to users.
**2.4. Versioning**
We adhere to [**Semantic Versioning 2.0.0**](https://semver.org/).
* **Major version (`X.y.z`)**: Reserved for releases that introduce breaking changes.
* **Minor version (`x.Y.z`)**: Used for new features or significant non-breaking changes.
* **Patch version (`x.y.Z`)**: Used for bug fixes and small, non-functional improvements.
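The versioning rules above can be illustrated with a short sketch. The change categories (`breaking`, `feature`, `fix`) and the function name are labels chosen for this example only; they are not part of the release tooling.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version for a change type, following Semantic Versioning."""
    # Accept an optional leading "v", as in tags such as v1.4.7.
    bare = version[1:] if version.startswith("v") else version
    major, minor, patch = (int(part) for part in bare.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"           # major bump resets minor and patch
    if change == "feature":
        return f"{major}.{minor + 1}.0"     # minor bump resets patch
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For instance, a bug fix on top of `v1.4.7` would yield `1.4.8`, while a breaking change would yield `2.0.0`.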
#### 3. Roles and Responsibilities
* **Members:** The [core team](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md) responsible for the project's health. Their duties include:
    * Reviewing pull requests.
    * Contributing code and documentation via pull requests.
    * Engaging in discussions and providing feedback.
* **Maintainers and Owners:** The [core team](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md) responsible for the project's health. Their duties include:
    * Facilitating community discussions and prioritization.
    * Reviewing and merging pull requests.
    * Cutting and announcing official releases.
* **Contributors:** The community. Their duties include:
    * Reporting bugs and suggesting new features.
    * Contributing code and documentation via pull requests.
    * Engaging in discussions and providing feedback.
#### 4. Adoption and Future Evolution
This protocol is designed for the current stage of the project. As the project matures and the contributor base grows, the maintainers will evaluate the need for a more structured methodology to ensure continued scalability and stability.


@@ -2,12 +2,19 @@
Following is a list of enhancements that we are planning to add support for in Krkn. Of course, any help or contributions are greatly appreciated.
- [ ] [Ability to run multiple chaos scenarios in parallel under load to mimic real world outages](https://github.com/redhat-chaos/krkn/issues/424)
- [x] [Centralized storage for chaos experiments artifacts](https://github.com/redhat-chaos/krkn/issues/423)
- [ ] [Support for causing DNS outages](https://github.com/redhat-chaos/krkn/issues/394)
- [ ] [Support for pod level network traffic shaping](https://github.com/redhat-chaos/krkn/issues/393)
- [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/redhat-chaos/krkn/issues/124)
- [ ] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/redhat-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [ ] Continue to improve [Chaos Testing Guide](https://redhat-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- [ ] [Switch documentation references to Kubernetes](https://github.com/redhat-chaos/krkn/issues/495)
- [ ] [OCP and Kubernetes functionalities segregation](https://github.com/redhat-chaos/krkn/issues/497)
- [x] [Ability to run multiple chaos scenarios in parallel under load to mimic real world outages](https://github.com/krkn-chaos/krkn/issues/424)
- [x] [Centralized storage for chaos experiments artifacts](https://github.com/krkn-chaos/krkn/issues/423)
- [x] [Support for causing DNS outages](https://github.com/krkn-chaos/krkn/issues/394)
- [x] [Chaos recommender](https://github.com/krkn-chaos/krkn/tree/main/utils/chaos-recommender) to suggest scenarios having probability of impacting the service under test using profiling results
- [x] Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time [krkn-chaos-ai](https://github.com/krkn-chaos/krkn-chaos-ai)
- [x] [Support for pod level network traffic shaping](https://github.com/krkn-chaos/krkn/issues/393)
- [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/krkn-chaos/krkn/issues/124)
- [x] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [x] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- [x] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [x] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)
- [x] [AI Chat bot to help get started with Krkn and commands](https://github.com/krkn-chaos/krkn-lightspeed)
- [ ] [Ability to roll back cluster to original state if chaos fails](https://github.com/krkn-chaos/krkn/issues/804)
- [ ] Add recovery time metrics to each scenario for better regression analysis
- [ ] [Add resiliency scoring to chaos scenarios ran on cluster](https://github.com/krkn-chaos/krkn/issues/125)

43
SECURITY.md Normal file

@@ -0,0 +1,43 @@
# Security Policy
We attach great importance to code security. We are very grateful to the users, security vulnerability researchers, etc. for reporting security vulnerabilities to the Krkn community. All reported security vulnerabilities will be carefully assessed and addressed in a timely manner.
## Security Checks
Krkn leverages [Snyk](https://snyk.io/) to ensure that any security vulnerabilities found
in the code base and dependencies are fixed and published in the latest release. Security
vulnerability checks are enabled for each pull request to enable developers to get insights
and proactively fix them.
## Reporting a Vulnerability
The Krkn project treats security vulnerabilities seriously, so we
strive to take action quickly when required.
The project requests that security issues be disclosed in a responsible
manner to allow adequate time to respond. If a security issue or
vulnerability has been found, please disclose the details to our
dedicated email address:
cncf-krkn-maintainers@lists.cncf.io
You can also use the [GitHub vulnerability report mechanism](https://docs.github.com/en/code-security/security-advisories/guidance-on-reporting-and-writing-information-about-vulnerabilities/privately-reporting-a-security-vulnerability#privately-reporting-a-security-vulnerability) to report the security vulnerability.
Please include as much information as possible with the report. The
following details assist with analysis efforts:
- Description of the vulnerability
- Affected component (version, commit, branch etc)
- Affected code (file path, line numbers)
- Exploit code
## Security Team
The security team currently consists of the [Maintainers of Krkn](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md).
## Process and Supported Releases
The Krkn security team will investigate and provide a fix in a timely manner depending on the severity. The fix will be included in the new release of Krkn and details will be included in the release notes.

129
config/alerts.yaml Normal file

@@ -0,0 +1,129 @@
# etcd
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[10m:]) > 0.01
description: 10 minutes avg. 99th etcd fsync latency on {{$labels.pod}} higher than 10ms. {{$value}}s
severity: warning
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[10m:]) > 1
description: 10 minutes avg. 99th etcd fsync latency on {{$labels.pod}} higher than 1s. {{$value}}s
severity: error
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[2m]))[10m:]) > 0.03
description: 10 minutes avg. 99th etcd commit latency on {{$labels.pod}} higher than 30ms. {{$value}}s
severity: warning
- expr: rate(etcd_server_leader_changes_seen_total[2m]) > 0
description: etcd leader changes observed
severity: warning
- expr: (last_over_time(etcd_mvcc_db_total_size_in_bytes[5m]) / last_over_time(etcd_server_quota_backend_bytes[5m]))*100 > 95
description: etcd cluster database is running full.
severity: critical
- expr: (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes[5m]) / last_over_time(etcd_mvcc_db_total_size_in_bytes[5m])) < 0.5
description: etcd database size in use is less than 50% of the actual allocated storage.
severity: warning
- expr: rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
description: etcd cluster has high number of proposal failures.
severity: warning
- expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.15
description: etcd cluster member communication is slow.
severity: warning
- expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_method!="Defragment", grpc_type="unary"}[5m])) without(grpc_type)) > 0.15
description: etcd grpc requests are slow.
severity: critical
- expr: 100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code) / sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code) > 5
description: etcd cluster has high number of failed grpc requests.
severity: critical
- expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
description: etcd cluster has no leader.
severity: warning
- expr: sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)
description: etcd cluster has insufficient number of members.
severity: warning
- expr: max without (endpoint) ( sum without (instance) (up{job=~".*etcd.*"} == bool 0) or count without (To) ( sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01 )) > 0
description: etcd cluster members are down.
severity: warning
# API server
- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"POST|PUT|DELETE|PATCH", subresource!~"log|exec|portforward|attach|proxy"}[2m])) by (le, resource, verb))[10m:]) > 1
description: 10 minutes avg. 99th mutating API call latency for {{$labels.verb}}/{{$labels.resource}} higher than 1 second. {{$value}}s
severity: error
- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"LIST|GET", subresource!~"log|exec|portforward|attach|proxy", scope="resource"}[2m])) by (le, resource, verb, scope))[5m:]) > 1
description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 1 second. {{$value}}s
severity: error
- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"LIST|GET", subresource!~"log|exec|portforward|attach|proxy", scope="namespace"}[2m])) by (le, resource, verb, scope))[5m:]) > 5
description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 5 seconds. {{$value}}s
severity: error
- expr: avg_over_time(histogram_quantile(0.99, sum(irate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb=~"LIST|GET", subresource!~"log|exec|portforward|attach|proxy", scope="cluster"}[2m])) by (le, resource, verb, scope))[5m:]) > 30
description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 30 seconds. {{$value}}s
severity: error
# Control plane pods
- expr: up{job=~"crio|kubelet"} == 0
description: "{{$labels.node}}/{{$labels.job}} down"
severity: warning
- expr: up{job="ovnkube-node"} == 0
description: "{{$labels.instance}}/{{$labels.pod}} {{$labels.job}} down"
severity: warning
# Service sync latency
- expr: histogram_quantile(0.99, sum(rate(kubeproxy_network_programming_duration_seconds_bucket[2m])) by (le)) > 10
description: 99th Kubeproxy network programming latency higher than 10 seconds. {{$value}}s
severity: warning
# Prometheus alerts
- expr: ALERTS{severity="critical", alertstate="firing"} > 0
description: Critical prometheus alert. {{$labels.alertname}}
severity: warning
# etcd CPU usage increase
- expr: sum(rate(container_cpu_usage_seconds_total{image!='', namespace='openshift-etcd', container='etcd'}[1m])) * 100 / sum(machine_cpu_cores) > 5
description: Etcd CPU usage increased significantly by {{$value}}%
severity: critical
# etcd memory usage increase
- expr: sum(deriv(container_memory_usage_bytes{image!='', namespace='openshift-etcd', container='etcd'}[5m])) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: Etcd memory usage increased significantly by {{$value}}%
severity: critical
# Openshift API server CPU and memory usage increase
- expr: sum(rate(container_cpu_usage_seconds_total{image!='', namespace='openshift-apiserver', container='openshift-apiserver'}[1m])) * 100 / sum(machine_cpu_cores) > 5
description: openshift apiserver cpu usage increased significantly by {{$value}}%
severity: critical
- expr: (sum(deriv(container_memory_usage_bytes{namespace='openshift-apiserver', container='openshift-apiserver'}[5m]))) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: openshift apiserver memory usage increased significantly by {{$value}}%
severity: critical
# Openshift kube API server CPU and memory usage increase
- expr: sum(rate(container_cpu_usage_seconds_total{image!='', namespace='openshift-kube-apiserver', container='kube-apiserver'}[1m])) * 100 / sum(machine_cpu_cores) > 5
description: openshift kube apiserver cpu usage increased significantly by {{$value}}%
severity: critical
- expr: (sum(deriv(container_memory_usage_bytes{namespace='openshift-kube-apiserver', container='kube-apiserver'}[5m]))) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: openshift kube apiserver memory usage increased significantly by {{$value}}%
severity: critical
# Master node CPU usage increase
- expr: (sum((sum(deriv(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod) * on(pod, namespace) group_left(node) (node_namespace_pod:kube_pod_info:) ) * on(node) group_left(role) (max by (node) (kube_node_role{role="master"})))) * 100 / sum(machine_cpu_cores) > 5
description: master nodes cpu usage increased significantly by {{$value}}%
severity: critical
# Master nodes memory usage increase
- expr: (sum((sum(deriv(container_memory_usage_bytes{container="",pod!=""}[5m])) BY (namespace, pod) * on(pod, namespace) group_left(node) (node_namespace_pod:kube_pod_info:) ) * on(node) group_left(role) (max by (node) (kube_node_role{role="master"})))) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: master nodes memory usage increased significantly by {{$value}}%
severity: critical


@@ -99,3 +99,41 @@
- expr: ALERTS{severity="critical", alertstate="firing"} > 0
description: Critical prometheus alert. {{$labels.alertname}}
severity: warning
# etcd CPU usage increase
- expr: sum(rate(container_cpu_usage_seconds_total{image!='', namespace='openshift-etcd', container='etcd'}[1m])) * 100 / sum(machine_cpu_cores) > 5
description: Etcd CPU usage increased significantly by {{$value}}%
severity: critical
# etcd memory usage increase
- expr: sum(deriv(container_memory_usage_bytes{image!='', namespace='openshift-etcd', container='etcd'}[5m])) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: Etcd memory usage increased significantly by {{$value}}%
severity: critical
# Openshift API server CPU and memory usage increase
- expr: sum(rate(container_cpu_usage_seconds_total{image!='', namespace='openshift-apiserver', container='openshift-apiserver'}[1m])) * 100 / sum(machine_cpu_cores) > 5
description: openshift apiserver cpu usage increased significantly by {{$value}}%
severity: critical
- expr: (sum(deriv(container_memory_usage_bytes{namespace='openshift-apiserver', container='openshift-apiserver'}[5m]))) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: openshift apiserver memory usage increased significantly by {{$value}}%
severity: critical
# Openshift kube API server CPU and memory usage increase
- expr: sum(rate(container_cpu_usage_seconds_total{image!='', namespace='openshift-kube-apiserver', container='kube-apiserver'}[1m])) * 100 / sum(machine_cpu_cores) > 5
description: openshift kube apiserver cpu usage increased significantly by {{$value}}%
severity: critical
- expr: (sum(deriv(container_memory_usage_bytes{namespace='openshift-kube-apiserver', container='kube-apiserver'}[5m]))) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: openshift kube apiserver memory usage increased significantly by {{$value}}%
severity: critical
# Master node CPU usage increase
- expr: (sum((sum(deriv(pod:container_cpu_usage:sum{container="",pod!=""}[5m])) BY (namespace, pod) * on(pod, namespace) group_left(node) (node_namespace_pod:kube_pod_info:) ) * on(node) group_left(role) (max by (node) (kube_node_role{role="master"})))) * 100 / sum(machine_cpu_cores) > 5
description: master nodes cpu usage increased significantly by {{$value}}%
severity: critical
# Master nodes memory usage increase
- expr: (sum((sum(deriv(container_memory_usage_bytes{container="",pod!=""}[5m])) BY (namespace, pod) * on(pod, namespace) group_left(node) (node_namespace_pod:kube_pod_info:) ) * on(node) group_left(role) (max by (node) (kube_node_role{role="master"})))) * 100 / sum(node_memory_MemTotal_bytes) > 5
description: master nodes memory usage increased significantly by {{$value}}%
severity: critical


@@ -39,7 +39,7 @@ cerberus:
Sunday:
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
custom_checks: # Relative paths of files conataining additional user defined checks
custom_checks: # Relative paths of files containing additional user defined checks
tunings:
timeout: 3 # Number of seconds before requests fail


@@ -1,89 +1,104 @@
kraken:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
auto_rollback: True # Enable auto rollback for scenarios.
rollback_versions_directory: /tmp/kraken-rollback # Directory to store rollback version files.
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer to docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
port: 8081 # Signal port
chaos_scenarios:
# List of policies/chaos scenarios to load
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
- scenarios/arcaflow/memory-hog/input.yaml
- scenarios/arcaflow/io-hog/input.yaml
- application_outages:
- scenarios/openshift/app_outage.yaml
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/container_etcd.yml
- plugin_scenarios:
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/vmware_node_scenarios.yml
- scenarios/openshift/network_chaos_ingress.yml
- scenarios/openshift/prom_kill.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- litmus_scenarios: # List of litmus scenarios to load
- - scenarios/openshift/templates/litmus-rbac.yaml
- scenarios/openshift/node_cpu_hog_engine.yaml
- - scenarios/openshift/templates/litmus-rbac.yaml
- scenarios/openshift/node_mem_engine.yaml
- - scenarios/openshift/templates/litmus-rbac.yaml
- scenarios/openshift/node_io_engine.yaml
- cluster_shut_down_scenarios:
- - scenarios/openshift/cluster_shut_down_scenario.yml
- scenarios/openshift/post_action_shut_down.py
- service_disruption_scenarios:
- - scenarios/openshift/regex_namespace.yaml
- - scenarios/openshift/ingress_namespace.yaml
- scenarios/openshift/post_action_namespace.py
- zone_outages:
- scenarios/openshift/zone_outage.yaml
- pvc_scenarios:
- scenarios/openshift/pvc_scenario.yaml
- network_chaos:
- scenarios/openshift/network_chaos.yaml
# List of policies/chaos scenarios to load
- hog_scenarios:
- scenarios/kube/cpu-hog.yml
- scenarios/kube/memory-hog.yml
- scenarios/kube/io-hog.yml
- application_outages_scenarios:
- scenarios/openshift/app_outage.yaml
- container_scenarios: # List of chaos pod scenarios to load
- scenarios/openshift/container_etcd.yml
- pod_network_scenarios:
- scenarios/openshift/network_chaos_ingress.yml
- scenarios/openshift/pod_network_outage.yml
- pod_disruption_scenarios:
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/prom_kill.yml
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/aws_node_scenarios.yml
- scenarios/openshift/vmware_node_scenarios.yml
- scenarios/openshift/ibmcloud_node_scenarios.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- cluster_shut_down_scenarios:
- scenarios/openshift/cluster_shut_down_scenario.yml
- service_disruption_scenarios:
- scenarios/openshift/regex_namespace.yaml
- scenarios/openshift/ingress_namespace.yaml
- zone_outages_scenarios:
- scenarios/openshift/zone_outage.yaml
- pvc_scenarios:
- scenarios/openshift/pvc_scenario.yaml
- network_chaos_scenarios:
- scenarios/openshift/network_chaos.yaml
- service_hijacking_scenarios:
- scenarios/kube/service_hijacking.yaml
- syn_flood_scenarios:
- scenarios/kube/syn_flood.yaml
- network_chaos_ng_scenarios:
- scenarios/kube/pod-network-filter.yml
- scenarios/kube/node-network-filter.yml
- kubevirt_vm_outage:
- scenarios/kubevirt/kubevirt-vm-outage.yaml
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
kube_burner_binary_url: "https://github.com/cloud-bulldozer/kube-burner/releases/download/v1.7.0/kube-burner-1.7.0-Linux-x86_64.tar.gz"
capture_metrics: False
config_path: config/kube_burner.yaml # Define the Elasticsearch url and index name in this config
metrics_profile_path: config/metrics-aggregated.yaml
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_url: '' # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path or URL to alert profile with the prometheus queries
enable_metrics: False
alert_profile: config/alerts.yaml # Path or URL to alert profile with the prometheus queries
metrics_profile: config/metrics-report.yaml
check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos
elastic:
enable_elastic: False
verify_certs: False
elastic_url: "" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
elastic_port: 32766
username: "elastic"
password: "test"
metrics_index: "krkn-metrics"
alerts_index: "krkn-alerts"
telemetry_index: "krkn-telemetry"
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
wait_duration: 1 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios
daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
enabled: False # enable/disables the telemetry collection feature
api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production #telemetry service endpoint
username: username # telemetry service username
password: password # telemetry service password
password: password # telemetry service password
prometheus_backup: True # enables/disables prometheus data collection
prometheus_namespace: "" # namespace where prometheus is deployed (if distribution is kubernetes)
prometheus_container_name: "" # name of the prometheus container name (if distribution is kubernetes)
prometheus_pod_name: "" # name of the prometheus pod (if distribution is kubernetes)
full_prometheus_backup: False # if is set to False only the /prometheus/wal folder will be downloaded.
backup_threads: 5 # number of telemetry download/upload threads
archive_path: /tmp # local path where the archive files will be temporarly stored
archive_path: /tmp # local path where the archive files will be temporarily stored
max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
archive_size: 10000 # the size of the prometheus data archive size in KB. The lower the size of archive is
archive_size: 500000
telemetry_group: '' # if set will archive the telemetry in the S3 bucket on a folder named after the value, otherwise will use "default"
# the size of the prometheus data archive size in KB. The lower the size of archive is
# the higher the number of archive files will be produced and uploaded (and processed by backup_threads
# simultaneously).
# For unstable/slow connection is better to keep this value low
@@ -95,4 +110,22 @@ telemetry:
- "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+" # kinit 2023/09/15 11:20:36 log
- "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+" # 2023-09-15T11:20:36.123425532Z log
oc_cli_path: /usr/bin/oc # optional, if not specified will be search in $PATH
events_backup: True # enables/disables cluster events collection
health_checks: # Utilizing health check endpoints to observe application behavior during chaos injection.
interval: # Interval in seconds to perform health checks, default value is 2 seconds
config: # Provide list of health check configurations for applications
- url: # Provide application endpoint
bearer_token: # Bearer token for authentication if any
auth: # Provide authentication credentials (username , password) in tuple format if any, ex:("admin","secretpassword")
exit_on_failure: # If value is True exits when health check failed for application, values can be True/False
kubevirt_checks: # Utilizing virt check endpoints to observe ssh ability to VMI's during chaos injection.
interval: 2 # Interval in seconds to perform virt checks, default value is 2 seconds
namespace: # Namespace where to find VMI's
name: # Regex Name style of VMI's to watch, optional, will watch all VMI names in the namespace if left blank
only_failures: False # Boolean of whether to show all VMI's failures and successful ssh connection (False), or only failure status' (True)
disconnected: False # Boolean of how to try to connect to the VMIs; if True will use the ip_address to try ssh from within a node, if false will use the name and uses virtctl to try to connect; Default is False
ssh_node: "" # If set, will be a backup way to ssh to a node. Will want to set to a node that isn't targeted in chaos
node_names: ""
exit_on_failure: # If value is True and VMI's are failing post chaos returns failure, values can be True/False
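
The archive_size comment in this file says that a smaller archive size produces more archive files, which backup_threads then process in parallel. A rough arithmetic sketch of that trade-off follows. It assumes the Prometheus data is split into equally sized archives and uploaded in waves of backup_threads at a time. The 2,000,000 KB total and both function names are hypothetical, not taken from krkn.

```python
import math


def archive_count(total_kb, archive_size_kb):
    """Number of archive files needed for total_kb of data (assumes an even split)."""
    if archive_size_kb <= 0:
        raise ValueError("archive_size_kb must be positive")
    return math.ceil(total_kb / archive_size_kb)


def upload_waves(archives, backup_threads):
    """Number of upload rounds if backup_threads archives are sent at a time."""
    if backup_threads <= 0:
        raise ValueError("backup_threads must be positive")
    return math.ceil(archives / backup_threads)


# Hypothetical 2,000,000 KB of Prometheus data, backup_threads: 5
old_default = archive_count(2_000_000, 10_000)    # archive_size: 10000
new_default = archive_count(2_000_000, 500_000)   # archive_size: 500000
print(old_default, upload_waves(old_default, 5))
print(new_default, upload_waves(new_default, 5))
```

Under these assumptions the larger default yields far fewer files, while the smaller value keeps each retry cheap because only the failed chunk is uploaded again.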

@@ -6,35 +6,34 @@ kraken:
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- plugin_scenarios:
- scenarios/kind/scheduler.yml
- node_scenarios:
- scenarios/kind/node_scenarios_example.yml
- pod_disruption_scenarios:
- scenarios/kube/pod.yml
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
kube_burner_binary_url: "https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz"
capture_metrics: False
config_path: config/kube_burner.yaml # Define the Elasticsearch url and index name in this config
metrics_profile_path: config/metrics-aggregated.yaml
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path to alert profile with the prometheus queries
alert_profile: config/alerts.yaml # Path to alert profile with the prometheus queries
elastic:
enable_elastic: False
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios
daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
enabled: False # enable/disables the telemetry collection feature
archive_path: /tmp # local path where the archive files will be temporarily stored
events_backup: False # enables/disables cluster events collection
logs_backup: False
health_checks: # Utilizing health check endpoints to observe application behavior during chaos injection.

@@ -5,33 +5,23 @@ kraken:
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/kube/container_dns.yml
- scenarios/kube/container_dns.yml
- plugin_scenarios:
- scenarios/kube/scheduler.yml
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
kube_burner_binary_url: "https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz"
capture_metrics: False
config_path: config/kube_burner.yaml # Define the Elasticsearch url and index name in this config
metrics_profile_path: config/metrics-aggregated.yaml
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path to alert profile with the prometheus queries
alert_profile: config/alerts.yaml # Path to alert profile with the prometheus queries
check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos after soak time for the cluster to settle down
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario

@@ -6,27 +6,20 @@ kraken:
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
port: 8081 # Signal port
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- plugin_scenarios: # List of chaos pod scenarios to load
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/prom_kill.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- scenarios/openshift/node_scenarios_example.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- litmus_scenarios: # List of litmus scenarios to load
- - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
- scenarios/openshift/node_cpu_hog_engine.yaml
- cluster_shut_down_scenarios:
- - scenarios/openshift/cluster_shut_down_scenario.yml
- scenarios/openshift/post_action_shut_down.py
- scenarios/openshift/cluster_shut_down_scenario.yml
- service_disruption_scenarios:
- scenarios/openshift/regex_namespace.yaml
- scenarios/openshift/ingress_namespace.yaml
@@ -42,22 +35,49 @@ kraken:
cerberus:
cerberus_enabled: True # Enable it when cerberus is previously installed
cerberus_url: http://0.0.0.0:8080 # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
deploy_dashboards: True # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
kube_burner_binary_url: "https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz"
capture_metrics: True
config_path: config/kube_burner.yaml # Define the Elasticsearch url and index name in this config
metrics_profile_path: config/metrics-aggregated.yaml
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set
enable_alerts: True # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path to alert profile with the prometheus queries
alert_profile: config/alerts.yaml # Path to alert profile with the prometheus queries
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios
daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
enabled: False # enable/disables the telemetry collection feature
api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production #telemetry service endpoint
username: username # telemetry service username
password: password # telemetry service password
prometheus_backup: True # enables/disables prometheus data collection
full_prometheus_backup: False # if is set to False only the /prometheus/wal folder will be downloaded.
backup_threads: 5 # number of telemetry download/upload threads
archive_path: /tmp # local path where the archive files will be temporarily stored
max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
archive_size: 500000 # the size of the prometheus data archive size in KB. The lower the size of archive is
# the higher the number of archive files will be produced and uploaded (and processed by backup_threads
# simultaneously).
# For unstable/slow connection is better to keep this value low
# increasing the number of backup_threads, in this way, on upload failure, the retry will happen only on the
# failed chunk without affecting the whole upload.
logs_backup: True
logs_filter_patterns:
- "(\\w{3}\\s\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2}\\.\\d+).+" # Sep 9 11:20:36.123425532
- "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+" # kinit 2023/09/15 11:20:36 log
- "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+" # 2023-09-15T11:20:36.123425532Z log
oc_cli_path: /usr/bin/oc # optional, if not specified will be search in $PATH
elastic:
elastic_url: "" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
elastic_index: "" # Elastic search index pattern to post results to
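
Each of the three logs_filter_patterns in this file captures a timestamp in its first group. The sketch below shows how the patterns behave on the sample lines from their comments. It assumes they are applied with Python's re module to one log line at a time. The helper name extract_timestamp and the text after each timestamp are illustrative and not part of krkn.

```python
import re

# Patterns copied from logs_filter_patterns after YAML double-quote unescaping
# (each "\\d" in the config file becomes "\d" in the actual regex).
PATTERNS = [
    r"(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+",
    r"kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+",
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def extract_timestamp(line):
    """Return the timestamp captured by the first matching pattern, or None."""
    for pattern in COMPILED:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


print(extract_timestamp("Sep 9 11:20:36.123425532 kubelet started"))
print(extract_timestamp("kinit 2023/09/15 11:20:36 log"))
print(extract_timestamp("2023-09-15T11:20:36.123425532Z log"))
```

A line that matches none of the patterns returns None, which a caller could use to skip lines that carry no recognizable timestamp.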

@@ -1,15 +0,0 @@
---
global:
writeToFile: true
metricsDirectory: collected-metrics
measurements:
- name: podLatency
esIndex: kraken
indexerConfig:
enabled: true
esServers: [http://0.0.0.0:9200] # Please change this to the respective Elasticsearch in use if you haven't run the podman-compose command to setup the infrastructure containers
insecureSkipVerify: true
defaultIndex: kraken
type: elastic

@@ -1,133 +1,126 @@
metrics:
# API server
- query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb!~"WATCH", subresource!="log"}[2m])) by (verb,resource,subresource,instance,le)) > 0
metricName: API99thLatency
- query: sum(irate(apiserver_request_total{apiserver="kube-apiserver",verb!="WATCH",subresource!="log"}[2m])) by (verb,instance,resource,code) > 0
metricName: APIRequestRate
instant: True
- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
metricName: APIInflightRequests
instant: True
- query: histogram_quantile(0.99, rate(apiserver_current_inflight_requests[5m]))
metricName: APIInflightRequests
instant: True
# Container & pod metrics
- query: (sum(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(etcd|oauth-apiserver|.*apiserver|ovn-kubernetes|sdn|ingress|authentication|.*controller-manager|.*scheduler)"}) by (container, pod, namespace, node) and on (node) kube_node_role{role="master"}) > 0
metricName: containerMemory-Masters
instant: true
- query: (sum(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(etcd|oauth-apiserver|sdn|ovn-kubernetes|.*apiserver|authentication|.*controller-manager|.*scheduler)"}[2m]) * 100) by (container, pod, namespace, node) and on (node) kube_node_role{role="master"}) > 0
metricName: containerCPU-Masters
instant: true
- query: (sum(irate(container_cpu_usage_seconds_total{pod!="",container="prometheus",namespace="openshift-monitoring"}[2m]) * 100) by (container, pod, namespace, node) and on (node) kube_node_role{role="infra"}) > 0
metricName: containerCPU-Prometheus
instant: true
- query: (avg(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress)"}[2m]) * 100 and on (node) kube_node_role{role="worker"}) by (namespace, container)) > 0
metricName: containerCPU-AggregatedWorkers
instant: true
- query: (avg(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress|monitoring|image-registry|logging)"}[2m]) * 100 and on (node) kube_node_role{role="infra"}) by (namespace, container)) > 0
metricName: containerCPU-AggregatedInfra
- query: (sum(container_memory_rss{pod!="",namespace="openshift-monitoring",name!="",container="prometheus"}) by (container, pod, namespace, node) and on (node) kube_node_role{role="infra"}) > 0
metricName: containerMemory-Prometheus
instant: True
- query: avg(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress)"} and on (node) kube_node_role{role="worker"}) by (container, namespace)
metricName: containerMemory-AggregatedWorkers
instant: True
- query: avg(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress|monitoring|image-registry|logging)"} and on (node) kube_node_role{role="infra"}) by (container, namespace)
metricName: containerMemory-AggregatedInfra
instant: True
# Node metrics
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) > 0
metricName: nodeCPU-Masters
instant: True
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: maxCPU-Masters
instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Masters
instant: true
- query: (avg((sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))) by (mode)) > 0
metricName: nodeCPU-AggregatedWorkers
instant: True
- query: (avg((sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))) by (mode)) > 0
metricName: nodeCPU-AggregatedInfra
instant: True
- query: avg(node_memory_MemAvailable_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeMemoryAvailable-Masters
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Masters
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: maxMemory-Masters
instant: true
- query: avg(node_memory_MemAvailable_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryAvailable-AggregatedWorkers
instant: True
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: maxCPU-Workers
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: maxMemory-Workers
instant: true
- query: avg(node_memory_MemAvailable_bytes and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryAvailable-AggregatedInfra
instant: True
- query: avg(node_memory_Active_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeMemoryActive-Masters
instant: True
- query: avg(node_memory_Active_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryActive-AggregatedWorkers
instant: True
- query: avg(avg(node_memory_Active_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryActive-AggregatedInfra
- query: avg(node_memory_Cached_bytes) by (instance) + avg(node_memory_Buffers_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeMemoryCached+nodeMemoryBuffers-Masters
- query: avg(node_memory_Cached_bytes + node_memory_Buffers_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryCached+nodeMemoryBuffers-AggregatedWorkers
- query: avg(node_memory_Cached_bytes + node_memory_Buffers_bytes and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemoryCached+nodeMemoryBuffers-AggregatedInfra
- query: irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: rxNetworkBytes-Masters
- query: avg(irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: rxNetworkBytes-AggregatedWorkers
- query: avg(irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: rxNetworkBytes-AggregatedInfra
- query: irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: txNetworkBytes-Masters
- query: avg(irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: txNetworkBytes-AggregatedWorkers
- query: avg(irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: txNetworkBytes-AggregatedInfra
- query: rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeDiskWrittenBytes-Masters
- query: avg(rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskWrittenBytes-AggregatedWorkers
- query: avg(rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskWrittenBytes-AggregatedInfra
- query: rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
metricName: nodeDiskReadBytes-Masters
- query: avg(rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskReadBytes-AggregatedWorkers
- query: avg(rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
metricName: nodeDiskReadBytes-AggregatedInfra
instant: True
# Etcd metrics
- query: sum(rate(etcd_server_leader_changes_seen_total[2m]))
metricName: etcdLeaderChangesRate
instant: True
- query: etcd_server_is_leader > 0
metricName: etcdServerIsLeader
instant: True
- query: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[2m]))
metricName: 99thEtcdDiskBackendCommitDurationSeconds
instant: True
- query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
metricName: 99thEtcdDiskWalFsyncDurationSeconds
instant: True
- query: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
metricName: 99thEtcdRoundTripTimeSeconds
- query: etcd_mvcc_db_total_size_in_bytes
metricName: etcdDBPhysicalSizeBytes
- query: etcd_mvcc_db_total_size_in_use_in_bytes
metricName: etcdDBLogicalSizeBytes
instant: True
- query: sum by (cluster_version)(etcd_cluster_version)
metricName: etcdVersion
@@ -135,83 +128,16 @@ metrics:
- query: sum(rate(etcd_object_counts{}[5m])) by (resource) > 0
metricName: etcdObjectCount
instant: True
- query: histogram_quantile(0.99,sum(rate(etcd_request_duration_seconds_bucket[2m])) by (le,operation,apiserver)) > 0
metricName: P99APIEtcdRequestLatency
- query: sum(grpc_server_started_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})
metricName: ActiveWatchStreams
- query: sum(grpc_server_started_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{namespace="openshift-etcd",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})
metricName: ActiveLeaseStreams
- query: sum(rate(etcd_debugging_snap_save_total_duration_seconds_sum{namespace="openshift-etcd"}[2m]))
metricName: snapshotSaveLatency
- query: sum(rate(etcd_server_heartbeat_send_failures_total{namespace="openshift-etcd"}[2m]))
metricName: HeartBeatFailures
- query: sum(rate(etcd_server_health_failures{namespace="openshift-etcd"}[2m]))
metricName: HealthFailures
- query: sum(rate(etcd_server_slow_apply_total{namespace="openshift-etcd"}[2m]))
metricName: SlowApplies
- query: sum(rate(etcd_server_slow_read_indexes_total{namespace="openshift-etcd"}[2m]))
metricName: SlowIndexRead
- query: sum(etcd_server_proposals_pending)
metricName: PendingProposals
- query: histogram_quantile(1.0, sum(rate(etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket[1m])) by (le, instance))
metricName: CompactionMaxPause
instant: True
- query: sum by (instance) (apiserver_storage_objects)
metricName: etcdTotalObjectCount
instant: True
- query: topk(500, max by(resource) (apiserver_storage_objects))
metricName: etcdTopObectCount
# Cluster metrics
- query: count(kube_namespace_created)
metricName: namespaceCount
- query: sum(kube_pod_status_phase{}) by (phase)
metricName: podStatusCount
- query: count(kube_secret_info{})
metricName: secretCount
- query: count(kube_deployment_labels{})
metricName: deploymentCount
- query: count(kube_configmap_info{})
metricName: configmapCount
- query: count(kube_service_info{})
metricName: serviceCount
- query: kube_node_role
metricName: nodeRoles
instant: true
- query: sum(kube_node_status_condition{status="true"}) by (condition)
metricName: nodeStatus
- query: (sum(rate(container_fs_writes_bytes_total{container!="",device!~".+dm.+"}[5m])) by (device, container, node) and on (node) kube_node_role{role="master"}) > 0
metricName: containerDiskUsage
- query: cluster_version{type="completed"}
metricName: clusterVersion
instant: true
# Golang metrics
- query: go_memstats_heap_alloc_bytes{job=~"apiserver|api|etcd"}
metricName: goHeapAllocBytes
- query: go_memstats_heap_inuse_bytes{job=~"apiserver|api|etcd"}
metricName: goHeapInuseBytes
- query: go_gc_duration_seconds{job=~"apiserver|api|etcd",quantile="1"}
metricName: goGCDurationSeconds
instant: True

248
config/metrics-report.yaml Normal file

@@ -0,0 +1,248 @@
metrics:
# API server
- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
metricName: APIInflightRequests
instant: true
# Kubelet & CRI-O
# Average and max of the CPU usage from all worker's kubelet
- query: avg(avg_over_time(irate(process_cpu_seconds_total{service="kubelet",job="kubelet"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: cpu-kubelet
instant: true
- query: max(max_over_time(irate(process_cpu_seconds_total{service="kubelet",job="kubelet"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: max-cpu-kubelet
instant: true
# Average of the memory usage from all worker's kubelet
- query: avg(avg_over_time(process_resident_memory_bytes{service="kubelet",job="kubelet"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: memory-kubelet
instant: true
# Max of the memory usage from all worker's kubelet
- query: max(max_over_time(process_resident_memory_bytes{service="kubelet",job="kubelet"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: max-memory-kubelet
instant: true
- query: max_over_time(sum(process_resident_memory_bytes{service="kubelet",job="kubelet"} and on (node) kube_node_role{role="worker"})[.elapsed:])
metricName: max-memory-sum-kubelet
instant: true
# Average and max of the CPU usage from all worker's CRI-O
- query: avg(avg_over_time(irate(process_cpu_seconds_total{service="kubelet",job="crio"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: cpu-crio
instant: true
- query: max(max_over_time(irate(process_cpu_seconds_total{service="kubelet",job="crio"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: max-cpu-crio
instant: true
# Average of the memory usage from all worker's CRI-O
- query: avg(avg_over_time(process_resident_memory_bytes{service="kubelet",job="crio"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: memory-crio
instant: true
# Max of the memory usage from all worker's CRI-O
- query: max(max_over_time(process_resident_memory_bytes{service="kubelet",job="crio"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
metricName: max-memory-crio
instant: true
# Etcd
- query: avg(avg_over_time(histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[2m]))[.elapsed:]))
metricName: 99thEtcdDiskBackendCommit
instant: true
- query: avg(avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[.elapsed:]))
metricName: 99thEtcdDiskWalFsync
instant: true
- query: avg(avg_over_time(histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[2m]))[.elapsed:]))
metricName: 99thEtcdRoundTripTime
instant: true
# Control-plane
- query: avg(avg_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-controller-manager"}[2m])) by (pod))[.elapsed:]))
metricName: cpu-kube-controller-manager
instant: true
- query: max(max_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-controller-manager"}[2m])) by (pod))[.elapsed:]))
metricName: max-cpu-kube-controller-manager
instant: true
- query: avg(avg_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-kube-controller-manager"}) by (pod))[.elapsed:]))
metricName: memory-kube-controller-manager
instant: true
- query: max(max_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-kube-controller-manager"}) by (pod))[.elapsed:]))
metricName: max-memory-kube-controller-manager
instant: true
- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-apiserver"}[2m])) by (pod))[.elapsed:]))
metricName: cpu-kube-apiserver
instant: true
- query: avg(avg_over_time(topk(3, sum(container_memory_rss{name!="", namespace="openshift-kube-apiserver"}) by (pod))[.elapsed:]))
metricName: memory-kube-apiserver
instant: true
- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-apiserver"}[2m])) by (pod))[.elapsed:]))
metricName: cpu-openshift-apiserver
instant: true
- query: avg(avg_over_time(topk(3, sum(container_memory_rss{name!="", namespace="openshift-apiserver"}) by (pod))[.elapsed:]))
metricName: memory-openshift-apiserver
instant: true
- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-etcd"}[2m])) by (pod))[.elapsed:]))
metricName: cpu-etcd
instant: true
- query: avg(avg_over_time(topk(3,sum(container_memory_rss{name!="", namespace="openshift-etcd"}) by (pod))[.elapsed:]))
metricName: memory-etcd
instant: true
- query: avg(avg_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-controller-manager"}[2m])) by (pod))[.elapsed:]))
metricName: cpu-openshift-controller-manager
instant: true
- query: avg(avg_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-controller-manager"}) by (pod))[.elapsed:]))
metricName: memory-openshift-controller-manager
instant: true
# multus
- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-multus", pod=~"(multus).+", container!="POD"}[2m])[.elapsed:])) by (container)
metricName: cpu-multus
instant: true
- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-multus", pod=~"(multus).+", container!="POD"}[.elapsed:])) by (container)
metricName: memory-multus
instant: true
# OVNKubernetes - standard & IC
- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ovn-kubernetes", pod=~"(ovnkube-master|ovnkube-control-plane).+", container!="POD"}[2m])[.elapsed:])) by (container)
metricName: cpu-ovn-control-plane
instant: true
- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-ovn-kubernetes", pod=~"(ovnkube-master|ovnkube-control-plane).+", container!="POD"}[.elapsed:])) by (container)
metricName: memory-ovn-control-plane
instant: true
- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ovn-kubernetes", pod=~"ovnkube-node.+", container!="POD"}[2m])[.elapsed:])) by (container)
metricName: cpu-ovnkube-node
instant: true
- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-ovn-kubernetes", pod=~"ovnkube-node.+", container!="POD"}[.elapsed:])) by (container)
metricName: memory-ovnkube-node
instant: true
# Nodes
- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: cpu-masters
instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: memory-masters
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: max-memory-masters
instant: true
- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: cpu-workers
instant: true
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: max-cpu-workers
instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: memory-workers
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: max-memory-workers
instant: true
- query: sum( (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)") )
metricName: memory-sum-workers
instant: true
- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: cpu-infra
instant: true
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
metricName: max-cpu-infra
instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: memory-infra
instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
metricName: max-memory-infra
instant: true
- query: max_over_time(sum((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))[.elapsed:])
metricName: max-memory-sum-infra
instant: true
# Monitoring and ingress
- query: avg(avg_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}[2m])) by (pod)[.elapsed:]))
metricName: cpu-prometheus
instant: true
- query: max(max_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}[2m])) by (pod)[.elapsed:]))
metricName: max-cpu-prometheus
instant: true
- query: avg(avg_over_time(sum(container_memory_rss{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}) by (pod)[.elapsed:]))
metricName: memory-prometheus
instant: true
- query: max(max_over_time(sum(container_memory_rss{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}) by (pod)[.elapsed:]))
metricName: max-memory-prometheus
instant: true
- query: avg(avg_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ingress", pod=~"router-default.+"}[2m])) by (pod)[.elapsed:]))
metricName: cpu-router
instant: true
- query: avg(avg_over_time(sum(container_memory_rss{name!="", namespace="openshift-ingress", pod=~"router-default.+"}) by (pod)[.elapsed:]))
metricName: memory-router
instant: true
# Cluster
- query: avg_over_time(cluster:memory_usage:ratio[.elapsed:])
metricName: memory-cluster-usage-ratio
instant: true
- query: avg_over_time(cluster:node_cpu:ratio[.elapsed:])
metricName: cpu-cluster-usage-ratio
instant: true
# Retain the raw CPU seconds totals for comparison
- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="worker",role!="infra"}, "instance", "$1", "node", "(.+)")) by (mode)
metricName: nodeCPUSeconds-Workers
instant: true
- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (mode)
metricName: nodeCPUSeconds-Masters
instant: true
- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (mode)
metricName: nodeCPUSeconds-Infra
instant: true
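
Many queries in this profile use a `[.elapsed:]` subquery range. The sketch below is a minimal illustration, assuming the `.elapsed` token is replaced with the length of the chaos run before the query is sent to Prometheus. The function name `render_query` and the rounding up to whole minutes are assumptions made for illustration; the diff itself does not show how the token is resolved.

```python
import math


def render_query(query: str, start_ts: float, end_ts: float) -> str:
    """Replace the `.elapsed` token with the run duration in whole minutes.

    Assumption: the duration is rounded up and never drops below one minute,
    so the resulting PromQL range is always valid.
    """
    minutes = max(1, math.ceil((end_ts - start_ts) / 60))
    return query.replace(".elapsed", f"{minutes}m")


# One entry shaped like the items in the profile above.
profile_entry = {
    "query": "avg_over_time(cluster:node_cpu:ratio[.elapsed:])",
    "metricName": "cpu-cluster-usage-ratio",
    "instant": True,
}

# A run lasting 754 seconds rounds up to 13 minutes.
print(render_query(profile_entry["query"], start_ts=0, end_ts=754))
# → avg_over_time(cluster:node_cpu:ratio[13m:])
```

Because every entry is marked `instant: true`, each rendered query would yield one aggregated value per series for the whole run rather than a time series.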


@@ -1,13 +1,7 @@
metrics:
# API server
- query: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver", verb!~"WATCH", subresource!="log"}[2m])) by (verb,resource,subresource,instance,le)) > 0
metricName: API99thLatency
- query: sum(irate(apiserver_request_total{apiserver="kube-apiserver",verb!="WATCH",subresource!="log"}[2m])) by (verb,instance,resource,code) > 0
metricName: APIRequestRate
- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
metricName: APIInflightRequests
- query: irate(apiserver_request_total{verb="POST", resource="pods", subresource="binding",code="201"}[2m]) > 0
metricName: schedulingThroughput
# Containers & pod metrics
- query: sum(irate(container_cpu_usage_seconds_total{name!="",namespace=~"openshift-(etcd|oauth-apiserver|.*apiserver|ovn-kubernetes|sdn|ingress|authentication|.*controller-manager|.*scheduler|monitoring|logging|image-registry)"}[2m]) * 100) by (pod, namespace, node)
@@ -33,8 +27,17 @@ metrics:
metricName: crioMemory
# Node metrics
- query: sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) > 0
metricName: nodeCPU
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) > 0
metricName: nodeCPU-Masters
- query: (avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Masters
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) > 0
metricName: nodeCPU-Workers
- query: (avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[2m:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
metricName: nodeMemory-Workers
- query: avg(node_memory_MemAvailable_bytes) by (instance)
metricName: nodeMemoryAvailable
@@ -42,6 +45,9 @@ metrics:
- query: avg(node_memory_Active_bytes) by (instance)
metricName: nodeMemoryActive
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
metricName: maxMemory-Masters
- query: avg(node_memory_Cached_bytes) by (instance) + avg(node_memory_Buffers_bytes) by (instance)
metricName: nodeMemoryCached+nodeMemoryBuffers
@@ -84,34 +90,4 @@ metrics:
- query: sum by (cluster_version)(etcd_cluster_version)
metricName: etcdVersion
instant: true
# Cluster metrics
- query: count(kube_namespace_created)
metricName: namespaceCount
- query: sum(kube_pod_status_phase{}) by (phase)
metricName: podStatusCount
- query: count(kube_secret_info{})
metricName: secretCount
- query: count(kube_deployment_labels{})
metricName: deploymentCount
- query: count(kube_configmap_info{})
metricName: configmapCount
- query: count(kube_service_info{})
metricName: serviceCount
- query: kube_node_role
metricName: nodeRoles
instant: true
- query: sum(kube_node_status_condition{status="true"}) by (condition)
metricName: nodeStatus
- query: cluster_version{type="completed"}
metricName: clusterVersion
instant: true
instant: true


@@ -0,0 +1,35 @@
application: openshift-etcd
namespaces: openshift-etcd
labels: app=openshift-etcd
kubeconfig: ~/.kube/config.yaml
prometheus_endpoint: <Prometheus_Endpoint>
auth_token: <Auth_Token>
scrape_duration: 10m
chaos_library: "kraken"
log_level: INFO
json_output_file: False
json_output_folder_path:
# for output purpose only do not change if not needed
chaos_tests:
GENERIC:
- pod_failure
- container_failure
- node_failure
- zone_outage
- time_skew
- namespace_failure
- power_outage
CPU:
- node_cpu_hog
NETWORK:
- application_outage
- node_network_chaos
- pod_network_chaos
MEM:
- node_memory_hog
- pvc_disk_fill
threshold: .7
cpu_threshold: .5
mem_threshold: .5


@@ -1,30 +0,0 @@
# Dockerfile for kraken
FROM mcr.microsoft.com/azure-cli:latest as azure-cli
FROM registry.access.redhat.com/ubi8/ubi:latest
LABEL org.opencontainers.image.authors="Red Hat OpenShift Chaos Engineering"
ENV KUBECONFIG /root/.kube/config
# Copy azure client binary from azure-cli image
COPY --from=azure-cli /usr/local/bin/az /usr/bin/az
# Install dependencies
RUN yum install -y git python39 python3-pip jq gettext wget && \
python3.9 -m pip install -U pip && \
git clone https://github.com/redhat-chaos/krkn.git --branch v1.4.7 /root/kraken && \
mkdir -p /root/.kube && cd /root/kraken && \
pip3.9 install -r requirements.txt && \
pip3.9 install virtualenv && \
wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq && chmod +x /usr/bin/yq
# Get Kubernetes and OpenShift clients from stable releases
WORKDIR /tmp
RUN wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz && tar -xvf openshift-client-linux.tar.gz && cp oc /usr/local/bin/oc && cp kubectl /usr/local/bin/kubectl
WORKDIR /root/kraken
ENTRYPOINT ["python3.9", "run_kraken.py"]
CMD ["--config=config/config.yaml"]


@@ -1,29 +0,0 @@
# Dockerfile for kraken
FROM ppc64le/centos:8
FROM mcr.microsoft.com/azure-cli:latest as azure-cli
LABEL org.opencontainers.image.authors="Red Hat OpenShift Chaos Engineering"
ENV KUBECONFIG /root/.kube/config
# Copy azure client binary from azure-cli image
COPY --from=azure-cli /usr/local/bin/az /usr/bin/az
# Install dependencies
RUN yum install -y git python39 python3-pip jq gettext wget && \
python3.9 -m pip install -U pip && \
git clone https://github.com/redhat-chaos/krkn.git --branch v1.4.7 /root/kraken && \
mkdir -p /root/.kube && cd /root/kraken && \
pip3.9 install -r requirements.txt && \
pip3.9 install virtualenv && \
wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq && chmod +x /usr/bin/yq
# Get Kubernetes and OpenShift clients from stable releases
WORKDIR /tmp
RUN wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz && tar -xvf openshift-client-linux.tar.gz && cp oc /usr/local/bin/oc && cp kubectl /usr/local/bin/kubectl
WORKDIR /root/kraken
ENTRYPOINT python3.9 run_kraken.py --config=config/config.yaml


@@ -0,0 +1,90 @@
# oc build
FROM golang:1.24.9 AS oc-build
RUN apt-get update && apt-get install -y --no-install-recommends libkrb5-dev
WORKDIR /tmp
# oc build
RUN git clone --branch release-4.18 https://github.com/openshift/oc.git
WORKDIR /tmp/oc
RUN go mod edit -go 1.24.9 &&\
go mod edit -require github.com/moby/buildkit@v0.12.5 &&\
go mod edit -require github.com/containerd/containerd@v1.7.29&&\
go mod edit -require github.com/docker/docker@v27.5.1+incompatible&&\
go mod edit -require github.com/opencontainers/runc@v1.2.8&&\
go mod edit -require github.com/go-git/go-git/v5@v5.13.0&&\
go mod edit -require github.com/opencontainers/selinux@v1.13.0&&\
go mod edit -require github.com/ulikunitz/xz@v0.5.15&&\
go mod edit -require golang.org/x/net@v0.38.0&&\
go mod edit -require github.com/containerd/containerd@v1.7.27&&\
go mod edit -require golang.org/x/oauth2@v0.27.0&&\
go mod edit -require golang.org/x/crypto@v0.35.0&&\
go mod edit -replace github.com/containerd/containerd@v1.7.27=github.com/containerd/containerd@v1.7.29&&\
go mod tidy && go mod vendor
RUN make GO_REQUIRED_MIN_VERSION:= oc
# virtctl build
WORKDIR /tmp
RUN git clone https://github.com/kubevirt/kubevirt.git
WORKDIR /tmp/kubevirt
RUN go mod edit -go 1.24.9 &&\
go work use &&\
go build -o virtctl ./cmd/virtctl/
FROM fedora:40
ARG PR_NUMBER
ARG TAG
RUN groupadd -g 1001 krkn && useradd -m -u 1001 -g krkn krkn
RUN dnf update -y
ENV KUBECONFIG /home/krkn/.kube/config
# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
RUN dnf update && dnf install -y --setopt=install_weak_deps=False \
git python3.11 jq yq gettext wget which ipmitool openssh-server &&\
dnf clean all
# copy oc client binary from oc-build image
COPY --from=oc-build /tmp/oc/oc /usr/bin/oc
COPY --from=oc-build /tmp/kubevirt/virtctl /usr/bin/virtctl
# krkn build
RUN git clone https://github.com/krkn-chaos/krkn.git /home/krkn/kraken && \
mkdir -p /home/krkn/.kube
RUN mkdir -p /home/krkn/.ssh && \
chmod 700 /home/krkn/.ssh
WORKDIR /home/krkn/kraken
# default behaviour will be to build main
# if it is a PR trigger the PR itself will be checked out
RUN if [ -n "$PR_NUMBER" ]; then git fetch origin pull/${PR_NUMBER}/head:pr-${PR_NUMBER} && git checkout pr-${PR_NUMBER};fi
# if it is a TAG trigger checkout the tag
RUN if [ -n "$TAG" ]; then git checkout "$TAG";fi
RUN python3.11 -m ensurepip --upgrade --default-pip
RUN python3.11 -m pip install --upgrade pip setuptools==78.1.1
# removes the vulnerable versions of setuptools and pip
RUN rm -rf "$(pip cache dir)"
RUN rm -rf /tmp/*
RUN rm -rf /usr/local/lib/python3.11/ensurepip/_bundled
RUN pip3.11 install -r requirements.txt
RUN pip3.11 install jsonschema
LABEL krknctl.title.global="Krkn Base Image"
LABEL krknctl.description.global="This is the krkn base image."
LABEL krknctl.input_fields.global='$KRKNCTL_INPUT'
# SSH setup script
RUN chmod +x /home/krkn/kraken/containers/setup-ssh.sh
# Main entrypoint script
RUN chmod +x /home/krkn/kraken/containers/entrypoint.sh
RUN chown -R krkn:krkn /home/krkn && chmod 755 /home/krkn
USER krkn
ENTRYPOINT ["/bin/bash", "/home/krkn/kraken/containers/entrypoint.sh"]
CMD ["--config=config/config.yaml"]
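
The two conditional `RUN` steps in the template above decide which revision ends up in the image. The sketch below mirrors that decision, assuming `PR_NUMBER` and `TAG` arrive as build arguments and are empty when unset. The helper name `checkout_commands` and the example values are hypothetical.

```python
def checkout_commands(pr_number: str = "", tag: str = "") -> list[str]:
    """Return the git commands the conditional RUN steps would execute.

    With neither argument set, nothing runs and the clone stays on main.
    The two checks are independent, as in the Dockerfile, so if both are
    set the tag checkout runs after the PR checkout.
    """
    commands = []
    if pr_number:
        commands.append(f"git fetch origin pull/{pr_number}/head:pr-{pr_number}")
        commands.append(f"git checkout pr-{pr_number}")
    if tag:
        commands.append(f"git checkout {tag}")
    return commands


print(checkout_commands())                      # → []
print(checkout_commands(tag="example-tag"))     # → ['git checkout example-tag']
```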


@@ -1,53 +1,14 @@
### Kraken image
Container image gets automatically built by quay.io at [Kraken image](https://quay.io/redhat-chaos/krkn).
### Run containerized version
Refer [instructions](https://github.com/redhat-chaos/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.
Refer [instructions](https://krkn-chaos.dev/docs/installation/) for information on how to run the containerized version of kraken.
### Run Custom Kraken Image
Refer to [instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.
### Kraken as a KubeApp
#### GENERAL NOTES:
- Running Kraken inside the cluster is generally not recommended, because the pod running Kraken might itself get disrupted. The suggested use case for running Kraken from inside k8s/OpenShift is to target **another** cluster (e.g. to bypass network restrictions or to leverage the cluster's computational resources).
- Your kubeconfig might contain several cluster contexts and credentials, so before creating the ConfigMap be sure to keep **only** the credentials related to the destination cluster. Please refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) for more details.
- To add privileges to the service account, you must be logged in to the cluster with a highly privileged account (ideally kubeadmin).
To run containerized Kraken as a Kubernetes/OpenShift Deployment, follow these steps:
1. Configure the [config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) file according to your requirements.
**NOTE**: both scenarios ConfigMaps are needed regardless of whether you're running Kraken in Kubernetes or OpenShift.
2. Create a namespace under which you want to run the kraken pod using `kubectl create ns <namespace>`.
3. Switch to `<namespace>` namespace:
- In Kubernetes, use `kubectl config set-context --current --namespace=<namespace>`
- In OpenShift, use `oc project <namespace>`
4. Create a ConfigMap named kube-config using `kubectl create configmap kube-config --from-file=<path_to_kubeconfig>` *(eg. ~/.kube/config)*
5. Create a ConfigMap named kraken-config using `kubectl create configmap kraken-config --from-file=<path_to_kraken>/config`
6. Create a ConfigMap named scenarios-config using `kubectl create configmap scenarios-config --from-file=<path_to_kraken>/scenarios`
7. Create a ConfigMap named scenarios-openshift-config using `kubectl create configmap scenarios-openshift-config --from-file=<path_to_kraken>/scenarios/openshift`
8. Create a ConfigMap named scenarios-kube-config using `kubectl create configmap scenarios-kube-config --from-file=<path_to_kraken>/scenarios/kube`
9. Create a service account to run the kraken pod `kubectl create serviceaccount useroot`.
10. In OpenShift, add privileges to the service account by executing `oc adm policy add-scc-to-user privileged -z useroot`.
11. Create a Job using `kubectl apply -f <path_to_kraken>/containers/kraken.yml` and monitor the status using `oc get jobs` and `oc get pods`.


@@ -1,13 +1,13 @@
# Building your own Kraken image
1. Git clone the Kraken repository using `git clone https://github.com/openshift-scale/kraken.git`.
1. Git clone the Kraken repository using `git clone https://github.com/redhat-chaos/krkn.git`.
2. Modify the python code and yaml files to address your needs.
3. Execute `podman build -t <new_image_name>:latest .` in the containers directory within kraken to build an image from a Dockerfile.
4. Execute `podman run --detach --name <container_name> <new_image_name>:latest` to start a container based on your new image.
# Building the Kraken image on IBM Power (ppc64le)
1. Git clone the Kraken repository using `git clone https://github.com/cloud-bulldozer/kraken.git` on an IBM Power Systems server.
1. Git clone the Kraken repository using `git clone https://github.com/redhat-chaos/krkn.git` on an IBM Power Systems server.
2. Modify the python code and yaml files to address your needs.
3. Execute `podman build -t <new_image_name>:latest -f Dockerfile-ppc64le` in the containers directory within kraken to build an image from the Dockerfile for Power.
4. Execute `podman run --detach --name <container_name> <new_image_name>:latest` to start a container based on your new image.


@@ -0,0 +1,5 @@
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
export KRKNCTL_INPUT=$(cat krknctl-input.json|tr -d "\n")
envsubst '${KRKNCTL_INPUT}' < Dockerfile.template > Dockerfile
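
The script above flattens `krknctl-input.json` to a single line and lets `envsubst` replace only the `KRKNCTL_INPUT` reference in `Dockerfile.template`. The Python sketch below reproduces that behaviour under stated assumptions. The function name `render_template` is hypothetical, and the regular expression only approximates `envsubst`'s handling of `$VAR` and `${VAR}`.

```python
import re

# Matches ${KRKNCTL_INPUT} or $KRKNCTL_INPUT, but not longer names such as
# $KRKNCTL_INPUT_EXTRA, which envsubst would treat as a different variable.
_PLACEHOLDER = re.compile(r"\$(?:\{KRKNCTL_INPUT\}|KRKNCTL_INPUT\b)")


def render_template(template: str, krknctl_input_json: str) -> str:
    """Substitute the flattened JSON for the KRKNCTL_INPUT placeholder only."""
    flattened = krknctl_input_json.replace("\n", "")  # same effect as tr -d "\n"
    # A callable replacement keeps backslashes in the JSON (e.g. "\/") literal.
    return _PLACEHOLDER.sub(lambda _match: flattened, template)


template = "LABEL x='$KRKNCTL_INPUT'\nENV KUBECONFIG $HOME/.kube/config"
print(render_template(template, '[\n{"name": "uuid"}\n]'))
# → LABEL x='[{"name": "uuid"}]'
# → ENV KUBECONFIG $HOME/.kube/config
```

Restricting the substitution to one variable matters here: other `$`-prefixed references in the template, such as `$PR_NUMBER` and `$TAG`, must survive unchanged so that they are resolved at image build time.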

7
containers/entrypoint.sh Normal file

@@ -0,0 +1,7 @@
#!/bin/bash
# Run SSH setup
./containers/setup-ssh.sh
# Change to kraken directory
# Execute the main command
exec python3.11 run_kraken.py "$@"


@@ -1,49 +0,0 @@
---
apiVersion: batch/v1
kind: Job
metadata:
name: kraken
spec:
parallelism: 1
completions: 1
template:
metadata:
labels:
tool: Kraken
spec:
serviceAccountName: useroot
containers:
- name: kraken
securityContext:
privileged: true
image: quay.io/redhat-chaos/krkn
command: ["/bin/sh", "-c"]
args: ["python3.9 run_kraken.py -c config/config.yaml"]
volumeMounts:
- mountPath: "/root/.kube"
name: config
- mountPath: "/root/kraken/config"
name: kraken-config
- mountPath: "/root/kraken/scenarios"
name: scenarios-config
- mountPath: "/root/kraken/scenarios/openshift"
name: scenarios-openshift-config
- mountPath: "/root/kraken/scenarios/kube"
name: scenarios-kube-config
restartPolicy: Never
volumes:
- name: config
configMap:
name: kube-config
- name: kraken-config
configMap:
name: kraken-config
- name: scenarios-config
configMap:
name: scenarios-config
- name: scenarios-openshift-config
configMap:
name: scenarios-openshift-config
- name: scenarios-kube-config
configMap:
name: scenarios-kube-config


@@ -0,0 +1,553 @@
[
{
"name": "cerberus-enabled",
"short_description": "Enable Cerberus",
"description": "Enables Cerberus Support",
"variable": "CERBERUS_ENABLED",
"type": "enum",
"default": "False",
"allowed_values": "True,False",
"separator": ",",
"required": "false"
},
{
"name": "cerberus-url",
"short_description": "Cerberus URL",
"description": "Cerberus http url",
"variable": "CERBERUS_URL",
"type": "string",
"default": "http://0.0.0.0:8080",
"validator": "^(http|https):\/\/.*",
"required": "false"
},
{
"name": "distribution",
"short_description": "Orchestrator distribution",
"description": "Selects the orchestrator distribution",
"variable": "DISTRIBUTION",
"type": "enum",
"default": "openshift",
"allowed_values": "openshift,kubernetes",
"separator": ",",
"required": "false"
},
{
"name": "ssh-public-key",
"short_description": "Krkn ssh public key path",
"description": "Sets the path where krkn will search for ssh public key (in container)",
"variable": "KRKN_SSH_PUBLIC",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "ssh-private-key",
"short_description": "Krkn ssh private key path",
"description": "Sets the path where krkn will search for ssh private key (in container)",
"variable": "KRKN_SSH_PRIVATE",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "krkn-kubeconfig",
"short_description": "Krkn kubeconfig path",
"description": "Sets the path where krkn will search for kubeconfig (in container)",
"variable": "KRKN_KUBE_CONFIG",
"type": "string",
"default": "/home/krkn/.kube/config",
"required": "false"
},
{
"name": "wait-duration",
"short_description": "Post chaos wait duration",
"description": "waits for a certain amount of time after the scenario",
"variable": "WAIT_DURATION",
"type": "number",
"default": "1"
},
{
"name": "iterations",
"short_description": "Chaos scenario iterations",
"description": "number of times the same chaos scenario will be executed",
"variable": "ITERATIONS",
"type": "number",
"default": "1"
},
{
"name": "daemon-mode",
"short_description": "Sets krkn daemon mode",
"description": "if set the scenario will execute forever",
"variable": "DAEMON_MODE",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "prometheus-url",
"short_description": "Prometheus url",
"description": "Prometheus URL to use when running on Kubernetes",
"variable": "PROMETHEUS_URL",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "prometheus-token",
"short_description": "Prometheus bearer token",
"description": "Prometheus bearer token for prometheus url authentication",
"variable": "PROMETHEUS_TOKEN",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "uuid",
"short_description": "Sets krkn run uuid",
"description": "sets krkn run uuid instead of generating it",
"variable": "UUID",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "capture-metrics",
"short_description": "Enables metrics capture",
"description": "Enables metrics capture",
"variable": "CAPTURE_METRICS",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "enable-alerts",
"short_description": "Enables cluster alerts check",
"description": "Enables cluster alerts check",
"variable": "ENABLE_ALERTS",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "alerts-path",
"short_description": "Cluster alerts path file (in container)",
"description": "Allows to specify a different alert file path",
"variable": "ALERTS_PATH",
"type": "string",
"default": "config/alerts.yaml",
"required": "false"
},
{
"name": "metrics-path",
"short_description": "Cluster metrics path file (in container)",
"description": "Allows to specify a different metrics file path",
"variable": "METRICS_PATH",
"type": "string",
"default": "config/metrics-aggregated.yaml",
"required": "false"
},
{
"name": "enable-es",
"short_description": "Enables elastic search data collection",
"description": "Enables elastic search data collection",
"variable": "ENABLE_ES",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "es-server",
"short_description": "Elasticsearch instance URL",
"description": "Elasticsearch instance URL",
"variable": "ES_SERVER",
"type": "string",
"default": "http://0.0.0.0",
"required": "false"
},
{
"name": "es-port",
"short_description": "Elasticsearch instance port",
"description": "Elasticsearch instance port",
"variable": "ES_PORT",
"type": "number",
"default": "443",
"required": "false"
},
{
"name": "es-username",
"short_description": "Elasticsearch instance username",
"description": "Elasticsearch instance username",
"variable": "ES_USERNAME",
"type": "string",
"default": "elastic",
"required": "false"
},
{
"name": "es-password",
"short_description": "Elasticsearch instance password",
"description": "Elasticsearch instance password",
"variable": "ES_PASSWORD",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "es-verify-certs",
"short_description": "Enables elasticsearch TLS certificate verification",
"description": "Enables elasticsearch TLS certificate verification",
"variable": "ES_VERIFY_CERTS",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "es-metrics-index",
"short_description": "Elasticsearch metrics index",
"description": "Index name for metrics in Elasticsearch",
"variable": "ES_METRICS_INDEX",
"type": "string",
"default": "krkn-metrics",
"required": "false"
},
{
"name": "es-alerts-index",
"short_description": "Elasticsearch alerts index",
"description": "Index name for alerts in Elasticsearch",
"variable": "ES_ALERTS_INDEX",
"type": "string",
"default": "krkn-alerts",
"required": "false"
},
{
"name": "es-telemetry-index",
"short_description": "Elasticsearch telemetry index",
"description": "Index name for telemetry in Elasticsearch",
"variable": "ES_TELEMETRY_INDEX",
"type": "string",
"default": "krkn-telemetry",
"required": "false"
},
{
"name": "check-critical-alerts",
"short_description": "Check critical alerts",
"description": "Enables checking for critical alerts",
"variable": "CHECK_CRITICAL_ALERTS",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "telemetry-enabled",
"short_description": "Enable telemetry",
"description": "Enables telemetry support",
"variable": "TELEMETRY_ENABLED",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "telemetry-api-url",
"short_description": "Telemetry API URL",
"description": "API endpoint for telemetry data",
"variable": "TELEMETRY_API_URL",
"type": "string",
"default": "https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production",
"validator": "^(http|https):\/\/.*",
"required": "false"
},
{
"name": "telemetry-username",
"short_description": "Telemetry username",
"description": "Username for telemetry authentication",
"variable": "TELEMETRY_USERNAME",
"type": "string",
"default": "redhat-chaos",
"required": "false"
},
{
"name": "telemetry-password",
"short_description": "Telemetry password",
"description": "Password for telemetry authentication",
"variable": "TELEMETRY_PASSWORD",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "telemetry-prometheus-backup",
"short_description": "Prometheus backup for telemetry",
"description": "Enables Prometheus backup for telemetry",
"variable": "TELEMETRY_PROMETHEUS_BACKUP",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "True",
"required": "false"
},
{
"name": "telemetry-full-prometheus-backup",
"short_description": "Full Prometheus backup",
"description": "Enables full Prometheus backup for telemetry",
"variable": "TELEMETRY_FULL_PROMETHEUS_BACKUP",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "telemetry-backup-threads",
"short_description": "Telemetry backup threads",
"description": "Number of threads for telemetry backup",
"variable": "TELEMETRY_BACKUP_THREADS",
"type": "number",
"default": "5",
"required": "false"
},
{
"name": "telemetry-archive-path",
"short_description": "Telemetry archive path",
"description": "Path to save telemetry archive",
"variable": "TELEMETRY_ARCHIVE_PATH",
"type": "string",
"default": "/tmp",
"required": "false"
},
{
"name": "telemetry-max-retries",
"short_description": "Telemetry max retries",
"description": "Maximum retries for telemetry operations",
"variable": "TELEMETRY_MAX_RETRIES",
"type": "number",
"default": "0",
"required": "false"
},
{
"name": "telemetry-run-tag",
"short_description": "Telemetry run tag",
"description": "Tag for telemetry run",
"variable": "TELEMETRY_RUN_TAG",
"type": "string",
"default": "chaos",
"required": "false"
},
{
"name": "telemetry-group",
"short_description": "Telemetry group",
"description": "Group name for telemetry data",
"variable": "TELEMETRY_GROUP",
"type": "string",
"default": "default",
"required": "false"
},
{
"name": "telemetry-archive-size",
"short_description": "Telemetry archive size",
"description": "Maximum size for telemetry archives",
"variable": "TELEMETRY_ARCHIVE_SIZE",
"type": "number",
"default": "1000",
"required": "false"
},
{
"name": "telemetry-logs-backup",
"short_description": "Telemetry logs backup",
"description": "Enables logs backup for telemetry",
"variable": "TELEMETRY_LOGS_BACKUP",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "telemetry-filter-pattern",
"short_description": "Telemetry filter pattern",
"description": "Filter pattern for telemetry logs",
"variable": "TELEMETRY_FILTER_PATTERN",
"type": "string",
"default": "[\"(\\\\w{3}\\\\s\\\\d{1,2}\\\\s\\\\d{2}:\\\\d{2}:\\\\d{2}\\\\.\\\\d+).+\",\"kinit (\\\\d+/\\\\d+/\\\\d+\\\\s\\\\d{2}:\\\\d{2}:\\\\d{2})\\\\s+\",\"(\\\\d{4}-\\\\d{2}-\\\\d{2}T\\\\d{2}:\\\\d{2}:\\\\d{2}\\\\.\\\\d+Z).+\"]",
"required": "false"
},
{
"name": "telemetry-cli-path",
"short_description": "Telemetry CLI path (oc)",
"description": "Path to telemetry CLI tool (oc)",
"variable": "TELEMETRY_CLI_PATH",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "telemetry-events-backup",
"short_description": "Telemetry events backup",
"description": "Enables events backup for telemetry",
"variable": "TELEMETRY_EVENTS_BACKUP",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "True",
"required": "false"
},
{
"name": "health-check-interval",
"short_description": "Heath check interval",
"description": "How often to check the health check urls",
"variable": "HEALTH_CHECK_INTERVAL",
"type": "number",
"default": "2",
"required": "false"
},
{
"name": "health-check-url",
"short_description": "Health check url",
"description": "Url to check the health of",
"variable": "HEALTH_CHECK_URL",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-auth",
"short_description": "Health check authentication tuple",
"description": "Authentication tuple to authenticate into health check URL",
"variable": "HEALTH_CHECK_AUTH",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-bearer-token",
"short_description": "Health check bearer token",
"description": "Bearer token to authenticate into health check URL",
"variable": "HEALTH_CHECK_BEARER_TOKEN",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "health-check-exit",
"short_description": "Health check exit on failure",
"description": "Exit on failure when health check URL is not able to connect",
"variable": "HEALTH_CHECK_EXIT_ON_FAILURE",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "health-check-verify",
"short_description": "SSL Verification of health check url",
"description": "SSL Verification to authenticate into health check URL",
"variable": "HEALTH_CHECK_VERIFY",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "kubevirt-check-interval",
"short_description": "Kube Virt check interval",
"description": "How often to check the kube virt check Vms ssh status",
"variable": "KUBE_VIRT_CHECK_INTERVAL",
"type": "number",
"default": "2",
"required": "false"
},
{
"name": "kubevirt-namespace",
"short_description": "KubeVirt namespace to check",
"description": "KubeVirt namespace to check the health of",
"variable": "KUBE_VIRT_NAMESPACE",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "kubevirt-name",
"short_description": "KubeVirt regex names to watch",
"description": "KubeVirt regex names to check VMs",
"variable": "KUBE_VIRT_NAME",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "kubevirt-only-failures",
"short_description": "KubeVirt checks only report if failure occurs",
"description": "KubeVirt checks only report if failure occurs",
"variable": "KUBE_VIRT_FAILURES",
"type": "enum",
"allowed_values": "True,False,true,false",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "kubevirt-disconnected",
"short_description": "KubeVirt checks in disconnected mode",
"description": "KubeVirt checks in disconnected mode, bypassing the clusters Api",
"variable": "KUBE_VIRT_DISCONNECTED",
"type": "enum",
"allowed_values": "True,False,true,false",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "kubevirt-ssh-node",
"short_description": "KubeVirt node to ssh from",
"description": "KubeVirt node to ssh from, should be available whole chaos run",
"variable": "KUBE_VIRT_SSH_NODE",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "kubevirt-exit-on-failure",
"short_description": "KubeVirt fail if failed vms at end of run",
"description": "KubeVirt fails run if vms still have false status",
"variable": "KUBE_VIRT_EXIT_ON_FAIL",
"type": "enum",
"allowed_values": "True,False,true,false",
"separator": ",",
"default": "False",
"required": "false"
},
{
"name": "kubevirt-node-node",
"short_description": "KubeVirt node to filter vms on",
"description": "Only track VMs in KubeVirt on given node name",
"variable": "KUBE_VIRT_NODE_NAME",
"type": "string",
"default": "",
"required": "false"
},
{
"name": "krkn-debug",
"short_description": "Krkn debug mode",
"description": "Enables debug mode for Krkn",
"variable": "KRKN_DEBUG",
"type": "enum",
"allowed_values": "True,False",
"separator": ",",
"default": "False",
"required": "false"
}
]
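
The entries above share a small schema: a `type` of `string`, `number`, or `enum`, with `enum` entries carrying `allowed_values` joined by a `separator`, and some `string` entries carrying a `validator` regex. A minimal sketch of how a consumer might check a user-supplied value against one entry is shown below. The function name `validate_value` and its rules are assumptions made for illustration; they are not taken from krkn or from the tool that actually reads this file.

```python
import re


def validate_value(param: dict, value: str) -> bool:
    """Return True if `value` is acceptable for the parameter entry `param`.

    Hypothetical helper: the real consumer of this file may apply
    different or stricter rules.
    """
    kind = param.get("type")
    if kind == "enum":
        # allowed_values is a single string, split on the entry's separator
        allowed = param["allowed_values"].split(param.get("separator", ","))
        return value in allowed
    if kind == "number":
        try:
            float(value)
        except ValueError:
            return False
        return True
    if kind == "string":
        pattern = param.get("validator")
        # entries without a validator accept any string
        return pattern is None or re.match(pattern, value) is not None
    return False


# Two entries reduced to the fields the helper reads
daemon_mode = {
    "name": "daemon-mode",
    "type": "enum",
    "allowed_values": "True,False",
    "separator": ",",
}
telemetry_api_url = {
    "name": "telemetry-api-url",
    "type": "string",
    "validator": r"^(http|https)://.*",
}
```

Under these assumed rules the enum match is case-sensitive, so `true` would be rejected for `daemon-mode`. That may be why the `kubevirt-*` enum entries list `True,False,true,false` explicitly.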

Some files were not shown because too many files have changed in this diff