* Add PVC outage scenario plugin to manage PVC annotations during outages
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Remove PvcOutageScenarioPlugin as it is no longer needed; refactor PvcScenarioPlugin to include rollback functionality for temporary file cleanup during PVC scenarios.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor rollback_data handling in PvcScenarioPlugin to use str() instead of json.dumps() for resource_identifier.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Import json module in PvcScenarioPlugin for decoding rollback data from resource_identifier.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* feat: Encode rollback data in base64 format for resource_identifier in PvcScenarioPlugin to enhance data handling and security.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
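The str() → json.dumps() → base64 progression in the commits above boils down to an encode/decode round trip. A minimal sketch of that round trip (function names are hypothetical, not the plugin's actual code), assuming the rollback data is a JSON-serializable dict packed into the resource_identifier string:

```python
import base64
import json


def encode_rollback_data(rollback_data: dict) -> str:
    """Pack rollback data into an opaque string usable as a resource_identifier."""
    raw = json.dumps(rollback_data).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_rollback_data(resource_identifier: str) -> dict:
    """Recover the original rollback data from the base64 string."""
    raw = base64.b64decode(resource_identifier.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# round trip: what goes in comes back out
data = {"pod_name": "demo-pod", "temp_file": "/mnt/pvc/kraken.tmp"}
token = encode_rollback_data(data)
assert decode_rollback_data(token) == data
```

The base64 layer keeps the identifier a single opaque token, avoiding quoting issues that raw JSON in a string field can cause.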
* feat: refactor: Update logging level from debug to info for temp file operations in PvcScenarioPlugin to improve visibility of command execution.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add unit tests for PvcScenarioPlugin methods and enhance test coverage
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add test coverage for missed lines
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor tests in test_pvc_scenario_plugin.py to use unittest framework and enhance test coverage for to_kbytes method
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Enhance rollback_temp_file test to verify logging of errors for invalid data
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor tests in TestPvcScenarioPluginRun to clarify pod_name behavior and enhance logging verification in rollback_temp_file tests
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactored imports
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor assertions in test cases to use assertEqual for consistency
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
* Add rollback functionality to ServiceHijackingScenarioPlugin
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor rollback data handling in ServiceHijackingScenarioPlugin as json string
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Update rollback data handling in ServiceHijackingScenarioPlugin to decode directly from resource_identifier
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Add import statement for JSON handling in ServiceHijackingScenarioPlugin
This change introduces an import statement for the JSON module to facilitate the decoding of rollback data from the resource identifier.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* feat: Enhance rollback data handling in ServiceHijackingScenarioPlugin by encoding and decoding as base64 strings.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add rollback tests for ServiceHijackingScenarioPlugin
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor rollback tests for ServiceHijackingScenarioPlugin to improve error logging and remove temporary path dependency
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Remove redundant import of yaml in test_service_hijacking_scenario_plugin.py
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor rollback tests for ServiceHijackingScenarioPlugin to enhance readability and consistency
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
* Kubevirt VM outage tests with improved mocking and validation scenarios at test_kubevirt_vm_outage.py
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor Kubevirt VM outage tests to improve time mocking and response handling
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Remove unused subproject reference for pvc_outage
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor Kubevirt VM outage tests to enhance time mocking and improve response handling
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Enhance VMI deletion test by mocking unchanged creationTimestamp to exercise timeout path
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor Kubevirt VM outage tests to use dynamic timestamps and improve mock handling
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
Functional & Unit Tests / Functional & Unit Tests (push) Failing after 10m3s
Functional & Unit Tests / Generate Coverage Badge (push) Has been skipped
* Add rollback functionality to SynFloodScenarioPlugin
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor rollback pod handling in SynFloodScenarioPlugin to handle podnames as string
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Update resource identifier handling in SynFloodScenarioPlugin to use list format for rollback functionality
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor chaos scenario configurations in config.yaml to comment out existing scenarios for clarity. Update rollback method in SynFloodScenarioPlugin to improve pod cleanup handling. Modify pvc_scenario.yaml with specific test values for better usability.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Enhance rollback functionality in SynFloodScenarioPlugin by encoding pod names in base64 format for improved data handling.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add unit tests for SynFloodScenarioPlugin methods and rollback functionality
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor TestSynFloodRun and TestRollbackSynFloodPods to inherit from unittest.TestCase
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor SynFloodRun tests to use tempfile for scenario file creation and improve error logging in rollback functionality
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Validate version file format
* Add validation for context dir, execute all files by default
* Consolidate execute and cleanup, rename with .executed instead of
removing
* Respect auto_rollback config
* Add cleanup back, but only for successful scenarios
---------
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
* feat: Add exclude_label feature to pod network outage scenarios
This feature enables filtering out specific pods from network outage
chaos testing based on label selectors. Users can now target all pods
in a namespace except critical ones by specifying exclude_label.
- Added exclude_label parameter to list_pods() function
- Updated get_test_pods() to pass the exclude parameter
- Added exclude_label field to all relevant plugin classes
- Updated schema.json with the new parameter
- Added documentation and examples
- Created comprehensive unit tests
Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
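The exclude_label semantics described above can be sketched without a cluster. A minimal stand-in (helper names are hypothetical; the real code passes selectors through to the Kubernetes API) in which pods matching exclude_label are dropped from the candidate set:

```python
def matches(labels: dict, selector: str) -> bool:
    """True if a label dict satisfies a 'key=value' equality selector."""
    key, _, value = selector.partition("=")
    return labels.get(key) == value


def list_pods(pods: dict, label_selector: str = None, exclude_label: str = None):
    """pods: mapping of pod name -> label dict (stand-in for an API response)."""
    names = []
    for name, labels in pods.items():
        if label_selector and not matches(labels, label_selector):
            continue  # pod does not match the target selector
        if exclude_label and matches(labels, exclude_label):
            continue  # pod carries the excluded label: protect it from chaos
        names.append(name)
    return names


pods = {
    "web-1": {"app": "web"},
    "web-2": {"app": "web", "tier": "critical"},
    "db-1": {"app": "db"},
}
assert list_pods(pods, label_selector="app=web") == ["web-1", "web-2"]
assert list_pods(pods, exclude_label="tier=critical") == ["web-1", "db-1"]
```

This is the "all pods in a namespace except critical ones" behavior: the target selector widens the set, the exclude label carves protected pods back out.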
* krkn-lib update
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
* removed plugin schema
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
---------
Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
* Adding node_label_selector for pod scenarios
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* using kubernetes function, adding node_name and removing extra config
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* adding CI test for custom pod scenario
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* fixing comment
* adding test to workflow
* adding list parsing logic for krkn hub
* parsing not needed, as input is always []
---------
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* Add rollback config
* Inject rollback handler to scenario plugin
* Add Serializer
* Add decorator
* Add test with SimpleRollbackScenarioPlugin
* Add logger for verbose debug flow
* Resolve review comment
- remove additional rollback config in config.yaml
- set KUBECONFIG to ~/.kube/config in test_rollback
* Simplify set_rollback_context_decorator
* Fix integration of rollback_handler in __load_plugins
* Refactor rollback.config module
- make it a singleton class with a register method to construct
- RollbackContext ( <timestamp>-<run_uuid> )
- add get_rollback_versions_directory to modularize the directory format
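A minimal sketch of the singleton-with-register shape these bullets describe (class and method names follow the bullets; the body is illustrative, not the actual module):

```python
import os
import time


class RollbackConfig:
    """Singleton constructed once via register(); later calls return the same instance."""

    _instance = None

    @classmethod
    def register(cls, versions_directory: str, run_uuid: str) -> "RollbackConfig":
        if cls._instance is None:
            cls._instance = cls(versions_directory, run_uuid)
        return cls._instance

    def __init__(self, versions_directory: str, run_uuid: str):
        self.versions_directory = versions_directory
        # RollbackContext: <timestamp>-<run_uuid>, fixed for the whole run
        self.context = f"{time.time_ns()}-{run_uuid}"

    def get_rollback_versions_directory(self) -> str:
        """Keep the directory format in one place instead of scattering f-strings."""
        return os.path.join(self.versions_directory, self.context)
```

Because the context is computed once at register time, every scenario plugin that asks for the versions directory lands in the same per-run folder.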
* Adapt new rollback.config
* Refactor serialization
- respect rollback_callable_name
- refactor _parse_rollback_callable_code
- refine VERSION_FILE_TEMPLATE
* Add get_scenario_rollback_versions_directory in RollbackConfig
* Add rollback in ApplicationOutageScenarioPlugin
* Add RollbackCallable and RollbackContent for type annotation
* Refactor rollback_handler with limited arguments
* Refactor the serialization for rollback
- limited arguments: callback and rollback_content, just these two!
- always construct lib_openshift and lib_telemetry in the version file
- add _parse_rollback_content_definition for retrieving scenario-specific rollback_content
- remove utils for formatting variadic functions
* Refactor application outage scenario
* Fix test_rollback
* Make RollbackContent with static fields
* simplify serialization
- Remove all unused format dynamic arguments utils
- Add jinja template for version file
- Replace set_context for serialization with passing version to serialize_callable
* Add rollback for hogs scenario
* Fix version file full path based on feedback
- {versions_directory}/<timestamp(ns)>-<run_uuid>/{scenario_type}-<timestamp(ns)>-<random_hash>.py
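The version-file path format in the bullet above can be composed directly. An illustrative sketch (the real code may derive the timestamps and hash differently):

```python
import os
import secrets
import time


def version_file_path(versions_directory: str, run_uuid: str, scenario_type: str) -> str:
    """{versions_directory}/<timestamp(ns)>-<run_uuid>/<scenario_type>-<timestamp(ns)>-<random_hash>.py"""
    run_dir = f"{time.time_ns()}-{run_uuid}"          # one folder per run
    file_name = f"{scenario_type}-{time.time_ns()}-{secrets.token_hex(4)}.py"
    return os.path.join(versions_directory, run_dir, file_name)


path = version_file_path("/tmp/rollback/versions", "run-uuid-1", "application_outages")
```

The nanosecond timestamp plus random hash keeps files from the same scenario type within one run from colliding.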
* Fix scenario plugins after rebase
* Add execute rollback
* Add CLI for list and execute rollback
* Replace subprocess with importlib
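Replacing subprocess with importlib means loading the serialized version file as a module and calling its rollback callable in-process, instead of shelling out to a new interpreter. A hedged sketch (the function name is hypothetical):

```python
import importlib.util


def execute_rollback_version_file(path: str, callable_name: str = "rollback") -> None:
    """Load a rollback version file from disk and invoke its rollback callable."""
    spec = importlib.util.spec_from_file_location("rollback_version", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)       # runs module top-level (client setup, etc.)
    getattr(module, callable_name)()      # invoke the named rollback callable
```

In-process execution keeps tracebacks and logging in the parent run and avoids serializing state across a process boundary.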
* Fix error after rebase
* fixup! Fix docstring
- Add telemetry_ocp in execute_rollback docstring
- Remove rollback_config in create_plugin docstring
- Remove scenario_types in set_rollback_callable docstring
* fixup! Replace os.urandom with krkn_lib.utils.get_random_string
* fixup! Add missing telemetry_ocp for execute_rollback_version_files
* fixup! Remove redundant import
- Remove duplicate TYPE_CHECKING in handler module
- Remove cast in signal module
- Remove RollbackConfig in scenario_plugin_factory
* fixup! Replace sys.exit(1) with return
* fixup! Remove duplicate rollback_network_policy
* fixup! Decouple Serializer initialization
* fixup! Rename callback to rollback_callable
* fixup! Refine comment for constructing AbstractScenarioPlugin with
placeholder value
* fixup! Add version in docstring
* fixup! Remove uv.lock
This commit updates the fedora tools image reference used by the network scenarios
to the one hosted in the krkn-chaos quay org. It also fixes RHACS flagging runs
that use the latest tag, by switching to the tools tag instead.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* Disable SSL verification for IBM node scenarios and fix node reboot scenario
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* adding disable ssl as a scenario parameter for ibmcloud
Signed-off-by: Sahil Shah <sahshah@redhat.com>
---------
Signed-off-by: Sahil Shah <sahshah@redhat.com>
Fix the logic in the disk disruption scenario, which was not returning the right set of disks to be off-lined.
Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
- Implemented methods for detaching and attaching disks to baremetal nodes.
- Added a new scenario `node_disk_detach_attach_scenario` to manage disk operations.
- Updated the YAML configuration to include the new scenario with disk details.
Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
Introduce a delay in network scenarios prior to imposing restrictions.
This ensures that chaos test case jobs are scheduled before any restrictions are put in place.
Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
This will enable users and organizations to share their Krkn adoption
journey for their chaos engineering use cases.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit adds a policy on how Krkn follows best practices and
addresses security vulnerabilities.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* Hog scenario porting from arcaflow to native (#748)
* added new native hog scenario
* removed arcaflow dependency + legacy hog scenarios
* config update
* changed hog configuration structure + added average samples
* fix on cpu count
* removes tripledes warning
* changed selector format
* changed selector syntax
* number of nodes option
* documentation
* functional tests
* exception handling on hog deployment thread
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* adding vsphere updates to non native
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* adding node id to affected node
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Fixed the spelling mistake
Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* adding v4.0.8 version (#756)
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Add autodetecting distribution (#753)
Used is_openshift function from krkn lib
Remove distribution from config
Remove distribution from documentation
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* initial version of health checks
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changes for appending success response and health check config format
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changes include health check doc and exit_on_failure config
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Update config.yaml
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Added the health check config in functional test config
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Modified the health checks documentation
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changes for debugging the functional test failing
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* changed the code for debugging in run_test.sh
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Debugging
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Removed the functional test running line
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Removing the health check config in common_test_config for debugging
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Fixing functional test failure
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Removing the changes that are added for debugging
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* few modifications
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Renamed timestamp
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changed the start timestamp and end timestamp data type to the datetime
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* passing the health check response as HealthCheck object
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Updated the krkn-lib version in requirements.txt
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changed the coverage
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
---------
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
Co-authored-by: Paige Patton <prubenda@redhat.com>
Co-authored-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
Co-authored-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
This commit adds recommendation to test and ensure Pod Disruption
Budgets are set for critical applications to avoid downtime.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
Used is_openshift function from krkn lib
Remove distribution from config
Remove distribution from documentation
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
This is needed to avoid issues caused by comparing two different data types:
TypeError: Invalid comparison between dtype=float64 and str. This commit also
avoids setting defaults for the thresholds, making it mandatory for users to
define them since they play a key role in determining the outliers.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
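The dtype mismatch above is easy to reproduce: a threshold read from config as a string cannot be compared against a float64 column. A small illustration of the failure mode and the coercion fix (the data values are made up):

```python
import pandas as pd

metrics = pd.Series([0.12, 3.4, 0.2], name="latency")

# Thresholds loaded from YAML or env vars often arrive as strings; comparing
# a float64 Series against a str raises:
#   TypeError: Invalid comparison between dtype=float64 and str
threshold = "1.5"

threshold = float(threshold)  # coerce to the Series dtype before comparing
outliers = metrics[metrics > threshold]
print(outliers.tolist())  # [3.4]
```

Refusing to default the threshold (raising if it is missing) forces users to pick a value that makes sense for their workload instead of silently inheriting one.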
* Document how to use Google's credentials associated with a user account
Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
* Change API from 'Google API Client' to 'Google Cloud Python Client'
According to the 'Google API Client' GH page:
```
This library is considered complete and is in maintenance mode. This means
that we will address critical bugs and security issues but will not add any
new features.
This library is officially supported by Google. However, the maintainers of
this repository recommend using Cloud Client Libraries for Python, where
possible, for new code development.
```
So change the code accordingly to adapt it to 'Google Cloud Python Client'.
Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
---------
Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
* Add support for user-provided default network ACL
Signed-off-by: henrick <self@thehenrick.com>
* Add logs to notify user when their provided acl is used
Signed-off-by: henrick <self@thehenrick.com>
* Update docs to include optional default_acl_id parameter in zone_outage
Signed-off-by: henrick <self@thehenrick.com>
---------
Signed-off-by: henrick <self@thehenrick.com>
Co-authored-by: henrick <self@thehenrick.com>
This is needed for the TRT/component readiness integration to improve
dashboard readability and tie results back to chaos.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* add workflows
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* update readme
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* rm my kubeconfig path
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* add workflow details to readme
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* mv arcaflow to utils
Signed-off-by: Matthew F Leader <mleader@redhat.com>
---------
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* adding aws bare metal
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
* no found reservations
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
---------
Co-authored-by: Auto User <auto@users.noreply.github.com>
* adding elastic set to none
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Auto User <auto@users.noreply.github.com>
* too many ls
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
---------
Signed-off-by: Auto User <auto@users.noreply.github.com>
Co-authored-by: Auto User <auto@users.noreply.github.com>
This option is enabled only for node_stop_start scenario where
user will want to stop the node for certain duration to understand
the impact before starting the node back on. This commit also bumps
the timeout for the scenario to 360 seconds from 120 seconds to make
sure there's enough time for the node to get to Ready state from the
Kubernetes side after the node is started on the infra side.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit removes the instructions on running krkn as kubernetes
deployment as it is not supported/maintained and also not recommended.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit:
- Also switches the rate queries severity to critical as 5%
threshold is high for low scale/density clusters and needs to be flagged.
- Adds rate queries to openshift alerts file
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit also deprecates building container image for ppc64le as it
is not actively maintained. We will add support if users request for it
in the future.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
We are not using it in the krkn code base and removing it fixes one
of the license issues reported by FOSSA. This commit also removes
setting up dependencies using docker/podman compose as it is not
actively maintained.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
Avoids architecture issues such as "bash: /usr/bin/az: cannot execute: required file not found"
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* fixes system and oc vulnerabilities detected by trivy
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
* updated base image to run as krkn user instead of root
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
---------
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Added network_chaos to the plugin step, made the job wait time based on the test duration, and set the default wait_time to 30s
Signed-off-by: yogananth subramanian <ysubrama@redhat.com>
This will make sure oc and kubectl clients are accessible for users
with both /usr/bin and /usr/local/bin paths set on the host.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
Terminal output changed to use a JSON structure.
The JSON output file names are in the format
recommender_namespace_YYYY-MM-DD_HH-MM-SS.
The path to the JSON file can be specified. The default path is
kraken/utils/chaos_recommender/recommender_output.
Signed-off-by: jtydlcak <139967002+jtydlack@users.noreply.github.com>
This covers use case where user wants to just check for critical alerts
post chaos without having to enable the alerts evaluation feature which
evaluates prom queries specified in an alerts file.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* taking out start and end time
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* adding only break when alert fires
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* fail at end if alert had fired
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* adding new krkn-lib function with no range
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* updating requirements to new krkn-lib
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
---------
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* Fix github.io link in README.md
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* Fix krknChaos-hub link in README.md
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* Fix kube-burner link in README.md
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---------
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
The scenario introduces network latency, packet loss, and bandwidth restriction in the Pod's network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
Below example config applies ingress traffic shaping to openshift console.
````
- id: pod_ingress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which the filter needs to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to ingress traffic from the pod.
````
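Values such as `500ms` in `network_params` follow tc-style duration notation. As a rough illustration (this helper is hypothetical, not part of krkn), such strings can be normalized to milliseconds before handing them to traffic-shaping tooling:

```python
import re

def latency_to_ms(value: str) -> float:
    """Convert a tc-style latency string ("500ms", "2s") to milliseconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)(ms|s)\s*", value)
    if not match:
        raise ValueError(f"unsupported latency value: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    return number * 1000 if unit == "s" else number

print(latency_to_ms("500ms"))  # 500.0
```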
* basic structure working
* config and options refactoring
nits and changes
* removed unused function with typo + fixed duration
* removed unused arguments
* minor fixes
* adding service disruption
* fixing kill services
* service log changes
* removing extra logging
* adding daemon set
* adding service disruption name changes
* cerberus config back
* bad string
The scenario introduces network latency, packet loss, and bandwidth restriction in the Pod's network interface.
The purpose of this scenario is to observe faults caused by random variations in the network.
Below example config applies egress traffic shaping to openshift console.
````
- id: pod_egress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which the filter needs to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to egress traffic from the pod.
````
This makes sure latest clients are installed and used:
- This will avoid compatibility issues with the server
- Fixes security vulnerabilities and CVEs
This commit:
- Also sets appropriate severity to avoid false failures for the
test cases, especially given that these are monitored during the chaos
vs post chaos. Critical alerts are all monitored post chaos, with a few
monitored during the chaos that represent the overall health and performance
of the service.
- Renames Alerts to SLOs validation
Metrics reference: f09a492b13/cmd/kube-burner/ocp-config/alerts.yml
* Include check for inside k8s scenario
* Include check for inside k8s scenario (2)
* Include check for inside k8s scenario (3)
* Include check for inside k8s scenario (4)
This is the first step towards the goal to only have metrics tracking
the overall health and performance of the component/cluster. For instance,
for etcd disruption scenarios, leader elections are expected; we should
track etcd leader availability and fsync latency under the critical category
rather than leader elections.
Pod network outage chaos scenario blocks traffic at pod level irrespective of the network policy used.
With the current network policies, it is not possible to explicitly block ports which are enabled
by allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules
to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
Below example config blocks access to openshift console.
````
- id: pod_network_outage
config:
namespace: openshift-console
direction:
- ingress
ingress_ports:
- 8443
label_selector: 'component=ui'
````
* kubeconfig management for arcaflow + hogs scenario refactoring
* kubeconfig authentication parsing refactored to support arcaflow kubernetes deployer
* reimplemented all the hog scenarios to allow multiple parallel containers of the same scenarios
(eg. to stress two or more nodes in the same run simultaneously)
* updated documentation
* removed sysbench scenarios
* recovered cpu hogs
* updated requirements.txt
* updated config.yaml
* added gitleaks file for test fixtures
* imported sys and logging
* removed config_arcaflow.yaml
* updated readme
* refactored arcaflow documentation entrypoint
Also renames retry_wait to expected_recovery_time to make it clear that
Kraken will exit 1 if the container doesn't recover within the expected
time.
Fixes https://github.com/redhat-chaos/krkn/issues/414
This commit enables users to opt in to check for critical alerts firing
in the cluster post chaos at the end of each scenario. Chaos scenario is
considered as failed if the cluster is unhealthy in which case user can
start debugging to fix and harden respective areas.
Fixes https://github.com/redhat-chaos/krkn/issues/410
Moving the content around installing kraken using helm to the
chaos in practice section of the guide to showcase how startx-lab
is deploying and leveraging Kraken.
* Added some bits and pieces to the krkn k8s installation to make it easier
* updated k8s/Oc installation documentation
* gitignore
* doc reorg
* fixed numbering + removed italic
Co-authored-by: Tullio Sebastiani <tullio.sebastiani@x3solutions.it>
Previously the test was looking for the master label.
Recent Kubernetes uses the control-plane label instead.
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
As it says:
Pod scenarios have been removed, please use plugin_scenarios
with the kill-pods configuration instead.
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
Documentation says we default to ~ for looking up the kubernetes config
but then we set everywhere /root. Fixed the config to really look for ~.
Should solve #327.
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
This release includes the changes needed for the customer as well as a
number of other fixes and enhancements:
- Support for VMware node scenarios
- Support for ingress traffic shaping
- Other changes can be found at https://github.com/redhat-chaos/krkn/releases/tag/v1.1.0
<!-- Provide a brief description of the changes made in this PR. -->
## Related Tickets & Documents
If there is no related issue, please create one and start the conversation there first.
- Related Issue #:
- Closes #:
# Documentation
- [ ] **Is documentation needed for this update?**
If checked, a documentation PR must be created and merged in the [website repository](https://github.com/krkn-chaos/website/).
## Related Documentation PR (if applicable)
<!-- Add the link to the corresponding documentation PR in the website repository -->
# Checklist before requesting a review
- [ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See [contributing guidelines](https://krkn-chaos.dev/docs/contribution-guidelines/)
See [testing your changes](https://krkn-chaos.dev/docs/developers-guide/testing-changes/) and run on any Kubernetes or OpenShift cluster to validate your changes
- [ ] I have performed a self-review of my code by running krkn and specific scenario
- [ ] If it is a core feature, I have added thorough unit tests with above 80% coverage
*REQUIRED*:
Describe the combination of tests performed and include the output of the run
```bash
python run_kraken.py
...
<---insert test results output--->
```
OR
```bash
python -m coverage run -a -m unittest discover -s tests -v
```
This is a list of organizations that have publicly acknowledged usage of Krkn and shared details of how they are leveraging it in their environment for chaos engineering use cases. Do you want to add yourself to this list? Please fork the repository and open a PR with the required change.
| Organization | Since | Website | Use-Case |
|:-|:-|:-|:-|
| MarketAxess | 2024 | https://www.marketaxess.com/ | Kraken enables us to achieve our goal of increasing the reliability of our cloud products on Kubernetes. The tool allows us to automatically run various chaos scenarios, identify resilience and performance bottlenecks, and seamlessly restore the system to its original state once scenarios finish. These chaos scenarios include pod disruptions, node (EC2) outages, simulating availability zone (AZ) outages, and filling up storage spaces like EBS and EFS. The community is highly responsive to requests and works on expanding the tool's capabilities. MarketAxess actively contributes to the project, adding features such as the ability to leverage existing network ACLs and proposing several feature improvements to enhance test coverage. |
| Red Hat Openshift | 2020 | https://www.redhat.com/ | Kraken is a highly reliable chaos testing tool used to ensure the quality and resiliency of Red Hat Openshift. The engineering team runs all the test scenarios under Kraken on different cloud platforms on both self-managed and cloud services environments prior to the release of a new version of the product. The team also contributes to the Kraken project consistently which helps the test scenarios to keep up with the new features introduced to the product. Inclusion of this test coverage has contributed to gaining the trust of new and existing customers of the product. |
| IBM | 2023 | https://www.ibm.com/ | While working on AI for Chaos Testing at IBM Research, we closely collaborated with the Kraken (Krkn) team to advance intelligent chaos engineering. Our contributions included developing AI-enabled chaos injection strategies and integrating reinforcement learning (RL)-based fault search techniques into the Krkn tool, enabling it to identify and explore system vulnerabilities more efficiently. Kraken stands out as one of the most user-friendly and effective tools for chaos engineering, and the Kraken team’s deep technical involvement played a crucial role in the success of this collaboration—helping bridge cutting-edge AI research with practical, real-world system reliability testing. |
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set.
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error.
alert_profile: config/alerts # Path to the alert profile with the prometheus queries.
enable_alerts: True # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
enable_metrics: True
alert_profile: config/alerts.yaml # Path or URL to the alert profile with the prometheus queries
metrics_profile: config/metrics-report.yaml
check_critical_alerts: True # Checks for critical alerts firing in the cluster post chaos; the run fails if any are firing.
tunings:
    wait_duration: 6 # Duration to wait between each chaos scenario.
    iterations: 1 # Number of times to execute the scenarios.
    daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever.
telemetry:
    enabled: False # enable/disable the telemetry collection feature
    api_url: https://yvnn4rfoi7.execute-api.us-west-2.amazonaws.com/test # telemetry service endpoint
    username: $TELEMETRY_USERNAME # telemetry service username
    password: $TELEMETRY_PASSWORD # telemetry service password
elastic_url: "https://192.168.39.196" # To track results in elasticsearch, give the url to the server here; will post telemetry details when url and index are not blank
elastic_port: 32766
username: "elastic"
password: "test"
metrics_index: "krkn-metrics"
alerts_index: "krkn-alerts"
telemetry_index: "krkn-telemetry"
health_checks: # Utilizing health check endpoints to observe application behavior during chaos injection.
    interval: # Interval in seconds to perform health checks, default value is 2 seconds
    config: # Provide a list of health check configurations for applications
      - url: # Provide the application endpoint
        bearer_token: # Bearer token for authentication if any
        auth: # Provide authentication credentials (username, password) in tuple format if any, e.g. ("admin","secretpassword")
        exit_on_failure: # If True, exits when a health check fails for an application; values can be True/False
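The interval and `exit_on_failure` semantics above can be sketched as follows. This is an illustrative sketch only, not krkn's actual implementation; the HTTP probe is injected so the loop can be exercised without a live endpoint:

```python
import time

def run_health_checks(configs, probe, interval=2, iterations=3):
    """Poll each configured endpoint; return per-url failure counts.

    `probe(cfg)` returns True when the endpoint is healthy. Raises
    RuntimeError on the first failure when exit_on_failure is set.
    """
    failures = {cfg["url"]: 0 for cfg in configs}
    for _ in range(iterations):
        for cfg in configs:
            if not probe(cfg):
                failures[cfg["url"]] += 1
                if cfg.get("exit_on_failure"):
                    raise RuntimeError(f"health check failed for {cfg['url']}")
        time.sleep(interval)
    return failures

# Stub probe: pretend the second endpoint is down.
checks = [{"url": "http://app-a/health"}, {"url": "http://app-b/health"}]
result = run_health_checks(checks, probe=lambda c: c["url"].endswith("a/health"),
                           interval=0, iterations=2)
print(result)  # {'http://app-a/health': 0, 'http://app-b/health': 2}
```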
["${PAYLOAD_PATCH_1//[$'\t\r\n ']}"=="${OUT_PATCH//[$'\t\r\n ']}"]&&echo"Step 1 PATCH Payload OK"||(echo"Payload did not match. Test failed."&&exit 1)
["$OUT_STATUS_CODE"=="$STATUS_CODE_PATCH_1"]&&echo"Step 1 PATCH Status Code OK"||(echo"Step 1 PATCH status code did not match. Test failed."&&exit 1)
["$OUT_CONTENT"=="$TEXT_MIME"]&&echo"Step 1 PATCH MIME OK"||(echo" Step 1 PATCH MIME did not match. Test failed."&&exit 1)
# wait for the next step
sleep 16
#Checking Step 2 GET on /list/index.php
OUT_GET="`curl -X GET -s $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X GET -s -o /dev/null -I -w "%{content_type}"$SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X GET -s -o /dev/null -I -w "%{http_code}"$SERVICE_URL/list/index.php`
["${PAYLOAD_GET_2//[$'\t\r\n ']}"=="${OUT_GET//[$'\t\r\n ']}"]&&echo"Step 2 GET Payload OK"||(echo"Step 2 GET Payload did not match. Test failed."&&exit 1)
["$OUT_STATUS_CODE"=="$STATUS_CODE_GET_2"]&&echo"Step 2 GET Status Code OK"||(echo"Step 2 GET status code did not match. Test failed."&&exit 1)
["$OUT_CONTENT"=="$JSON_MIME"]&&echo"Step 2 GET MIME OK"||(echo" Step 2 GET MIME did not match. Test failed."&&exit 1)
#Checking Step 2 POST on /list/index.php
OUT_POST="`curl -s -X POST $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X POST -s -o /dev/null -I -w "%{content_type}"$SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X POST -s -o /dev/null -I -w "%{http_code}"$SERVICE_URL/list/index.php`
["${PAYLOAD_POST_2//[$'\t\r\n ']}"=="${OUT_POST//[$'\t\r\n ']}"]&&echo"Step 2 POST Payload OK"||(echo"Step 2 POST Payload did not match. Test failed."&&exit 1)
["$OUT_STATUS_CODE"=="$STATUS_CODE_POST_2"]&&echo"Step 2 POST Status Code OK"||(echo"Step 2 POST status code did not match. Test failed."&&exit 1)
["$OUT_CONTENT"=="$TEXT_MIME"]&&echo"Step 2 POST MIME OK"||(echo" Step 2 POST MIME did not match. Test failed."&&exit 1)
This guide explains how to add a new chaos scenario test to the v2 pytest framework. The layout is **folder-per-scenario**: each scenario has its own directory under `scenarios/<scenario_name>/` containing the test file, Kubernetes resources, and the Krkn scenario base YAML.
- `CI/tests_v2/scenarios/service_hijacking/test_service_hijacking.py` — A test class extending `BaseScenarioTest` with a stub `test_happy_path` and `WORKLOAD_MANIFEST` pointing to the folder’s `resource.yaml`.
- `CI/tests_v2/scenarios/service_hijacking/resource.yaml` — A placeholder Deployment (namespace is patched at deploy time).
- `CI/tests_v2/scenarios/service_hijacking/scenario_base.yaml` — A placeholder Krkn scenario; edit this with the structure expected by your scenario type.
The script automatically registers the marker in `CI/tests_v2/pytest.ini`. For example, it adds:
```
service_hijacking: marks a test as a service_hijacking scenario test
```
**Next steps after scaffolding:**
1. Verify the marker was added to `pytest.ini` (the scaffold does this automatically).
2. Edit `scenario_base.yaml` with the structure your Krkn scenario type expects (see `scenarios/application_outage/scenario_base.yaml` and `scenarios/pod_disruption/scenario_base.yaml` for examples). The top-level key should match `SCENARIO_NAME`.
3. If your scenario uses a **list** structure (like pod_disruption) instead of a **dict** with a top-level key, set `NAMESPACE_KEY_PATH` (e.g. `[0, "config", "namespace_pattern"]`) and `NAMESPACE_IS_REGEX = True` if the namespace is a regex pattern.
4. The generated `test_happy_path` already uses `self.run_scenario(self.tmp_path, ns)` and assertions. Add more test methods (e.g. negative tests with `@pytest.mark.no_workload`) as needed.
5. Adjust `resource.yaml` if your scenario needs a different workload (e.g. specific image or labels).
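The dict-vs-list distinction in `NAMESPACE_KEY_PATH` boils down to walking a mixed path of dict keys and list indices. A minimal sketch (the helper name here is hypothetical; the framework's actual patching logic may differ):

```python
def set_by_path(doc, path, value):
    """Walk dict keys / list indices in `path` and set the final field."""
    node = doc
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return doc

# Dict-based scenario with a top-level key:
dict_scenario = {"application_outage": {"namespace": "placeholder"}}
set_by_path(dict_scenario, ["application_outage", "namespace"], "krkn-test-abc")

# List-based scenario (like pod_disruption) addressed by index first:
list_scenario = [{"config": {"namespace_pattern": "^placeholder$"}}]
set_by_path(list_scenario, [0, "config", "namespace_pattern"], "^krkn-test-.*$")
```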
If your Kraken scenario type string is not `<scenario>_scenarios`, pass it explicitly:
Kubernetes manifest(s) for the workload (Deployment or Pod). Use a distinct label (e.g. `app: <scenario>-target`). Omit or leave `metadata.namespace`; the framework patches it at deploy time.
3. **Add scenario_base.yaml**
The canonical Krkn scenario structure. Tests will load this, patch namespace (and any overrides), write to `tmp_path`, and pass to `build_config`. See existing scenarios for the format your scenario type expects.
4. **Add test_<scenario>.py**
- Import `BaseScenarioTest` from `lib.base` and helpers from `lib.utils` (e.g. `assert_kraken_success`, `get_pods_list`, `scenario_dir` if needed).
- Define a class extending `BaseScenarioTest` with:
  - `NAMESPACE_KEY_PATH`: path to the namespace field (e.g. `["application_outage", "namespace"]` for dict-based, or `[0, "config", "namespace_pattern"]` for list-based)
  - `NAMESPACE_IS_REGEX = False` (or `True` for regex patterns like pod_disruption)
  - `OVERRIDES_KEY_PATH = ["<top-level key>"]` if the scenario supports overrides (e.g. duration, block).
- Add `@pytest.mark.functional` and `@pytest.mark.<scenario>` on the class.
- In at least one test, call `self.run_scenario(self.tmp_path, self.ns)` and assert with `assert_kraken_success`, `assert_pod_count_unchanged`, and `assert_all_pods_running_and_ready`. Use `self.k8s_core`, `self.tmp_path`, etc. (injected by the base class).
5. **Register the marker**
In `CI/tests_v2/pytest.ini`, under `markers`:
```
<scenario>: marks a test as a <scenario> scenario test
```
## Conventions
- **Folder-per-scenario**: One directory per scenario under `scenarios/`. All assets (test, resource.yaml, scenario_base.yaml, and any extra YAMLs) live there for easy tracking and onboarding.
- **Ephemeral namespace**: Every test gets a unique `krkn-test-<uuid>` namespace. The base class deploys the workload into it before the test; no manual deploy is required.
- **Negative tests**: For tests that don’t need a workload (e.g. invalid scenario, bad namespace), use `@pytest.mark.no_workload`. The test will still get a namespace but no workload will be deployed.
- **Scenario type**: `SCENARIO_TYPE` must match the key in Kraken’s config (e.g. `application_outages_scenarios`, `pod_disruption_scenarios`). See `CI/tests_v2/config/common_test_config.yaml` and the scenario plugin’s `get_scenario_types()`.
- **Assertions**: Use `assert_kraken_success(result, context=f"namespace={ns}", tmp_path=self.tmp_path)` so failures include stdout/stderr and optional log files.
- **Timeouts**: Use constants from `lib.base` (`READINESS_TIMEOUT`, `POLICY_WAIT_TIMEOUT`, etc.) instead of magic numbers.
## Exit Code Handling
Kraken uses the following exit codes: **0** = success; **1** = scenario failure (e.g. post scenarios still failing); **2** = critical alerts fired; **3+** = health check / KubeVirt check failures; **-1** = infrastructure error (bad config, no kubeconfig).
- **Happy-path tests**: Use `assert_kraken_success(result, ...)`. By default only exit code 0 is accepted.
- **Alert-aware tests**: If you enable `check_critical_alerts` and expect alerts, use `assert_kraken_success(result, allowed_codes=(0, 2), ...)` so exit code 2 is treated as acceptable.
- **Expected-failure tests**: Use `assert_kraken_failure(result, context=..., tmp_path=self.tmp_path)` for negative tests (invalid scenario, bad namespace, etc.). This gives the same diagnostic quality (log dump, tmp_path hint) as success assertions. Prefer this over a bare `assert result.returncode != 0`.
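The `allowed_codes` idea above can be reduced to a small check. This is only a sketch; the real `assert_kraken_success` in `lib.utils` also handles log dumps and `tmp_path` hints:

```python
# Exit code meanings as documented for Kraken (illustrative mapping).
EXIT_MEANINGS = {0: "success", 1: "scenario failure", 2: "critical alerts fired"}

def check_exit_code(returncode, allowed_codes=(0,), context=""):
    """Raise AssertionError when the exit code is not in the allowed set."""
    if returncode not in allowed_codes:
        meaning = EXIT_MEANINGS.get(returncode, "health check / infrastructure error")
        raise AssertionError(
            f"Kraken exited {returncode} ({meaning}); allowed {allowed_codes}. {context}")

check_exit_code(0)                        # happy path
check_exit_code(2, allowed_codes=(0, 2))  # alert-aware test
```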
## Running your new tests
```bash
pytest CI/tests_v2/ -v -m <scenario>
```
For debugging with logs and keeping failed namespaces:
| Env var overrides | `KRKN_TEST_<NAME>` | `KRKN_TEST_READINESS_TIMEOUT` |
### Folders
- One folder per scenario under `scenarios/`. The folder name is `snake_case` and must match the `SCENARIO_NAME` class attribute in the test.
- Shared framework code lives in `lib/`. Each module covers a single concern (`k8s`, `namespace`, `deploy`, `kraken`, `utils`, `base`, `preflight`).
- Do **not** add scenario-specific code to `lib/`; keep it in the scenario folder as module-level helpers.
### Files
- Test files: `test_<scenario>.py`. This is required for pytest discovery (`test_*.py`).
- Workload manifests: always `resource.yaml`. If a scenario needs additional K8s resources (e.g. a Service for traffic testing), use a descriptive name like `nginx_http.yaml`.
- Scenario config: always `scenario_base.yaml`. This is the template that `load_and_patch_scenario` loads and patches.
### Classes
- One test class per file: `Test<CamelCase>` extending `BaseScenarioTest`.
- The CamelCase name must be the PascalCase equivalent of the folder name (e.g. `pod_disruption` -> `TestPodDisruption`).
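The folder-to-class naming convention can be checked mechanically. An illustrative helper (not part of the framework):

```python
def to_test_class_name(folder: str) -> str:
    """Map a snake_case scenario folder name to its expected test class name."""
    return "Test" + "".join(part.capitalize() for part in folder.split("_"))

print(to_test_class_name("pod_disruption"))      # TestPodDisruption
print(to_test_class_name("application_outage"))  # TestApplicationOutage
```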
### Test Methods
- Prefix: `test_` (pytest requirement).
- Use descriptive names that convey **what is being verified**, not implementation details.
- **Public fixtures** (intended for use in tests): use `<verb>_<noun>` or plain `<noun>`. Examples: `run_kraken`, `deploy_workload`, `test_namespace`, `kubectl`.
- K8s client fixtures use the `k8s_` prefix: `k8s_core`, `k8s_apps`, `k8s_networking`, `k8s_client`.
### Helpers and Utilities
- **Assertions**: `assert_<what_is_expected>`. Always raise `AssertionError` with a message that includes the namespace.
- **K8s queries**: `get_<resource>_list` for direct API calls, `find_<resource>_by_<criteria>` for filtered lookups.
- **Private helpers**: prefix with `_` for module-internal functions (e.g. `_pods`, `_policies`, `_get_nested`).
### Constants and Environment Variables
- Timeout constants: `UPPER_CASE` in `lib/base.py`. Each is overridable via an env var prefixed `KRKN_TEST_`.
- Feature flags: `KRKN_TEST_DRY_RUN`, `KRKN_TEST_COVERAGE`. Always use the `KRKN_TEST_` prefix so all tunables are discoverable with `grep KRKN_TEST_`.
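The `KRKN_TEST_`-prefixed override pattern looks roughly like this (a sketch; constant names are illustrative, see `lib/base.py` for the real ones):

```python
import os

def timeout(name: str, default: int) -> int:
    """Read KRKN_TEST_<NAME> from the environment, else fall back to default."""
    return int(os.environ.get(f"KRKN_TEST_{name}", default))

# Each timeout constant gets an env-var escape hatch:
READINESS_TIMEOUT = timeout("READINESS_TIMEOUT", 120)

os.environ["KRKN_TEST_DEPLOY_TIMEOUT"] = "45"
print(timeout("DEPLOY_TIMEOUT", 90))  # 45
```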
### Markers
- Every test class gets `@pytest.mark.functional` (framework-wide) and `@pytest.mark.<scenario>` (scenario-specific).
- The scenario marker name matches the folder name exactly.
- Behavioral modifiers use plain descriptive names: `no_workload`, `order`.
- Register all custom markers in `pytest.ini` to avoid warnings.
## Adding Dependencies
- **Runtime (Kraken needs it)**: Add to the **root** `requirements.txt`. Pin a version (e.g. `package==1.2.3` or `package>=1.2,<2`).
- **Test-only (only CI/tests_v2 needs it)**: Add to **`CI/tests_v2/requirements.txt`**. Pin a version there as well.
- After changing either file, run `make setup` (or `make -f CI/tests_v2/Makefile setup`) from the repo root to verify both files install cleanly together.
This directory contains a pytest-based functional test framework that runs **alongside** the existing bash tests in `CI/tests/`. It covers the **pod disruption** and **application outage** scenarios with proper assertions, retries, and reporting.
Each test runs in its **own ephemeral Kubernetes namespace** (`krkn-test-<uuid>`). Before the test, the framework creates the namespace, deploys the target workload, and waits for pods to be ready. After the test, the namespace is deleted (cascading all resources). **You do not need to deploy any workloads manually.**
## Prerequisites
Without a cluster, tests that need one will **skip** with a clear message (e.g. *"Could not load kube config"*). No manual workload deployment is required; workloads are deployed automatically into ephemeral namespaces per test.
- **KinD cluster** (or any Kubernetes cluster) running with `kubectl` configured (e.g. `KUBECONFIG` or default `~/.kube/config`).
- **Python 3.9+** and main repo deps: `pip install -r requirements.txt`.
### Supported clusters
- **KinD** (recommended): Use `make -f CI/tests_v2/Makefile setup` from the repo root. Fastest for local dev; uses a 2-node dev config by default. Override with `KIND_CONFIG=/path/to/kind-config.yml` for a larger cluster.
- **Minikube**: Should work; ensure `kubectl` context is set. Not tested in CI.
- **Remote/cloud cluster**: Tests create and delete namespaces; use with caution. Use `--require-kind` to avoid accidentally running against production (tests will skip unless context is kind/minikube).
### Setting up the cluster
**Option A: Use the setup script (recommended)**
From the repository root, with `kind` and `kubectl` installed:
```bash
# Create KinD cluster (defaults to CI/tests_v2/kind-config-dev.yml; override with KIND_CONFIG=...)
./CI/tests_v2/setup_env.sh
```
Then in the same shell (or after `export KUBECONFIG=~/.kube/config` in another terminal), activate your venv and install Python deps:
```bash
python3 -m venv venv
source venv/bin/activate # or: source venv/Scripts/activate on Windows
pip install -r requirements.txt
pip install -r CI/tests_v2/requirements.txt
```
**Option B: Manual setup**
1. Install [kind](https://kind.sigs.k8s.io/docs/user/quick-start/) and [kubectl](https://kubernetes.io/docs/tasks/tools/).
4. Create a virtualenv, activate it, and install dependencies (as in Option A).
5. Run tests from repo root: `pytest CI/tests_v2/ -v ...`
## Install test dependencies
From the repository root:
```bash
pip install -r CI/tests_v2/requirements.txt
```
This adds `pytest-rerunfailures`, `pytest-html`, `pytest-timeout`, and `pytest-order` (pytest and coverage come from the main `requirements.txt`).
## Dependency Management
Dependencies are split into two files:
- **Root `requirements.txt`** — Kraken runtime (cloud SDKs, Kubernetes client, krkn-lib, pytest, coverage, etc.). Required to run Kraken.
- **`CI/tests_v2/requirements.txt`** — Test-only pytest plugins (rerunfailures, html, timeout, order, xdist). Not needed by Kraken itself.
**Rule of thumb:** If Kraken needs it at runtime, add to root. If only the functional tests need it, add to `CI/tests_v2/requirements.txt`.
Running `make -f CI/tests_v2/Makefile setup` (or `make setup` from `CI/tests_v2`) creates the venv and installs **both** files automatically; you do not need to install them separately. The Makefile re-installs when either file changes (via the `.installed` sentinel).
## Run tests
All commands below are from the **repository root**.
- Failed tests are **retried up to 2 times** with a 10s delay (configurable in `CI/tests_v2/pytest.ini`).
- Each test has a **5-minute timeout**.
- Open `CI/tests_v2/report.html` in a browser for a detailed report.
### Run in parallel (faster suite)
```bash
pytest CI/tests_v2/ -v -n 4 --timeout=300
```
Ephemeral namespaces make tests parallel-safe; use `-n` with the number of workers (e.g. 4).
### Run without retries (for debugging)
```bash
pytest CI/tests_v2/ -v -p no:rerunfailures
```
### Run with coverage
```bash
python -m coverage run -m pytest CI/tests_v2/ -v
python -m coverage report
```
To append to existing coverage from unit tests, ensure coverage was started with `coverage run -a` for earlier runs, or run the full test suite in one go.
### Run only pod disruption tests
```bash
pytest CI/tests_v2/ -v -m pod_disruption
```
### Run only application outage tests
```bash
pytest CI/tests_v2/ -v -m application_outage
```
### Run with verbose output and no capture
```bash
pytest CI/tests_v2/ -v -s
```
### Keep failed test namespaces for debugging
When a test fails, its ephemeral namespace is normally deleted. To **keep** the namespace so you can inspect pods, logs, and network policies:
```bash
pytest CI/tests_v2/ -v --keep-ns-on-fail
```
On failure, the namespace name is printed (e.g. `[keep-ns-on-fail] Keeping namespace krkn-test-a1b2c3d4 for debugging`). Use `kubectl get pods -n krkn-test-a1b2c3d4` (and similar) to debug, then delete the namespace manually when done.
### Logging and cluster options
- **Structured logging**: Use `--log-cli-level=DEBUG` to see namespace creation, workload deploy, and readiness in the console. Use `--log-file=test.log` to capture logs to a file.
- **Require dev cluster**: To avoid running against the wrong cluster, use `--require-kind`. Tests will skip unless the current kube context cluster name contains "kind" or "minikube".
- **Stale namespace cleanup**: At session start, namespaces matching `krkn-test-*` that are older than 30 minutes are deleted (e.g. from a previous crashed run).
- **Timeout overrides**: Set env vars to tune timeouts (e.g. in CI): `KRKN_TEST_READINESS_TIMEOUT`, `KRKN_TEST_DEPLOY_TIMEOUT`, `KRKN_TEST_NS_CLEANUP_TIMEOUT`, `KRKN_TEST_POLICY_WAIT_TIMEOUT`, `KRKN_TEST_KRAKEN_PROC_WAIT_TIMEOUT`, `KRKN_TEST_TIMEOUT_BUDGET`.
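A sketch of how such env-driven overrides typically work — the default values shown here are illustrative, not the actual constants in `lib/base.py`:

```python
import os

def timeout_from_env(name: str, default: int) -> int:
    """Read a timeout override (in seconds) from the environment,
    falling back to the built-in default when the variable is unset."""
    return int(os.environ.get(name, default))

# Hypothetical defaults; the real values live in lib/base.py.
READINESS_TIMEOUT = timeout_from_env("KRKN_TEST_READINESS_TIMEOUT", 120)
DEPLOY_TIMEOUT = timeout_from_env("KRKN_TEST_DEPLOY_TIMEOUT", 60)
```

So `KRKN_TEST_READINESS_TIMEOUT=300 pytest CI/tests_v2/ -v` stretches only the readiness wait.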
## Architecture
- **Folder-per-scenario**: Each scenario lives under `scenarios/<scenario_name>/` with:
- **test_<scenario>.py** — Test class extending `BaseScenarioTest`; sets `WORKLOAD_MANIFEST`, `SCENARIO_NAME`, `SCENARIO_TYPE`, `NAMESPACE_KEY_PATH`, and optionally `OVERRIDES_KEY_PATH`.
- **resource.yaml** — Kubernetes resources (Deployment/Pod) for the scenario; namespace is patched at deploy time.
- **scenario_base.yaml** — Canonical Krkn scenario; the base class loads it, patches namespace (and overrides), and passes it to Kraken via `run_scenario()`. Optional extra YAMLs (e.g. `nginx_http.yaml` for application_outage) can live in the same folder.
- **conftest.py**: Re-exports fixtures from the lib modules and defines `pytest_addoption`, logging, and `repo_root`.
- **Adding a new scenario**: Use the scaffold script (see [CONTRIBUTING_TESTS.md](CONTRIBUTING_TESTS.md)) to create `scenarios/<name>/` with test file, `resource.yaml`, and `scenario_base.yaml`, or copy an existing scenario folder and adapt.
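A minimal sketch of what the test module for a new scenario looks like. The attribute values below are illustrative, and in the real suite the class extends `BaseScenarioTest`, which supplies namespace setup, workload deployment, and `run_scenario()`:

```python
# Illustrative only: copy an existing scenarios/<name>/test_<scenario>.py
# for the exact base-class import and attribute formats.
class TestMyScenario:  # extends BaseScenarioTest in the real suite
    WORKLOAD_MANIFEST = "resource.yaml"    # deployed into the ephemeral namespace
    SCENARIO_NAME = "my_scenario"          # folder name under scenarios/
    SCENARIO_TYPE = "pod_disruption"       # Krkn scenario type (hypothetical value)
    NAMESPACE_KEY_PATH = "config.namespace_pattern"  # key patched with the test namespace
```

The base class loads `scenario_base.yaml` from the same folder, patches `NAMESPACE_KEY_PATH` (and `OVERRIDES_KEY_PATH` if set), and hands the result to Kraken.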
## What is tested
Each test runs in an isolated ephemeral namespace; workloads are deployed automatically before the test and the namespace is deleted after (unless `--keep-ns-on-fail` is set and the test failed).
- **scenarios/pod_disruption/**
Pod disruption scenario. `resource.yaml` is a deployment with label `app=krkn-pod-disruption-target`; `scenario_base.yaml` is loaded and `namespace_pattern` is patched to the test namespace. The test:
1. Records baseline pod UIDs and restart counts.
2. Runs Kraken with the pod disruption scenario.
3. Asserts that chaos had an effect (UIDs changed or restart count increased).
4. Waits for pods to be Running and all containers Ready.
5. Asserts pod count is unchanged and all pods are healthy.
- **scenarios/application_outage/**
Application outage scenario (block Ingress/Egress to target pods, then restore). `resource.yaml` is the main workload (outage pod); `scenario_base.yaml` is loaded and patched with namespace (and duration/block as needed). Optional `nginx_http.yaml` is used by the traffic test. Tests include:
- **test_app_outage_block_restore_and_variants**: Happy path with default, exclude_label, and block variants (Ingress, Egress, both); Krkn exit 0, pods still Running/Ready.
- **test_network_policy_created_then_deleted**: Policy with prefix `krkn-deny-` appears during run and is gone after.
- **test_traffic_blocked_during_outage** (disabled, planned): Deploys nginx with label `scenario=outage`, port-forwards; during outage curl fails, after run curl succeeds.
- **test_bad_namespace_fails**: Scenario targeting a non-existent namespace causes Kraken to exit non-zero.
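The pod-disruption effect check above (step 3) boils down to comparing before/after snapshots of the pods; a sketch of that logic, not the suite's exact implementation:

```python
def chaos_had_effect(before: dict, after: dict) -> bool:
    """before/after map pod UID -> total container restart count.

    Chaos had an effect if any pod was replaced (its UID disappeared
    or a new UID appeared) or a surviving pod's restart count grew.
    """
    replaced = set(after) != set(before)
    restarted = any(
        after[uid] > before[uid] for uid in set(before) & set(after)
    )
    return replaced or restarted
```

Checking both conditions matters: a deleted pod in a Deployment comes back with a new UID, while a killed container in a surviving pod only bumps the restart count.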
## Configuration
- **pytest.ini**: Markers (`functional`, `pod_disruption`, `application_outage`, `no_workload`). Use `--timeout=300`, `--reruns=2`, `--reruns-delay=10` on the command line for full runs.
- **conftest.py**: Re-exports fixtures from `lib/k8s.py`, `lib/namespace.py`, `lib/deploy.py`, `lib/kraken.py` (e.g. `test_namespace`, `deploy_workload`, `k8s_core`, `wait_for_pods_running`, `run_kraken`, `build_config`). Configs are built from `CI/tests_v2/config/common_test_config.yaml` with monitoring disabled for local runs. Timeout constants in `lib/base.py` can be overridden via env vars.
- **Cluster access**: Resource reads and applies go through the Kubernetes Python client; `kubectl` is still used for `port-forward` and for running Kraken.
- The **existing** bash tests in `CI/tests/` and `CI/run.sh` are **unchanged**. They continue to run as before in GitHub Actions.
- This framework is **additive**. To run it in CI later, add a separate job or step that runs `pytest CI/tests_v2/ ...` from the repo root.
## Troubleshooting
- **`pytest.skip: Could not load kube config`** — No cluster or bad KUBECONFIG. Run `make -f CI/tests_v2/Makefile setup` (or `make setup` from `CI/tests_v2`) or check `kubectl cluster-info`.
- **KinD cluster creation hangs** — Docker is not running. Start Docker Desktop or run `systemctl start docker`.
- **`Bind for 0.0.0.0:9090 failed: port is already allocated`** — Another process (e.g. Prometheus) is using the port. The default dev config (`kind-config-dev.yml`) no longer maps host ports; if you use `KIND_CONFIG=kind-config.yml` or a custom config with `extraPortMappings`, free the port or switch to `kind-config-dev.yml`.
- **`TimeoutError: Pods did not become ready`** — Slow image pull or node resource limits. Increase `KRKN_TEST_READINESS_TIMEOUT` or check node resources.
- **`ModuleNotFoundError: pytest_rerunfailures`** — Missing test deps. Run `pip install -r CI/tests_v2/requirements.txt` (or `make setup`).
- **Stale `krkn-test-*` namespaces** — Left over from a previous crashed run. They are auto-cleaned at session start (older than 30 min). To remove cluster and reports: `make -f CI/tests_v2/Makefile clean`.
- **Wrong cluster targeted** — Multiple kube contexts. Use `--require-kind` to skip unless context is kind/minikube, or set context explicitly: `kubectl config use-context kind-ci-krkn`.
- **`OSError: [Errno 48] Address already in use` when running tests in parallel** — Kraken normally starts an HTTP status server on port 8081. With `-n auto` (pytest-xdist), multiple Kraken processes would all try to bind to 8081. The test framework disables this server (`publish_kraken_status: False`) in the generated config, so parallel runs should not hit this. If you see it, ensure you're using the framework's `build_config` and not a config that has `publish_kraken_status: True`.
```yaml
prometheus_url:                    # The prometheus url/route is automatically obtained on OpenShift; set it when the distribution is Kubernetes.
prometheus_bearer_token:           # The bearer token is automatically obtained on OpenShift; set it when the distribution is Kubernetes. Needed to authenticate with prometheus.
uuid:                              # uuid for the run; generated by default if not set.
enable_alerts: True                # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error.
enable_metrics: True
alert_profile: config/alerts.yaml  # Path or URL to alert profile with the prometheus queries.
metrics_profile: config/metrics-report.yaml
check_critical_alerts: True        # When enabled, checks prometheus for critical alerts firing post chaos.
tunings:
  wait_duration: 6                 # Duration to wait between each chaos scenario.
  iterations: 1                    # Number of times to execute the scenarios.
  daemon_mode: False               # When True, iterations are set to infinity, i.e. kraken causes chaos forever.
telemetry:
  enabled: False                   # enable/disable the telemetry collection feature
  api_url: https://yvnn4rfoi7.execute-api.us-west-2.amazonaws.com/test  # telemetry service endpoint
  username: $TELEMETRY_USERNAME    # telemetry service username
  password: $TELEMETRY_PASSWORD    # telemetry service password
elastic:
  elastic_url: "https://192.168.39.196"  # To track results in elasticsearch, set the server url here; telemetry details are posted when url and index are not blank.
  elastic_port: 32766
  username: "elastic"
  password: "test"
  metrics_index: "krkn-metrics"
  alerts_index: "krkn-alerts"
  telemetry_index: "krkn-telemetry"
health_checks:                     # Utilizing health check endpoints to observe application behavior during chaos injection.
  interval:                        # Interval in seconds to perform health checks; default is 2 seconds.
  config:                          # List of health check configurations for applications.
    - url:                         # Application endpoint.
      bearer_token:                # Bearer token for authentication, if any.
      auth:                        # Authentication credentials (username, password) in tuple format, e.g. ("admin","secretpassword").
      exit_on_failure:             # If True, exit when a health check fails for an application; values: True/False.
```
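The health-check options above map to a simple polling loop. A sketch of the per-endpoint decision logic, with the HTTP call injected so it is testable offline — the real implementation in Krkn differs:

```python
def check_once(url, bearer_token=None, auth=None, probe=None):
    """Return True if the endpoint answers with an HTTP 2xx status.

    `probe` performs the request, e.g.
    lambda u, **kw: requests.get(u, **kw).status_code;
    it is injected here (hypothetical design) so the logic runs without a cluster.
    """
    headers = {}
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    status = probe(url, headers=headers, auth=auth)
    return 200 <= status < 300

def run_health_checks(configs, probe, on_failure):
    """One pass over the configured endpoints, honoring exit_on_failure."""
    for c in configs:
        ok = check_once(c["url"], c.get("bearer_token"), c.get("auth"), probe)
        if not ok and c.get("exit_on_failure"):
            on_failure(c["url"])
```

In the real feature this loop repeats every `interval` seconds for the duration of the chaos run.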
Functional test for application outage scenario (block network to target pods, then restore).
Equivalent to CI/tests/test_app_outages.sh with proper assertions.
The main happy-path test reuses one namespace and workload for multiple scenario runs (default, exclude_label, block variants); other tests use their own ephemeral namespace as needed.
"""Default, exclude_label, and block-type variants (Ingress, Egress, both) run successfully in one namespace; each run restores and pods stay ready."""