* Add PVC outage scenario plugin to manage PVC annotations during outages
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Remove PvcOutageScenarioPlugin as it is no longer needed; refactor PvcScenarioPlugin to include rollback functionality for temporary file cleanup during PVC scenarios.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor rollback_data handling in PvcScenarioPlugin to use str() instead of json.dumps() for resource_identifier.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Import json module in PvcScenarioPlugin for decoding rollback data from resource_identifier.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* feat: Encode rollback data in base64 format for resource_identifier in PvcScenarioPlugin to enhance data handling and security.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
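A minimal sketch of the base64 round trip described above, assuming the rollback payload is a JSON-serializable dict (the helper names are illustrative, not the plugin's actual API):
```python
import base64
import json

def encode_rollback_data(rollback_data: dict) -> str:
    # JSON-serialize, then base64-encode so the payload travels as a plain string
    raw = json.dumps(rollback_data).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

def decode_rollback_data(resource_identifier: str) -> dict:
    # Reverse of encode_rollback_data: base64-decode, then JSON-parse
    raw = base64.b64decode(resource_identifier.encode("ascii"))
    return json.loads(raw.decode("utf-8"))

payload = {"namespace": "default", "pod_name": "pvc-writer", "temp_file": "/mnt/data/kraken.tmp"}
token = encode_rollback_data(payload)
assert decode_rollback_data(token) == payload
```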
* refactor: Update logging level from debug to info for temp file operations in PvcScenarioPlugin to improve visibility of command execution.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add unit tests for PvcScenarioPlugin methods and enhance test coverage
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add missed lines test cov
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor tests in test_pvc_scenario_plugin.py to use unittest framework and enhance test coverage for to_kbytes method
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Enhance rollback_temp_file test to verify logging of errors for invalid data
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor tests in TestPvcScenarioPluginRun to clarify pod_name behavior and enhance logging verification in rollback_temp_file tests
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactored imports
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor assertions in test cases to use assertEqual for consistency
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
* Add rollback functionality to ServiceHijackingScenarioPlugin
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor rollback data handling in ServiceHijackingScenarioPlugin as a JSON string
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Update rollback data handling in ServiceHijackingScenarioPlugin to decode directly from resource_identifier
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Add import statement for JSON handling in ServiceHijackingScenarioPlugin
This change introduces an import statement for the JSON module to facilitate the decoding of rollback data from the resource identifier.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* feat: Enhance rollback data handling in ServiceHijackingScenarioPlugin by encoding and decoding as base64 strings.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add rollback tests for ServiceHijackingScenarioPlugin
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor rollback tests for ServiceHijackingScenarioPlugin to improve error logging and remove temporary path dependency
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Remove redundant import of yaml in test_service_hijacking_scenario_plugin.py
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor rollback tests for ServiceHijackingScenarioPlugin to enhance readability and consistency
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
* Add Kubevirt VM outage tests with improved mocking and validation scenarios in test_kubevirt_vm_outage.py
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor Kubevirt VM outage tests to improve time mocking and response handling
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Remove unused subproject reference for pvc_outage
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor Kubevirt VM outage tests to enhance time mocking and improve response handling
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Enhance VMI deletion test by mocking unchanged creationTimestamp to exercise timeout path
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
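A rough sketch of the idea behind that test, assuming the plugin polls the VMI's creationTimestamp to detect recreation; because the mock always returns the same timestamp, the wait loop falls through to the timeout path (names and the loop itself are illustrative, not the actual test code):
```python
import time
import unittest
from unittest import mock

class TestVmiDeletionTimeout(unittest.TestCase):
    def test_unchanged_creation_timestamp_times_out(self):
        # The mocked VMI read always returns the same creationTimestamp,
        # so the "VMI was recreated" check never passes.
        vmi = {"metadata": {"creationTimestamp": "2024-01-01T00:00:00Z"}}
        get_vmi = mock.Mock(return_value=vmi)

        def wait_for_recreation(original_ts: str, timeout: float, poll: float) -> bool:
            deadline = time.monotonic() + timeout
            while time.monotonic() < deadline:
                if get_vmi()["metadata"]["creationTimestamp"] != original_ts:
                    return True
                time.sleep(poll)
            return False

        self.assertFalse(wait_for_recreation("2024-01-01T00:00:00Z", timeout=0.05, poll=0.01))

if __name__ == "__main__":
    unittest.main()
```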
* Refactor Kubevirt VM outage tests to use dynamic timestamps and improve mock handling
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
* Add rollback functionality to SynFloodScenarioPlugin
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor rollback pod handling in SynFloodScenarioPlugin to handle pod names as a string
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Update resource identifier handling in SynFloodScenarioPlugin to use list format for rollback functionality
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Refactor chaos scenario configurations in config.yaml to comment out existing scenarios for clarity. Update rollback method in SynFloodScenarioPlugin to improve pod cleanup handling. Modify pvc_scenario.yaml with specific test values for better usability.
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
* Enhance rollback functionality in SynFloodScenarioPlugin by encoding pod names in base64 format for improved data handling.
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Add unit tests for SynFloodScenarioPlugin methods and rollback functionality
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor TestSynFloodRun and TestRollbackSynFloodPods to inherit from unittest.TestCase
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Refactor SynFloodRun tests to use tempfile for scenario file creation and improve error logging in rollback functionality
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
---------
Signed-off-by: sanjay7178 <saisanjay7660@gmail.com>
Signed-off-by: Sai Sanjay <saisanjay7660@gmail.com>
* Validate version file format
* Add validation for context dir, execute all files by default
* Consolidate execute and cleanup, rename with .executed instead of
removing
* Respect auto_rollback config
* Add cleanup back but only for scenarios that succeeded
---------
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
* feat: Add exclude_label feature to pod network outage scenarios
This feature enables filtering out specific pods from network outage
chaos testing based on label selectors. Users can now target all pods
in a namespace except critical ones by specifying exclude_label.
- Added exclude_label parameter to list_pods() function
- Updated get_test_pods() to pass the exclude parameter
- Added exclude_label field to all relevant plugin classes
- Updated schema.json with the new parameter
- Added documentation and examples
- Created comprehensive unit tests
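An illustrative sketch of the filtering idea, assuming simple key=value selectors (this is not the krkn-lib list_pods() implementation):
```python
from typing import Dict, List, Optional

def matches(labels: Dict[str, str], selector: Optional[str]) -> bool:
    # Minimal "key=value" selector match; real label selectors are richer than this
    if not selector:
        return False
    key, _, value = selector.partition("=")
    return labels.get(key) == value

def list_pod_names(pods: List[dict], label_selector: Optional[str] = None,
                   exclude_label: Optional[str] = None) -> List[str]:
    # Keep pods matching label_selector, then drop any matching exclude_label
    names = []
    for pod in pods:
        labels = pod.get("labels", {})
        if label_selector and not matches(labels, label_selector):
            continue
        if matches(labels, exclude_label):
            continue
        names.append(pod["name"])
    return names

pods = [
    {"name": "ui-1", "labels": {"component": "ui"}},
    {"name": "ui-critical", "labels": {"component": "ui", "tier": "critical"}},
]
print(list_pod_names(pods, label_selector="component=ui", exclude_label="tier=critical"))
# -> ['ui-1']
```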
Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
* krkn-lib update
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
* removed plugin schema
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
---------
Signed-off-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Co-authored-by: Priyansh Saxena <130545865+Transcendental-Programmer@users.noreply.github.com>
* Adding node_label_selector for pod scenarios
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* using kubernetes function, adding node_name and removing extra config
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* adding CI test for custom pod scenario
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* fixing comment
* adding test to workflow
* adding list parsing logic for krkn hub
* parsing not needed, as input is always []
---------
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* Add rollback config
* Inject rollback handler to scenario plugin
* Add Serializer
* Add decorator
* Add test with SimpleRollbackScenarioPlugin
* Add logger for verbose debug flow
* Resolve review comment
- remove additional rollback config in config.yaml
- set KUBECONFIG to ~/.kube/config in test_rollback
* Simplify set_rollback_context_decorator
* Fix integration of rollback_handler in __load_plugins
* Refactor rollback.config module
- make it a singleton class with a register method to construct
- RollbackContext ( <timestamp>-<run_uuid> )
- add get_rollback_versions_directory for modularizing the directory
format
* Adapt new rollback.config
* Refactor serialization
- respect rollback_callable_name
- refactor _parse_rollback_callable_code
- refine VERSION_FILE_TEMPLATE
* Add get_scenario_rollback_versions_directory in RollbackConfig
* Add rollback in ApplicationOutageScenarioPlugin
* Add RollbackCallable and RollbackContent for type annotation
* Refactor rollback_handler with limited arguments
* Refactor the serialization for rollback
- limited arguments: callback and rollback_content, just these two!
- always construct lib_openshift and lib_telemetry in version file
- add _parse_rollback_content_definition for retrieving scenario-specific
rollback_content
- remove utils for formatting variadic functions
* Refactor application outage scenario
* Fix test_rollback
* Make RollbackContent with static fields
* simplify serialization
- Remove all unused format dynamic arguments utils
- Add jinja template for version file
- Replace set_context for serialization with passing version to serialize_callable
* Add rollback for hogs scenario
* Fix version file full path based on feedback
- {versions_directory}/<timestamp(ns)>-<run_uuid>/{scenario_type}-<timestamp(ns)>-<random_hash>.py
* Fix scenario plugins after rebase
* Add execute rollback
* Add CLI for list and execute rollback
* Replace subprocess with importlib
* Fix error after rebase
* fixup! Fix docstring
- Add telemetry_ocp in execute_rollback docstring
- Remove rollback_config in create_plugin docstring
- Remove scenario_types in set_rollback_callable docstring
* fixup! Replace os.urandom with krkn_lib.utils.get_random_string
* fixup! Add missing telemetry_ocp for execute_rollback_version_files
* fixup! Remove redundant import
- Remove duplicate TYPE_CHECKING in handler module
- Remove cast in signal module
- Remove RollbackConfig in scenario_plugin_factory
* fixup! Replace sys.exit(1) with return
* fixup! Remove duplicate rollback_network_policy
* fixup! Decouple Serializer initialization
* fixup! Rename callback to rollback_callable
* fixup! Refine comment for constructing AbstractScenarioPlugin with
placeholder value
* fixup! Add version in docstring
* fixup! Remove uv.lock
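A sketch of how the version file path described above could be assembled, using secrets.token_hex as a stand-in for krkn_lib.utils.get_random_string (directory and function names are illustrative):
```python
import os
import secrets
import time
import uuid

def version_file_path(versions_directory: str, run_uuid: str, scenario_type: str) -> str:
    # {versions_directory}/<timestamp(ns)>-<run_uuid>/{scenario_type}-<timestamp(ns)>-<random_hash>.py
    run_dir = f"{time.time_ns()}-{run_uuid}"
    random_hash = secrets.token_hex(4)  # stand-in for krkn_lib.utils.get_random_string
    file_name = f"{scenario_type}-{time.time_ns()}-{random_hash}.py"
    return os.path.join(versions_directory, run_dir, file_name)

print(version_file_path("/tmp/krkn/rollback", str(uuid.uuid4()), "application_outages"))
```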
This commit updates the fedora tools image reference used by the network scenarios
to the one hosted in the krkn-chaos quay org. This also fixes the issues with
RHACS flagging runs when using the latest tag by using the tools tag instead.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* Disable SSL verification for IBM node scenarios and fix node reboot scenario
Signed-off-by: Sahil Shah <sahshah@redhat.com>
* adding disable ssl as a scenario parameter for ibmcloud
Signed-off-by: Sahil Shah <sahshah@redhat.com>
---------
Signed-off-by: Sahil Shah <sahshah@redhat.com>
Fix the logic in the disk disruption scenario so that it returns the right set of disks to be off-lined.
Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
- Implemented methods for detaching and attaching disks to baremetal nodes.
- Added a new scenario `node_disk_detach_attach_scenario` to manage disk operations.
- Updated the YAML configuration to include the new scenario with disk details.
Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
Introduce a delay in network scenarios prior to imposing restrictions.
This ensures that chaos test case jobs are scheduled before any restrictions are put in place.
Signed-off-by: Yogananth Subramanian <ysubrama@redhat.com>
This will enable users and organizations to share their Krkn adoption
journey for their chaos engineering use cases.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit adds a policy on how Krkn follows best practices and
addresses security vulnerabilities.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* Hog scenario porting from arcaflow to native (#748)
* added new native hog scenario
* removed arcaflow dependency + legacy hog scenarios
* config update
* changed hog configuration structure + added average samples
* fix on cpu count
* removes tripledes warning
* changed selector format
* changed selector syntax
* number of nodes option
* documentation
* functional tests
* exception handling on hog deployment thread
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* adding vsphere updates to non native
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* adding node id to affected node
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Fixed the spelling mistake
Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* adding v4.0.8 version (#756)
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Add autodetecting distribution (#753)
Used is_openshift function from krkn lib
Remove distribution from config
Remove distribution from documentation
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* initial version of health checks
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changes for appending success response and health check config format
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changes include health check doc and exit_on_failure config
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Update config.yaml
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Added the health check config in functional test config
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Modified the health checks documentation
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changes for debugging the functional test failing
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* changed the code for debugging in run_test.sh
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Debugging
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Removed the functional test running line
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Removing the health check config in common_test_config for debugging
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Fixing functional test failure
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Removing the changes that are added for debugging
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* few modifications
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Renamed timestamp
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changed the start timestamp and end timestamp data type to datetime
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* passing the health check response as HealthCheck object
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Updated the krkn-lib version in requirements.txt
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
* Changed the coverage
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
---------
Signed-off-by: kattameghana <meghanakatta8@gmail.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
Signed-off-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
Co-authored-by: Tullio Sebastiani <tsebastiani@users.noreply.github.com>
Co-authored-by: Paige Patton <prubenda@redhat.com>
Co-authored-by: Meghana Katta <mkatta@mkatta-thinkpadt14gen4.bengluru.csb>
Co-authored-by: Paige Patton <64206430+paigerube14@users.noreply.github.com>
Co-authored-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
This commit adds recommendation to test and ensure Pod Disruption
Budgets are set for critical applications to avoid downtime.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
Used is_openshift function from krkn lib
Remove distribution from config
Remove distribution from documentation
Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
This is needed to avoid issues due to comparing two different data types:
TypeError: Invalid comparison between dtype=float64 and str. This commit also
avoids setting defaults for the thresholds to make it mandatory for the users
to define them as they play a key role in determining the outliers.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
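A small reproduction of the dtype mismatch and the coercion fix, assuming pandas-backed threshold comparisons (values here are illustrative):
```python
import pandas as pd

values = pd.Series([0.01, 0.04, 0.35])  # dtype=float64
threshold = "0.25"                      # threshold arriving as a string

# values > threshold  # raises: TypeError: Invalid comparison between dtype=float64 and str
outliers = values > float(threshold)    # cast to a common type before comparing
print(outliers.tolist())                # [False, False, True]
```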
* Document how to use Google's credentials associated with a user account
Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
* Change API from 'Google API Client' to 'Google Cloud Python Client'
According to the 'Google API Client' GH page:
```
This library is considered complete and is in maintenance mode. This means
that we will address critical bugs and security issues but will not add any
new features.
This library is officially supported by Google. However, the maintainers of
this repository recommend using Cloud Client Libraries for Python, where
possible, for new code development.
```
So change the code accordingly to use the 'Google Cloud Python Client'.
Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
---------
Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
* Add support for user-provided default network ACL
Signed-off-by: henrick <self@thehenrick.com>
* Add logs to notify the user when their provided ACL is used
Signed-off-by: henrick <self@thehenrick.com>
* Update docs to include optional default_acl_id parameter in zone_outage
Signed-off-by: henrick <self@thehenrick.com>
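A hypothetical sketch of the selection logic, assuming the scenario config exposes the new default_acl_id parameter (helper names and the fallback behavior are assumptions, not the actual implementation):
```python
import logging

def resolve_default_acl(scenario_config: dict, create_default_acl) -> str:
    # Prefer the user-provided ACL; otherwise fall back to creating one
    acl_id = scenario_config.get("default_acl_id")
    if acl_id:
        logging.info("Using user-provided default network ACL: %s", acl_id)
        return acl_id
    logging.info("No default_acl_id provided, creating a default network ACL")
    return create_default_acl()

print(resolve_default_acl({"default_acl_id": "acl-0123456789abcdef0"}, lambda: "acl-new"))
```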
---------
Signed-off-by: henrick <self@thehenrick.com>
Co-authored-by: henrick <self@thehenrick.com>
This is needed for the TRT/component readiness integration to improve
dashboard readability and tie results back to chaos.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* add workflows
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* update readme
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* rm my kubeconfig path
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* add workflow details to readme
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* mv arcaflow to utils
Signed-off-by: Matthew F Leader <mleader@redhat.com>
---------
Signed-off-by: Matthew F Leader <mleader@redhat.com>
* adding aws bare metal
* no reservations found
---------
Co-authored-by: Auto User <auto@users.noreply.github.com>
* adding elastic set to none
Signed-off-by: Auto User <auto@users.noreply.github.com>
* too many ls
---------
Signed-off-by: Auto User <auto@users.noreply.github.com>
Co-authored-by: Auto User <auto@users.noreply.github.com>
This option is enabled only for the node_stop_start scenario where the
user will want to stop the node for a certain duration to understand
the impact before starting the node back up. This commit also bumps
the timeout for the scenario to 360 seconds from 120 seconds to make
sure there's enough time for the node to get to Ready state from the
Kubernetes side after the node is started on the infra side.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit removes the instructions on running krkn as kubernetes
deployment as it is not supported/maintained and also not recommended.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit:
- Also switches the rate queries severity to critical as 5%
threshold is high for low scale/density clusters and needs to be flagged.
- Adds rate queries to openshift alerts file
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
This commit also deprecates building container image for ppc64le as it
is not actively maintained. We will add support if users request it
in the future.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
We are not using it in the krkn code base and removing it fixes one
of the license issues reported by FOSSA. This commit also removes
setting up dependencies using docker/podman compose as it is not actively
maintained.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
Avoids architecture issues such as "bash: /usr/bin/az: cannot execute: required file not found"
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* fixes system and oc vulnerabilities detected by trivy
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
* updated base image to run as krkn user instead of root
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
---------
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Added network_chaos to the plugin step, made the job wait time based on the test duration, and set the default wait_time to 30s.
Signed-off-by: yogananth subramanian <ysubrama@redhat.com>
This will make sure oc and kubectl clients are accessible for users
with both /usr/bin and /usr/local/bin paths set on the host.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
Terminal output changed to use a JSON structure.
The JSON output file names are in the format
recommender_namespace_YYYY-MM-DD_HH-MM-SS.
The path to the JSON file can be specified. The default path is
kraken/utils/chaos_recommender/recommender_output.
Signed-off-by: jtydlcak <139967002+jtydlack@users.noreply.github.com>
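A sketch of a writer following the documented file name format (the .json extension and the helper name are assumptions):
```python
import json
import os
from datetime import datetime

def write_recommendation(namespace: str, recommendation: dict,
                         output_dir: str = "kraken/utils/chaos_recommender/recommender_output") -> str:
    # File name follows the documented recommender_<namespace>_YYYY-MM-DD_HH-MM-SS format
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"recommender_{namespace}_{timestamp}.json")
    with open(path, "w") as fh:
        json.dump(recommendation, fh, indent=2)
    return path

print(write_recommendation("openshift-console", {"scenario": "pod_network_outage"},
                           output_dir="/tmp/recommender_output"))
```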
This covers use case where user wants to just check for critical alerts
post chaos without having to enable the alerts evaluation feature which
evaluates prom queries specified in an alerts file.
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
* taking out start and end time
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* adding only break when alert fires
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* fail at end if alert had fired
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* adding new krkn-lib function with no range
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* updating requirements to new krkn-lib
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
---------
Signed-off-by: Paige Rubendall <prubenda@redhat.com>
* Fix github.io link in README.md
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* Fix krknChaos-hub link in README.md
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* Fix kube-burner link in README.md
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---------
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
The scenario introduces network latency, packet loss, and bandwidth restriction in the Pod's network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
The example config below applies ingress traffic shaping to the openshift console.
````
- id: pod_ingress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which the filter needs to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to ingress traffic from the pod.
````
* basic structure working
* config and options refactoring
nits and changes
* removed unused function with typo + fixed duration
* removed unused arguments
* minor fixes
* adding service disruption
* fixing kill services
* service log changes
* removing extra logging
* adding daemon set
* adding service disruption name changes
* cerberus config back
* bad string
The scenario introduces network latency, packet loss, and bandwidth restriction in the Pod's network interface.
The purpose of this scenario is to observe faults caused by random variations in the network.
The example config below applies egress traffic shaping to the openshift console.
````
- id: pod_egress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which the filter needs to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to egress traffic from the pod.
````
This makes sure latest clients are installed and used:
- This will avoid compatibility issues with the server
- Fixes security vulnerabilities and CVEs
This commit:
- Also sets appropriate severity to avoid false failures for the
test cases, especially given that these are monitored during the chaos
vs post chaos. Critical alerts are all monitored post chaos with a few
monitored during the chaos that represent overall health and performance
of the service.
- Renames Alerts to SLOs validation
Metrics reference: f09a492b13/cmd/kube-burner/ocp-config/alerts.yml
* Include check for inside k8s scenario
* Include check for inside k8s scenario (2)
* Include check for inside k8s scenario (3)
* Include check for inside k8s scenario (4)
This is the first step towards the goal to only have metrics tracking
the overall health and performance of the component/cluster. For instance,
for etcd disruption scenarios, leader elections are expected; we should instead
track etcd leader availability and fsync latency under the critical category vs leader
elections.
Pod network outage chaos scenario blocks traffic at pod level irrespective of the network policy used.
With the current network policies, it is not possible to explicitly block ports which are enabled
by an allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules
to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
The example config below blocks access to the openshift console.
````
- id: pod_network_outage
config:
namespace: openshift-console
direction:
- ingress
ingress_ports:
- 8443
label_selector: 'component=ui'
````
* kubeconfig management for arcaflow + hogs scenario refactoring
* kubeconfig authentication parsing refactored to support arcaflow kubernetes deployer
* reimplemented all the hog scenarios to allow multiple parallel containers of the same scenarios
(e.g. to stress two or more nodes in the same run simultaneously)
* updated documentation
* removed sysbench scenarios
* recovered cpu hogs
* updated requirements.txt
* updated config.yaml
* added gitleaks file for test fixtures
* imported sys and logging
* removed config_arcaflow.yaml
* updated readme
* refactored arcaflow documentation entrypoint
Also renames retry_wait to expected_recovery_time to make it clear that
Kraken will exit 1 if the container doesn't recover within the expected
time.
Fixes https://github.com/redhat-chaos/krkn/issues/414
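A sketch of the exit-1-on-timeout behavior, assuming a simple polling loop (names and the poll interval are illustrative, not Kraken's actual implementation):
```python
import logging
import sys
import time

def wait_for_recovery(is_recovered, expected_recovery_time: float, poll_interval: float = 5.0) -> None:
    # Poll until the container recovers; exit 1 if it does not recover in time
    deadline = time.monotonic() + expected_recovery_time
    while time.monotonic() < deadline:
        if is_recovered():
            logging.info("container recovered")
            return
        time.sleep(poll_interval)
    logging.error("container did not recover within %ss", expected_recovery_time)
    sys.exit(1)
```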
This commit enables users to opt in to check for critical alerts firing
in the cluster post chaos at the end of each scenario. A chaos scenario is
considered failed if the cluster is unhealthy, in which case the user can
start debugging to fix and harden respective areas.
Fixes https://github.com/redhat-chaos/krkn/issues/410
Moving the content around installing kraken using helm to the
chaos in practice section of the guide to showcase how startx-lab
is deploying and leveraging Kraken.
* Added some bits and pieces to the krkn k8s installation to make it easier
* updated k8s/Oc installation documentation
* gitignore
* doc reorg
* fixed numbering + removed italic
Co-authored-by: Tullio Sebastiani <tullio.sebastiani@x3solutions.it>
Previously the test was looking for the master label.
Recent Kubernetes uses the control-plane label instead.
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
As it says:
Pod scenarios have been removed, please use plugin_scenarios
with the kill-pods configuration instead.
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
Documentation says we default to ~ for looking up the kubernetes config
but then we set /root everywhere. Fixed the config to really look for ~.
Should solve #327.
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
This release includes the changes needed for the customer as well as a
number of other fixes and enhancements:
- Support for VMware node scenarios
- Support for ingress traffic shaping
- Other changes can be found at https://github.com/redhat-chaos/krkn/releases/tag/v1.1.0
<!-- Provide a brief description of the changes made in this PR. -->
## Related Tickets & Documents
If there is no related issue, please create one and start the conversation there.
- Related Issue #:
- Closes #:
# Documentation
- [ ] **Is documentation needed for this update?**
If checked, a documentation PR must be created and merged in the [website repository](https://github.com/krkn-chaos/website/).
## Related Documentation PR (if applicable)
<!-- Add the link to the corresponding documentation PR in the website repository -->
# Checklist before requesting a review
- [ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See [contributing guidelines](https://krkn-chaos.dev/docs/contribution-guidelines/)
See [testing your changes](https://krkn-chaos.dev/docs/developers-guide/testing-changes/) and run on any Kubernetes or OpenShift cluster to validate your changes
- [ ] I have performed a self-review of my code by running krkn and specific scenario
- [ ] If it is a core feature, I have added thorough unit tests with above 80% coverage
*REQUIRED*:
Description of combination of tests performed and output of run
```bash
python run_kraken.py
...
<---insert test results output--->
```
OR
```bash
python -m coverage run -a -m unittest discover -s tests -v
```
This is a list of organizations that have publicly acknowledged usage of Krkn and shared details of how they are leveraging it in their environment for chaos engineering use cases. Do you want to add yourself to this list? Please fork the repository and open a PR with the required change.
| Organization | Since | Website | Use-Case |
|:-|:-|:-|:-|
| MarketAxess | 2024 | https://www.marketaxess.com/ | Kraken enables us to achieve our goal of increasing the reliability of our cloud products on Kubernetes. The tool allows us to automatically run various chaos scenarios, identify resilience and performance bottlenecks, and seamlessly restore the system to its original state once scenarios finish. These chaos scenarios include pod disruptions, node (EC2) outages, simulating availability zone (AZ) outages, and filling up storage spaces like EBS and EFS. The community is highly responsive to requests and works on expanding the tool's capabilities. MarketAxess actively contributes to the project, adding features such as the ability to leverage existing network ACLs and proposing several feature improvements to enhance test coverage. |
| Red Hat Openshift | 2020 | https://www.redhat.com/ | Kraken is a highly reliable chaos testing tool used to ensure the quality and resiliency of Red Hat Openshift. The engineering team runs all the test scenarios under Kraken on different cloud platforms on both self-managed and cloud services environments prior to the release of a new version of the product. The team also contributes to the Kraken project consistently which helps the test scenarios to keep up with the new features introduced to the product. Inclusion of this test coverage has contributed to gaining the trust of new and existing customers of the product. |
| IBM | 2023 | https://www.ibm.com/ | While working on AI for Chaos Testing at IBM Research, we closely collaborated with the Kraken (Krkn) team to advance intelligent chaos engineering. Our contributions included developing AI-enabled chaos injection strategies and integrating reinforcement learning (RL)-based fault search techniques into the Krkn tool, enabling it to identify and explore system vulnerabilities more efficiently. Kraken stands out as one of the most user-friendly and effective tools for chaos engineering, and the Kraken team’s deep technical involvement played a crucial role in the success of this collaboration—helping bridge cutting-edge AI research with practical, real-world system reliability testing. |
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set.
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error.
alert_profile: config/alerts # Path to alert profile with the prometheus queries.
enable_alerts: True # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
enable_metrics: True
alert_profile: config/alerts.yaml # Path or URL to alert profile with the prometheus queries
metrics_profile: config/metrics-report.yaml
check_critical_alerts: True # Check for critical alerts firing in the cluster post chaos.
tunings:
  wait_duration: 6 # Duration to wait between each chaos scenario.
  iterations: 1 # Number of times to execute the scenarios.
  daemon_mode: False # Iterations are set to infinity which means that kraken will cause chaos forever.
telemetry:
  enabled: False # enable/disable the telemetry collection feature
  api_url: https://yvnn4rfoi7.execute-api.us-west-2.amazonaws.com/test # telemetry service endpoint
  username: $TELEMETRY_USERNAME # telemetry service username
  password: $TELEMETRY_PASSWORD # telemetry service password
elastic_url: "https://192.168.39.196" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index are not blank
elastic_port: 32766
username: "elastic"
password: "test"
metrics_index: "krkn-metrics"
alerts_index: "krkn-alerts"
telemetry_index: "krkn-telemetry"
health_checks: # Utilizing health check endpoints to observe application behavior during chaos injection.
  interval: # Interval in seconds to perform health checks, default value is 2 seconds
  config: # Provide list of health check configurations for applications
    - url: # Provide application endpoint
      bearer_token: # Bearer token for authentication if any
      auth: # Provide authentication credentials (username, password) in tuple format if any, ex: ("admin","secretpassword")
      exit_on_failure: # If value is True, exits when a health check fails for an application; values can be True/False
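A minimal sketch of a single health-check probe using the fields above, assuming the requests library is available (hypothetical helper, not krkn's actual health-check code):
```python
import requests  # assumed available in the environment

def probe(url: str, bearer_token: str = None, auth: tuple = None, timeout: float = 5.0) -> bool:
    # One health-check probe mirroring the config fields above (url, bearer_token, auth)
    headers = {"Authorization": f"Bearer {bearer_token}"} if bearer_token else {}
    try:
        response = requests.get(url, headers=headers, auth=auth, timeout=timeout)
        return response.ok
    except requests.RequestException:
        return False

print(probe("https://example.com/healthz", auth=("admin", "secretpassword")))
```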
["${PAYLOAD_PATCH_1//[$'\t\r\n ']}"=="${OUT_PATCH//[$'\t\r\n ']}"]&&echo"Step 1 PATCH Payload OK"||(echo"Payload did not match. Test failed."&&exit 1)
["$OUT_STATUS_CODE"=="$STATUS_CODE_PATCH_1"]&&echo"Step 1 PATCH Status Code OK"||(echo"Step 1 PATCH status code did not match. Test failed."&&exit 1)
["$OUT_CONTENT"=="$TEXT_MIME"]&&echo"Step 1 PATCH MIME OK"||(echo" Step 1 PATCH MIME did not match. Test failed."&&exit 1)
# wait for the next step
sleep 16
#Checking Step 2 GET on /list/index.php
OUT_GET="`curl -X GET -s $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X GET -s -o /dev/null -I -w "%{content_type}"$SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X GET -s -o /dev/null -I -w "%{http_code}"$SERVICE_URL/list/index.php`
["${PAYLOAD_GET_2//[$'\t\r\n ']}"=="${OUT_GET//[$'\t\r\n ']}"]&&echo"Step 2 GET Payload OK"||(echo"Step 2 GET Payload did not match. Test failed."&&exit 1)
["$OUT_STATUS_CODE"=="$STATUS_CODE_GET_2"]&&echo"Step 2 GET Status Code OK"||(echo"Step 2 GET status code did not match. Test failed."&&exit 1)
["$OUT_CONTENT"=="$JSON_MIME"]&&echo"Step 2 GET MIME OK"||(echo" Step 2 GET MIME did not match. Test failed."&&exit 1)
#Checking Step 2 POST on /list/index.php
OUT_POST="`curl -s -X POST $SERVICE_URL/list/index.php`"
OUT_CONTENT=`curl -X POST -s -o /dev/null -I -w "%{content_type}"$SERVICE_URL/list/index.php`
OUT_STATUS_CODE=`curl -X POST -s -o /dev/null -I -w "%{http_code}"$SERVICE_URL/list/index.php`
["${PAYLOAD_POST_2//[$'\t\r\n ']}"=="${OUT_POST//[$'\t\r\n ']}"]&&echo"Step 2 POST Payload OK"||(echo"Step 2 POST Payload did not match. Test failed."&&exit 1)
["$OUT_STATUS_CODE"=="$STATUS_CODE_POST_2"]&&echo"Step 2 POST Status Code OK"||(echo"Step 2 POST status code did not match. Test failed."&&exit 1)
["$OUT_CONTENT"=="$TEXT_MIME"]&&echo"Step 2 POST MIME OK"||(echo" Step 2 POST MIME did not match. Test failed."&&exit 1)
Krkn (Kraken) is a chaos engineering tool for Kubernetes/OpenShift clusters. It injects deliberate failures to validate cluster resilience. Plugin-based architecture with multi-cloud support (AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack).
As contributors, maintainers, and participants in the CNCF community, and in the interest of fostering
an open and welcoming community, we pledge to respect all people who participate or contribute
through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, attending conferences or events, or engaging in other community or project activities.
We are committed to making participation in the CNCF community a harassment-free experience for everyone, regardless of age, body size, caste, disability, ethnicity, level of experience, family status, gender, gender identity and expression, marital status, military or veteran status, nationality, personal appearance, race, religion, sexual orientation, socioeconomic status, tribe, or any other dimension of diversity.
## Scope
This code of conduct applies:
* within project and community spaces,
* in other spaces when an individual CNCF community participant's words or actions are directed at or are about a CNCF project, the CNCF community, or another CNCF community participant.
### CNCF Events
CNCF events that are produced by the Linux Foundation with professional events staff are governed by the Linux Foundation [Events Code of Conduct](https://events.linuxfoundation.org/code-of-conduct/) available on the event page. This is designed to be used in conjunction with the CNCF Code of Conduct.
## Our Standards
The CNCF Community is open, inclusive and respectful. Every member of our community has the right to have their identity respected.
Examples of behavior that contributes to a positive environment include but are not limited to:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
* Using welcoming and inclusive language
Examples of unacceptable behavior include but are not limited to:
* The use of sexualized language or imagery
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment in any form
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Violence, threatening violence, or encouraging others to engage in violent behavior
* Stalking or following someone without their consent
* Unwelcome physical contact
* Unwelcome sexual or romantic attention or advances
* Other conduct which could reasonably be considered inappropriate in a
professional setting
The following behaviors are also prohibited:
* Providing knowingly false or misleading information in connection with a Code of Conduct investigation or otherwise intentionally tampering with an investigation.
* Retaliating against a person because they reported an incident or provided information about an incident as a witness.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct.
By adopting this Code of Conduct, project maintainers commit themselves to fairly and consistently applying these principles to every aspect
of managing a CNCF project.
Project maintainers who do not follow or enforce the Code of Conduct may be temporarily or permanently removed from the project team.
## Reporting
For incidents occurring in the Kubernetes community, contact the [Kubernetes Code of Conduct Committee](https://git.k8s.io/community/committee-code-of-conduct) via <conduct@kubernetes.io>. You can expect a response within three business days.
For other projects, or for incidents that are project-agnostic or impact multiple CNCF projects, please contact the [CNCF Code of Conduct Committee](https://www.cncf.io/conduct/committee/) via <conduct@cncf.io>. Alternatively, you can contact any of the individual members of the [CNCF Code of Conduct Committee](https://www.cncf.io/conduct/committee/) to submit your report. For more detailed instructions on how to submit a report, including how to submit a report anonymously, please see our [Incident Resolution Procedures](https://github.com/cncf/foundation/blob/main/code-of-conduct/coc-incident-resolution-procedures.md). You can expect a response within three business days.
For incidents occurring at CNCF event that is produced by the Linux Foundation, please contact <eventconduct@cncf.io>.
## Enforcement
Upon review and investigation of a reported incident, the CoC response team that has jurisdiction will determine what action is appropriate based on this Code of Conduct and its related documentation.
For information about which Code of Conduct incidents are handled by project leadership, which incidents are handled by the CNCF Code of Conduct Committee, and which incidents are handled by the Linux Foundation (including its events team), see our [Jurisdiction Policy](https://github.com/cncf/foundation/blob/main/code-of-conduct/coc-committee-jurisdiction-policy.md).
## Amendments
Consistent with the CNCF Charter, any substantive changes to this Code of Conduct must be approved by the Technical Oversight Committee.
## Acknowledgements
This Code of Conduct is adapted from the Contributor Covenant
(http://contributor-covenant.org), version 2.0 available at
The governance model adopted here is heavily influenced by a set of CNCF projects, drawing
especially on [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md).
*For similar structures, some of the same wording from the Kubernetes governance document is borrowed to preserve
the originally intended meaning.*
## Principles
- **Open**: Krkn is an open source community.
- **Welcoming and respectful**: See [Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
- **Transparent and accessible**: Work and collaboration should be done in public.
Changes to the Krkn organization, Krkn code repositories, and CNCF related activities (e.g.
level of involvement, etc.) are done in public.
- **Merit**: Ideas and contributions are accepted according to their technical merit
and alignment with project objectives, scope and design principles.
## Code of Conduct
Krkn follows the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
Here is an excerpt:
> As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
## Maintainer Levels
### Contributor
Contributors are community members who take part in the project. Anyone can become a contributor by participating in discussions, reporting bugs, or contributing code or documentation.
#### Responsibilities:
* Be active in the community and adhere to the Code of Conduct.
* Report bugs and suggest new features.
* Contribute high-quality code and documentation.
### Member
Members are active contributors to the community. Members have demonstrated a strong understanding of the project's codebase and conventions.
#### Responsibilities:
* Review pull requests for correctness, quality, and adherence to project standards.
* Provide constructive and timely feedback to contributors.
* Ensure that all contributions are well-tested and documented.
* Work with maintainers to ensure a smooth and efficient release process.
### Maintainer
Maintainers are responsible for the overall health and direction of the project. They are long-standing contributors who have shown a deep commitment to the project's success.
#### Responsibilities:
* Set the technical direction and vision for the project.
* Manage releases and ensure the stability of the main branch.
* Make decisions on feature inclusion and project priorities.
* Mentor other contributors and help grow the community.
* Resolve disputes and make final decisions when consensus cannot be reached.
### Owner
Owners have administrative access to the project and are the final decision-makers.
#### Responsibilities:
* Manage the core team of maintainers and approvers.
* Set the overall vision and strategy for the project.
* Handle administrative tasks, such as managing the project's repository and other resources.
* Represent the project in the broader open-source community.
# Credits
Sections of this document have been borrowed from [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md).
This document lists the maintainers and committers of the Krkn project.
In short, maintainers are people who are in charge of the maintenance of the Krkn project. Committers are active community members who have shown that they are committed to the continuous development of the project through ongoing engagement with the community.
For a detailed description of the roles, see the [Governance](./GOVERNANCE.md) page.
| Name | GitHub Handle | Email | Role |
| --- | --- | --- | --- |
| Sahil Shah | [shahsahil264](https://github.com/shahsahil264) | sahshah@redhat.com | Member |
Note: It is mandatory for all Krkn community members to follow our [Code of Conduct](./CODE_OF_CONDUCT.md).
## Contributor Ladder
This project follows a contributor ladder model, where contributors can take on more responsibilities as they gain experience and demonstrate their commitment to the project.
The roles are:
* Contributor: A contributor to the community, whether with code, docs, or issues.
* Member: A contributor who is active in the community and reviews pull requests.
* Maintainer: A contributor who is responsible for the overall health and direction of the project.
* Owner: A contributor who has administrative ownership of the project.
[Krkn image on Quay](https://quay.io/repository/chaos-kubox/krkn?tab=tags&tag=latest)
[OpenSSF Best Practices](https://www.bestpractices.dev/projects/10548)

Chaos and resiliency testing tool for Kubernetes and OpenShift.
Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient to turbulent conditions.
### Workflow

### Demo
[Kraken Demo - Click to Watch!](https://youtu.be/LN-fZywp_mo "Kraken Demo - Click to Watch!")

### Chaos Testing Guide
[Guide](docs/index.md) encapsulates:
- Test methodology that needs to be embraced.
- Best practices that an OpenShift cluster, platform and applications running on top of it should take into account for best user experience, performance, resilience and reliability.
- Tooling.
- Scenarios supported.
- Test environment recommendations as to how and where to run chaos tests.
- Chaos testing in practice.
The guide is hosted at https://redhat-chaos.github.io/krkn.
### How to Get Started
Instructions on how to set up, configure and run Kraken can be found at [Installation](docs/installation.md) and in the [documentation](https://krkn-chaos.dev/docs/).
See the [getting started doc](docs/getting_started.md) for help on getting started with your own custom scenario or on editing current scenarios for your specific usage.
After installation, refer back to the below sections for supported scenarios and how to tweak the kraken config to load them on your cluster.
#### Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to [Kraken-hub](https://github.com/redhat-chaos/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
### Setting up infrastructure dependencies
Kraken indexes the metrics specified in the profile into Elasticsearch in addition to leveraging Cerberus for understanding the health of the Kubernetes/OpenShift cluster under test. More information on the features is documented below. The infrastructure pieces can be easily installed and uninstalled by running:
```
$ cd kraken
$ podman-compose up    # or: docker-compose up. Spins up the containers specified in the docker-compose.yml file present in the run directory.
$ podman-compose down  # or: docker-compose down. Deletes the containers installed.
```
This will manage the Cerberus and Elasticsearch containers on the host on which you are running Kraken.
**NOTE**: Make sure you have enough resources (memory and disk) on the machine on which the containers are running, as Elasticsearch is resource intensive. Cerberus monitors the system components by default; the [config](config/cerberus.yaml) can be tweaked to add application namespaces, routes and other components to monitor as well. The command will keep running until killed, since detached mode is not supported as of now.
### Config
Instructions on how to setup the config and the options supported can be found at [Config](docs/config.md).
It is important to make sure to check if the targeted component recovered from the chaos injection and also if the Kubernetes/OpenShift cluster is healthy as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation), or it can be installed from Kraken using the [instructions](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes the go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime, as that is a potential downtime in a customer's or user's environment as well. This is especially important during control plane chaos scenarios, including the API server, Etcd, Ingress etc. It can be enabled by setting `check_application_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml), provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml) (see the example snippet after this list).
- Leveraging [kube-burner](docs/alerts.md) alerting feature to fail the runs in case of critical alerts.
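A minimal sketch of the relevant Cerberus settings in the Kraken config (the URL shown is the local default used elsewhere in this document; adjust it for your deployment):
```
cerberus:
  cerberus_enabled: True                # consume the Cerberus go/no-go signal post chaos
  cerberus_url: http://0.0.0.0:8080     # where Cerberus publishes the go/no-go signal
  check_application_routes: False       # set True to fail the run on application route downtime
```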
### Signaling
In CI runs or any external job it is useful to stop Kraken once a certain test or state gets reached. We created a way to signal to kraken to pause the chaos or stop it completely using a signal posted to a port of your choice.
For example, if we have a test run loading the cluster and Kraken running separately, we want to know when to start/stop the Kraken run based on when the test run completes or reaches a certain loaded state.
More detailed information on enabling and leveraging this feature can be found [here](docs/signal.md).
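The signal-related knobs live in the main Kraken config; a minimal sketch using the options shown later in this document:
```
port: 8081                   # port on which the signal is served
publish_kraken_status: True  # kraken status can be accessed at http://0.0.0.0:8081
signal_state: RUN            # set to PAUSE to make kraken wait for a RUN signal before starting scenarios
signal_address: 0.0.0.0      # signal listening address
```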
### Performance monitoring
Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to finding the bottlenecks, as it is important to make sure the cluster is healthy in terms of both recovery and performance during/after the failure has been injected. Instructions on enabling it can be found [here](docs/performance_dashboards.md).
### Scraping and storing metrics long term
Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes them into Elasticsearch to be able to store and evaluate the state of the runs long term. The indexed metrics can be visualized with the help of Grafana. It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query Prometheus (installed by default in OpenShift) with the start and end timestamp of the run. Information on enabling and leveraging this feature can be found [here](docs/metrics.md).
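Metrics profile entries follow the kube-burner format excerpted later in this document; an illustrative fragment (queries taken from the profiles below):
```
metrics:
  - query: sum(kube_pod_status_phase{}) by (phase)
    metricName: podStatusCount
  - query: kube_node_role
    metricName: nodeRoles
    instant: true
```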
### Alerts
In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with Prometheus expressions to validate; it alerts and exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or to alert on abnormalities observed in the cluster based on the metrics. Information on enabling and leveraging this feature can be found [here](docs/alerts.md).
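Alert profile entries pair a Prometheus expression with a description and a severity, as in the profile excerpted later in this document:
```
- expr: sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)
  description: etcd cluster has insufficient number of members.
  severity: warning
```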
### Blogs and other useful resources
- Blog post on introduction to Kraken: https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes
- Discussion and demo on how Kraken can be leveraged to ensure OpenShift is reliable, performant and scalable: https://www.youtube.com/watch?v=s1PvupI5sD0&ab_channel=OpenShift
- Blog post emphasizing the importance of making Chaos part of Performance and Scale runs to mimic the production environments: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests
### Blogs, podcasts and interviews
Additional resources, including blog posts, podcasts, and community interviews, can be found on the [website](https://krkn-chaos.dev/blog)
### Roadmap
Following is a list of enhancements that we are planning to add support for in Kraken. Of course any help/contributions are greatly appreciated.
- [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/redhat-chaos/krkn/issues/124)
- Continue to improve [Chaos Testing Guide](https://cloud-bulldozer.github.io/kraken/) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- Support for running Kraken on Kubernetes distribution - see https://github.com/redhat-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- Sweet logo for Kraken - see https://github.com/redhat-chaos/krkn/issues/195
Enhancements being planned can be found in the [roadmap](ROADMAP.md).
### Contributions
We are always looking for more enhancements and fixes to make it better; any contributions are most welcome. Feel free to report or work on the issues filed on GitHub.
More information on how to contribute can be found in the [contribution docs](docs/contribute.md) and in the [contribution guidelines on the website](https://krkn-chaos.dev/docs/contribution-guidelines/).
If adding a new scenario or tweaking the main config, be sure to add updates to the CI so it stays current.
Please read [this file](CI/README.md#adding-a-test-case) for more information on updates.
* [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com)
* [**#forum-chaos on CoreOS Slack internal to Red Hat**](https://coreos.slack.com)
Key members (slack_username/full name): paigerube14/Paige Rubendall, mffiedler/Mike Fiedler, tsebasti/Tullio Sebastiani, yogi/Yogananth Subramanian, sahil/Sahil Shah, pradeep/Pradeep Surisetty and ravielluri/Naga Ravi Chaitanya Elluri.
* [**#krkn on Kubernetes Slack**](https://kubernetes.slack.com/messages/C05SFMHRWK1)
The Linux Foundation® (TLF) has registered trademarks and uses trademarks. For a list of TLF trademarks, see [Trademark Usage](https://www.linuxfoundation.org/legal/trademark-usage).
This document outlines the project's release protocol, a methodology designed to ensure a responsive and transparent development process that is closely aligned with the needs of our users and contributors. This protocol is tailored for projects in their early stages, prioritizing agility and community feedback over a rigid, time-boxed schedule.
#### 1. Key Principles
* **Community as the Compass:** The primary driver for all development is feedback from our user and contributor community.
* **Prioritization by Impact:** Tasks are prioritized based on their impact on user experience, the urgency of bug fixes, and the value of community-contributed features.
* **Event-Driven Releases:** Releases are not bound by a fixed calendar. New versions are published when a significant body of work is complete, a critical issue is resolved, or a new feature is ready for adoption.
* **Transparency and Communication:** All development decisions, progress, and plans are communicated openly through our issue tracker, pull requests, and community channels.
#### 2. The Release Lifecycle
The release cycle is a continuous flow of activities rather than a series of sequential phases.
**2.1. Discovery & Prioritization**
* New features and bug fixes are identified through user feedback on our issue tracker, community discussions, and direct contributions.
* The core maintainers, in collaboration with the community, continuously evaluate and tag issues to create an open and dynamic backlog.
**2.2. Development & Code Review**
* Work is initiated based on the highest-priority items in the backlog.
* All code contributions are made via pull requests (PRs).
* PRs are reviewed by maintainers and other contributors to ensure code quality, adherence to project standards, and overall stability.
**2.3. Release Readiness**
A new release is considered ready when one of the following conditions is met:
* A major new feature has been completed and thoroughly tested.
* A critical security vulnerability or bug has been addressed.
* A sufficient number of smaller improvements and fixes have been merged, providing meaningful value to users.
**2.4. Versioning**
We adhere to [**Semantic Versioning 2.0.0**](https://semver.org/).
* **Major version (`X.y.z`)**: Reserved for releases that introduce breaking changes.
* **Minor version (`x.Y.z`)**: Used for new features or significant non-breaking changes.
* **Patch version (`x.y.Z`)**: Used for bug fixes and small, non-functional improvements.
#### 3. Roles and Responsibilities
* **Members:** Active contributors listed in [MAINTAINERS.md](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md) who help keep the project healthy. Their duties include:
* Reviewing pull requests.
* Contributing code and documentation via pull requests.
* Engaging in discussions and providing feedback.
* **Maintainers and Owners:** The [core team](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md) responsible for the project's health. Their duties include:
* Facilitating community discussions and prioritization.
* Reviewing and merging pull requests.
* Cutting and announcing official releases.
* **Contributors:** The community. Their duties include:
* Reporting bugs and suggesting new features.
* Contributing code and documentation via pull requests.
* Engaging in discussions and providing feedback.
#### 4. Adoption and Future Evolution
This protocol is designed for the current stage of the project. As the project matures and the contributor base grows, the maintainers will evaluate the need for a more structured methodology to ensure continued scalability and stability.
Following is a list of enhancements that we are planning to add support for in Krkn. Of course any help/contributions are greatly appreciated.
- [x] [Ability to run multiple chaos scenarios in parallel under load to mimic real world outages](https://github.com/krkn-chaos/krkn/issues/424)
- [x] [Centralized storage for chaos experiments artifacts](https://github.com/krkn-chaos/krkn/issues/423)
- [x] [Support for causing DNS outages](https://github.com/krkn-chaos/krkn/issues/394)
- [x] [Chaos recommender](https://github.com/krkn-chaos/krkn/tree/main/utils/chaos-recommender) to suggest scenarios having probability of impacting the service under test using profiling results
- [x] Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time [krkn-chaos-ai](https://github.com/krkn-chaos/krkn-chaos-ai)
- [x] [Support for pod level network traffic shaping](https://github.com/krkn-chaos/krkn/issues/393)
- [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/krkn-chaos/krkn/issues/124)
- [x] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [x] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- [x] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [x] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)
- [x] [AI Chat bot to help get started with Krkn and commands](https://github.com/krkn-chaos/krkn-lightspeed)
- [ ] [Ability to roll back cluster to original state if chaos fails](https://github.com/krkn-chaos/krkn/issues/804)
- [ ] Add recovery time metrics to each scenario for better regression analysis
- [ ] [Add resiliency scoring to chaos scenarios ran on cluster](https://github.com/krkn-chaos/krkn/issues/125)
We attach great importance to code security. We are very grateful to users, security researchers, and others for reporting security vulnerabilities to the Krkn community. All reported security vulnerabilities will be carefully assessed and addressed in a timely manner.
## Security Checks
Krkn leverages [Snyk](https://snyk.io/) to ensure that any security vulnerabilities found
in the code base and dependencies are fixed and published in the latest release. Security
vulnerability checks are enabled for each pull request to enable developers to get insights
and proactively fix them.
## Reporting a Vulnerability
The Krkn project treats security vulnerabilities seriously, so we
strive to take action quickly when required.
The project requests that security issues be disclosed in a responsible
manner to allow adequate time to respond. If a security issue or
vulnerability has been found, please disclose the details to our
dedicated email address:
cncf-krkn-maintainers@lists.cncf.io
You can also use the [GitHub vulnerability report mechanism](https://docs.github.com/en/code-security/security-advisories/guidance-on-reporting-and-writing-information-about-vulnerabilities/privately-reporting-a-security-vulnerability#privately-reporting-a-security-vulnerability) to report the security vulnerability.
Please include as much information as possible with the report.
The security team currently consists of the [Maintainers of Krkn](https://github.com/krkn-chaos/krkn/blob/main/MAINTAINERS.md).
## Process and Supported Releases
The Krkn security team will investigate and provide a fix in a timely manner depending on the severity. The fix will be included in the new release of Krkn and details will be included in the release notes.
- expr: sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)
  description: etcd cluster has insufficient number of members.
  severity: warning
- expr: max without (endpoint) ( sum without (instance) (up{job=~".*etcd.*"} == bool 0) or count without (To) ( sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01 )) > 0
  description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 1 second. {{$value}}s
  description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 5 seconds. {{$value}}s
  description: 5 minutes avg. 99th read-only API call latency for {{$labels.verb}}/{{$labels.resource}} in scope {{$labels.scope}} higher than 30 seconds. {{$value}}s
cerberus_enabled: False # Enable it when cerberus is previously installed
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
  deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
  prometheus_url: '' # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
  prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
  uuid: # uuid for the run is generated by default if not set
  enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
  alert_profile: config/alerts.yaml # Path or URL to alert profile with the prometheus queries
  enable_metrics: False
  metrics_profile: config/metrics-report.yaml
  check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos
elastic:
  enable_elastic: False
  verify_certs: False
  elastic_url: "" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
  elastic_port: 32766
  username: "elastic"
  password: "test"
  metrics_index: "krkn-metrics"
  alerts_index: "krkn-alerts"
  telemetry_index: "krkn-telemetry"
tunings:
  wait_duration: 60 # Duration to wait between each chaos scenario
  iterations: 1 # Number of times to execute the scenarios
  daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
  enabled: False # enable/disables the telemetry collection feature
  api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production # telemetry service endpoint
  username: username # telemetry service username
  password: password # telemetry service password
  prometheus_backup: True # enables/disables prometheus data collection
  prometheus_namespace: "" # namespace where prometheus is deployed (if distribution is kubernetes)
  prometheus_container_name: "" # name of the prometheus container (if distribution is kubernetes)
  prometheus_pod_name: "" # name of the prometheus pod (if distribution is kubernetes)
  full_prometheus_backup: False # if set to False only the /prometheus/wal folder will be downloaded.
  backup_threads: 5 # number of telemetry download/upload threads
  archive_path: /tmp # local path where the archive files will be temporarily stored
  max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
  run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
  archive_size: 500000
  # the size of the prometheus data archive in KB. The lower the archive size,
  # the higher the number of archive files that will be produced and uploaded (and processed by backup_threads
  # simultaneously).
  # For unstable/slow connections it is better to keep this value low,
  # increasing the number of backup_threads; in this way, on upload failure, the retry will happen only on the
  # failed chunk without affecting the whole upload.
  telemetry_group: '' # if set will archive the telemetry in the S3 bucket in a folder named after the value, otherwise will use "default"
health_checks: # Utilizing health check endpoints to observe application behavior during chaos injection.
  interval: # Interval in seconds to perform health checks, default value is 2 seconds
  config: # Provide list of health check configurations for applications
    - url: # Provide application endpoint
      bearer_token: # Bearer token for authentication if any
      auth: # Provide authentication credentials (username, password) in tuple format if any, ex: ("admin","secretpassword")
      exit_on_failure: # If value is True exits when health check failed for application, values can be True/False
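A filled-in sketch of the above with illustrative values (the endpoint and credentials are placeholders, not defaults):
```
health_checks:
  interval: 2
  config:
    - url: https://example.com/healthz   # placeholder application endpoint
      bearer_token:                      # leave empty when not needed
      auth: ("admin","secretpassword")   # placeholder credentials
      exit_on_failure: True
```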
kubevirt_checks: # Utilizing virt check endpoints to observe ssh ability to VMI's during chaos injection.
  interval: 2 # Interval in seconds to perform virt checks, default value is 2 seconds
  namespace: # Namespace where to find VMI's
  name: # Regex name style of VMI's to watch, optional, will watch all VMI names in the namespace if left blank
  only_failures: False # Boolean of whether to show all VMI failures and successful ssh connections (False), or only failure statuses (True)
  disconnected: False # Boolean of how to try to connect to the VMIs; if True will use the ip_address to try ssh from within a node, if False will use the name and virtctl to try to connect; default is False
  ssh_node: "" # If set, will be a backup way to ssh to a node. Set it to a node that isn't targeted in chaos
  node_names: ""
  exit_on_failure: # If value is True and VMI's are failing post chaos returns failure, values can be True/False
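Likewise, a sketch of kubevirt_checks with illustrative values (the namespace and name regex are placeholders):
```
kubevirt_checks:
  interval: 2
  namespace: vm-workloads   # placeholder namespace containing the VMIs
  name: "test-vm.*"         # placeholder regex; watches all VMIs in the namespace if blank
  only_failures: False
  disconnected: False
  exit_on_failure: False
```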
distribution: kubernetes # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
chaos_scenarios: # List of policies/chaos scenarios to load
  - pod_disruption_scenarios:
      - scenarios/kube/pod.yml
cerberus:
  cerberus_enabled: False # Enable it when cerberus is previously installed
  cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
  check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
  prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
  prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
  uuid: # uuid for the run is generated by default if not set
  enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
  alert_profile: config/alerts.yaml # Path to alert profile with the prometheus queries
elastic:
  enable_elastic: False
tunings:
  wait_duration: 60 # Duration to wait between each chaos scenario
  iterations: 1 # Number of times to execute the scenarios
  daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
  enabled: False # enable/disables the telemetry collection feature
  archive_path: /tmp # local path where the archive files will be temporarily stored
distribution: kubernetes # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
  - container_scenarios: # List of chaos pod scenarios to load
      - scenarios/kube/container_dns.yml
  - plugin_scenarios:
      - scenarios/kube/scheduler.yml
cerberus:
  cerberus_enabled: False # Enable it when cerberus is previously installed
  cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
  check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
  deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
  prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
  prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
  uuid: # uuid for the run is generated by default if not set
  enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
  alert_profile: config/alerts.yaml # Path to alert profile with the prometheus queries
  check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos after soak time for the cluster to settle down
tunings:
  wait_duration: 60 # Duration to wait between each chaos scenario
  iterations: 1 # Number of times to execute the scenarios
cerberus_enabled: True # Enable it when cerberus is previously installed
cerberus_url: http://0.0.0.0:8080 # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
  deploy_dashboards: True # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
  prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
  prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
  uuid: # uuid for the run is generated by default if not set
  enable_alerts: True # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
  alert_profile: config/alerts.yaml # Path to alert profile with the prometheus queries
tunings:
  wait_duration: 60 # Duration to wait between each chaos scenario
  iterations: 1 # Number of times to execute the scenarios
  daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
  enabled: False # enable/disables the telemetry collection feature
  api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production # telemetry service endpoint
  username: username # telemetry service username
  password: password # telemetry service password
  prometheus_backup: True # enables/disables prometheus data collection
  full_prometheus_backup: False # if set to False only the /prometheus/wal folder will be downloaded.
  backup_threads: 5 # number of telemetry download/upload threads
  archive_path: /tmp # local path where the archive files will be temporarily stored
  max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
  run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
  archive_size: 500000 # the size of the prometheus data archive in KB. The lower the archive size,
  # the higher the number of archive files that will be produced and uploaded (and processed by backup_threads
  # simultaneously).
  # For unstable/slow connections it is better to keep this value low,
  # increasing the number of backup_threads; in this way, on upload failure, the retry will happen only on the
  # failed chunk without affecting the whole upload.
esServers: [http://0.0.0.0:9200] # Please change this to the respective Elasticsearch in use if you haven't run the podman-compose command to set up the infrastructure containers
- query: (sum(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(etcd|oauth-apiserver|.*apiserver|ovn-kubernetes|sdn|ingress|authentication|.*controller-manager|.*scheduler)"}) by (container, pod, namespace, node) and on (node) kube_node_role{role="master"}) > 0
  metricName: containerMemory-Masters
  instant: true
- query: (sum(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(etcd|oauth-apiserver|sdn|ovn-kubernetes|.*apiserver|authentication|.*controller-manager|.*scheduler)"}[2m]) * 100) by (container, pod, namespace, node) and on (node) kube_node_role{role="master"}) > 0
  metricName: containerCPU-Masters
  instant: true
- query: (sum(irate(container_cpu_usage_seconds_total{pod!="",container="prometheus",namespace="openshift-monitoring"}[2m]) * 100) by (container, pod, namespace, node) and on (node) kube_node_role{role="infra"}) > 0
  metricName: containerCPU-Prometheus
  instant: true
- query: (avg(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress)"}[2m]) * 100 and on (node) kube_node_role{role="worker"}) by (namespace, container)) > 0
  metricName: containerCPU-AggregatedWorkers
  instant: true
- query: (avg(irate(container_cpu_usage_seconds_total{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress|monitoring|image-registry|logging)"}[2m]) * 100 and on (node) kube_node_role{role="infra"}) by (namespace, container)) > 0
  metricName: containerCPU-AggregatedInfra
- query: (sum(container_memory_rss{pod!="",namespace="openshift-monitoring",name!="",container="prometheus"}) by (container, pod, namespace, node) and on (node) kube_node_role{role="infra"}) > 0
  metricName: containerMemory-Prometheus
  instant: True
- query: avg(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress)"} and on (node) kube_node_role{role="worker"}) by (container, namespace)
  metricName: containerMemory-AggregatedWorkers
  instant: True
- query: avg(container_memory_rss{name!="",container!="POD",namespace=~"openshift-(sdn|ovn-kubernetes|ingress|monitoring|image-registry|logging)"} and on (node) kube_node_role{role="infra"}) by (container, namespace)
  metricName: containerMemory-AggregatedInfra
  instant: True
# Node metrics
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) > 0
  metricName: nodeCPU-Masters
  instant: True
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: maxCPU-Masters
  instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemory-Masters
  instant: true
- query: (avg((sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))) by (mode)) > 0
  metricName: nodeCPU-AggregatedWorkers
  instant: True
- query: (avg((sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))) by (mode)) > 0
  metricName: nodeCPU-AggregatedInfra
  instant: True
- query: avg(node_memory_MemAvailable_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
  metricName: nodeMemoryAvailable-Masters
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemory-Masters
  instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: maxMemory-Masters
  instant: true
- query: avg(node_memory_MemAvailable_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemoryAvailable-AggregatedWorkers
  instant: True
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: maxCPU-Workers
  instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
  metricName: maxMemory-Workers
  instant: true
- query: avg(node_memory_MemAvailable_bytes and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemoryAvailable-AggregatedInfra
  instant: True
- query: avg(node_memory_Active_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
  metricName: nodeMemoryActive-Masters
  instant: True
- query: avg(node_memory_Active_bytes and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemoryActive-AggregatedWorkers
  instant: True
- query: avg(avg(node_memory_Active_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemoryActive-AggregatedInfra
- query: avg(node_memory_Cached_bytes) by (instance) + avg(node_memory_Buffers_bytes) by (instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
- query: irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
  metricName: rxNetworkBytes-Masters
- query: avg(irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: rxNetworkBytes-AggregatedWorkers
- query: avg(irate(node_network_receive_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: rxNetworkBytes-AggregatedInfra
- query: irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
  metricName: txNetworkBytes-Masters
- query: avg(irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: txNetworkBytes-AggregatedWorkers
- query: avg(irate(node_network_transmit_bytes_total{device=~"^(ens|eth|bond|team).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: txNetworkBytes-AggregatedInfra
- query: rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
  metricName: nodeDiskWrittenBytes-Masters
- query: avg(rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: nodeDiskWrittenBytes-AggregatedWorkers
- query: avg(rate(node_disk_written_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: nodeDiskWrittenBytes-AggregatedInfra
- query: rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")
  metricName: nodeDiskReadBytes-Masters
- query: avg(rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (device)
  metricName: nodeDiskReadBytes-AggregatedWorkers
- query: avg(rate(node_disk_read_bytes_total{device!~"^(dm|rb).*"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (device)
- query: sum by (cluster_version)(etcd_cluster_version)
  metricName: etcdVersion
- query: sum(rate(etcd_object_counts{}[5m])) by (resource) > 0
  metricName: etcdObjectCount
  instant: True
- query: histogram_quantile(0.99,sum(rate(etcd_request_duration_seconds_bucket[2m])) by (le,operation,apiserver)) > 0
  metricName: P99APIEtcdRequestLatency
  instant: True
# Cluster metrics
- query: count(kube_namespace_created)
  metricName: namespaceCount
- query: sum by (instance) (apiserver_storage_objects)
  metricName: etcdTotalObjectCount
  instant: True
- query: sum(kube_pod_status_phase{}) by (phase)
  metricName: podStatusCount
- query: count(kube_secret_info{})
  metricName: secretCount
- query: count(kube_deployment_labels{})
  metricName: deploymentCount
- query: count(kube_configmap_info{})
  metricName: configmapCount
- query: count(kube_service_info{})
  metricName: serviceCount
- query: kube_node_role
  metricName: nodeRoles
  instant: true
- query: sum(kube_node_status_condition{status="true"}) by (condition)
  metricName: nodeStatus
- query: (sum(rate(container_fs_writes_bytes_total{container!="",device!~".+dm.+"}[5m])) by (device, container, node) and on (node) kube_node_role{role="master"}) > 0
- query: sum(apiserver_current_inflight_requests{}) by (request_kind) > 0
  metricName: APIInflightRequests
  instant: true
# Kubelet & CRI-O
# Average and max of the CPU usage from all worker's kubelet
- query: avg(avg_over_time(irate(process_cpu_seconds_total{service="kubelet",job="kubelet"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: cpu-kubelet
  instant: true
- query: max(max_over_time(irate(process_cpu_seconds_total{service="kubelet",job="kubelet"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: max-cpu-kubelet
  instant: true
# Average of the memory usage from all worker's kubelet
- query: avg(avg_over_time(process_resident_memory_bytes{service="kubelet",job="kubelet"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: memory-kubelet
  instant: true
# Max of the memory usage from all worker's kubelet
- query: max(max_over_time(process_resident_memory_bytes{service="kubelet",job="kubelet"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: max-memory-kubelet
  instant: true
- query: max_over_time(sum(process_resident_memory_bytes{service="kubelet",job="kubelet"} and on (node) kube_node_role{role="worker"})[.elapsed:])
  metricName: max-memory-sum-kubelet
  instant: true
# Average and max of the CPU usage from all worker's CRI-O
- query: avg(avg_over_time(irate(process_cpu_seconds_total{service="kubelet",job="crio"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: cpu-crio
  instant: true
- query: max(max_over_time(irate(process_cpu_seconds_total{service="kubelet",job="crio"}[2m])[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: max-cpu-crio
  instant: true
# Average of the memory usage from all worker's CRI-O
- query: avg(avg_over_time(process_resident_memory_bytes{service="kubelet",job="crio"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
  metricName: memory-crio
  instant: true
# Max of the memory usage from all worker's CRI-O
- query: max(max_over_time(process_resident_memory_bytes{service="kubelet",job="crio"}[.elapsed:]) and on (node) kube_node_role{role="worker"})
- query: avg(avg_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-controller-manager"}[2m])) by (pod))[.elapsed:]))
  metricName: cpu-kube-controller-manager
  instant: true
- query: max(max_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-controller-manager"}[2m])) by (pod))[.elapsed:]))
  metricName: max-cpu-kube-controller-manager
  instant: true
- query: avg(avg_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-kube-controller-manager"}) by (pod))[.elapsed:]))
  metricName: memory-kube-controller-manager
  instant: true
- query: max(max_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-kube-controller-manager"}) by (pod))[.elapsed:]))
  metricName: max-memory-kube-controller-manager
  instant: true
- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-kube-apiserver"}[2m])) by (pod))[.elapsed:]))
  metricName: cpu-kube-apiserver
  instant: true
- query: avg(avg_over_time(topk(3, sum(container_memory_rss{name!="", namespace="openshift-kube-apiserver"}) by (pod))[.elapsed:]))
  metricName: memory-kube-apiserver
  instant: true
- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-apiserver"}[2m])) by (pod))[.elapsed:]))
  metricName: cpu-openshift-apiserver
  instant: true
- query: avg(avg_over_time(topk(3, sum(container_memory_rss{name!="", namespace="openshift-apiserver"}) by (pod))[.elapsed:]))
  metricName: memory-openshift-apiserver
  instant: true
- query: avg(avg_over_time(topk(3, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-etcd"}[2m])) by (pod))[.elapsed:]))
  metricName: cpu-etcd
  instant: true
- query: avg(avg_over_time(topk(3,sum(container_memory_rss{name!="", namespace="openshift-etcd"}) by (pod))[.elapsed:]))
  metricName: memory-etcd
  instant: true
- query: avg(avg_over_time(topk(1, sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-controller-manager"}[2m])) by (pod))[.elapsed:]))
  metricName: cpu-openshift-controller-manager
  instant: true
- query: avg(avg_over_time(topk(1, sum(container_memory_rss{name!="", namespace="openshift-controller-manager"}) by (pod))[.elapsed:]))
  metricName: memory-openshift-controller-manager
  instant: true
# multus
- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-multus", pod=~"(multus).+", container!="POD"}[2m])[.elapsed:])) by (container)
  metricName: cpu-multus
  instant: true
- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-multus", pod=~"(multus).+", container!="POD"}[.elapsed:])) by (container)
  metricName: memory-multus
  instant: true
# OVNKubernetes - standard & IC
- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ovn-kubernetes", pod=~"(ovnkube-master|ovnkube-control-plane).+", container!="POD"}[2m])[.elapsed:])) by (container)
  metricName: cpu-ovn-control-plane
  instant: true
- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-ovn-kubernetes", pod=~"(ovnkube-master|ovnkube-control-plane).+", container!="POD"}[.elapsed:])) by (container)
  metricName: memory-ovn-control-plane
  instant: true
- query: avg(avg_over_time(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ovn-kubernetes", pod=~"ovnkube-node.+", container!="POD"}[2m])[.elapsed:])) by (container)
  metricName: cpu-ovnkube-node
  instant: true
- query: avg(avg_over_time(container_memory_rss{name!="", namespace="openshift-ovn-kubernetes", pod=~"ovnkube-node.+", container!="POD"}[.elapsed:])) by (container)
  metricName: memory-ovnkube-node
  instant: true
# Nodes
- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: cpu-masters
  instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: memory-masters
  instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: max-memory-masters
  instant: true
- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: cpu-workers
  instant: true
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: max-cpu-workers
  instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
  metricName: memory-workers
  instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
  metricName: max-memory-workers
  instant: true
- query: sum( (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)") )
  metricName: memory-sum-workers
  instant: true
- query: avg(avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: cpu-infra
  instant: true
- query: max(max_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (instance)[.elapsed:]))
  metricName: max-cpu-infra
  instant: true
- query: avg(avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
  metricName: memory-infra
  instant: true
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[.elapsed:]) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))
  metricName: max-memory-infra
  instant: true
- query: max_over_time(sum((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)"))[.elapsed:])
  metricName: max-memory-sum-infra
  instant: true
# Monitoring and ingress
- query: avg(avg_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}[2m])) by (pod)[.elapsed:]))
  metricName: cpu-prometheus
  instant: true
- query: max(max_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}[2m])) by (pod)[.elapsed:]))
  metricName: max-cpu-prometheus
  instant: true
- query: avg(avg_over_time(sum(container_memory_rss{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}) by (pod)[.elapsed:]))
  metricName: memory-prometheus
  instant: true
- query: max(max_over_time(sum(container_memory_rss{name!="", namespace="openshift-monitoring", pod=~"prometheus-k8s.+"}) by (pod)[.elapsed:]))
  metricName: max-memory-prometheus
  instant: true
- query: avg(avg_over_time(sum(irate(container_cpu_usage_seconds_total{name!="", namespace="openshift-ingress", pod=~"router-default.+"}[2m])) by (pod)[.elapsed:]))
  metricName: cpu-router
  instant: true
- query: avg(avg_over_time(sum(container_memory_rss{name!="", namespace="openshift-ingress", pod=~"router-default.+"}) by (pod)[.elapsed:]))
# Retain the raw CPU seconds totals for comparison
- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="worker",role!="infra"}, "instance", "$1", "node", "(.+)")) by (mode)
  metricName: nodeCPUSeconds-Workers
  instant: true
- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) by (mode)
  metricName: nodeCPUSeconds-Masters
  instant: true
- query: sum(node_cpu_seconds_total and on (instance) label_replace(kube_node_role{role="infra"}, "instance", "$1", "node", "(.+)")) by (mode)
- query: sum(irate(container_cpu_usage_seconds_total{name!="",namespace=~"openshift-(etcd|oauth-apiserver|.*apiserver|ovn-kubernetes|sdn|ingress|authentication|.*controller-manager|.*scheduler|monitoring|logging|image-registry)"}[2m]) * 100) by (pod, namespace, node)
@@ -33,8 +27,17 @@ metrics:
  metricName: crioMemory
# Node metrics
- query: sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) > 0
  metricName: nodeCPU
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) > 0
  metricName: nodeCPU-Masters
- query: (avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[{{ .elapsed }}:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemory-Masters
- query: (sum(irate(node_cpu_seconds_total[2m])) by (mode,instance) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)")) > 0
  metricName: nodeCPU-Workers
- query: (avg_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[2m:]) and on (instance) label_replace(kube_node_role{role="worker"}, "instance", "$1", "node", "(.+)"))
  metricName: nodeMemory-Workers
- query: avg(node_memory_MemAvailable_bytes) by (instance)
  metricName: nodeMemoryAvailable
@@ -42,6 +45,9 @@ metrics:
- query: avg(node_memory_Active_bytes) by (instance)
  metricName: nodeMemoryActive
- query: max(max_over_time((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[{{ .elapsed }}:]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: maxMemory-Masters
- query: avg(node_memory_Cached_bytes) by (instance) + avg(node_memory_Buffers_bytes) by (instance)
  metricName: nodeMemoryCached+nodeMemoryBuffers
@@ -84,34 +90,4 @@ metrics:
- query: sum by (cluster_version)(etcd_cluster_version)
  metricName: etcdVersion
  instant: true
# Cluster metrics
- query: count(kube_namespace_created)
  metricName: namespaceCount
- query: sum(kube_pod_status_phase{}) by (phase)
  metricName: podStatusCount
- query: count(kube_secret_info{})
  metricName: secretCount
- query: count(kube_deployment_labels{})
  metricName: deploymentCount
- query: count(kube_configmap_info{})
  metricName: configmapCount
- query: count(kube_service_info{})
  metricName: serviceCount
- query: kube_node_role
  metricName: nodeRoles
  instant: true
- query: sum(kube_node_status_condition{status="true"}) by (condition)
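Nearly every per-role query in this profile uses the same join idiom: node-exporter series are keyed by an `instance` label, while `kube_node_role` (from kube-state-metrics) carries the node name in a `node` label, so `label_replace` copies `node` into `instance` and `and on (instance)` then keeps only the node metrics whose instance matches a node holding the given role. A minimal standalone entry in the same profile format may make the idiom easier to read; the query and metric name below are illustrative and not part of the profile:

```yaml
# Hypothetical entry: average non-idle CPU rate on master nodes only.
# label_replace rewrites kube_node_role's "node" label into "instance"
# so the set operation can match it against node-exporter series.
- query: avg(rate(node_cpu_seconds_total{mode!="idle"}[2m]) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)"))
  metricName: example-cpu-masters
  instant: true
```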
-Container image gets automatically built by quay.io at [Kraken image](https://quay.io/chaos-kubox/krkn).
+Container image gets automatically built by quay.io at [Kraken image](https://quay.io/redhat-chaos/krkn).
### Run containerized version
-Refer [instructions](https://github.com/redhat-chaos/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.
+Refer to the [instructions](https://krkn-chaos.dev/docs/installation/) for information on how to run the containerized version of kraken.
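For quick reference, a typical containerized run looks roughly like the following; the image tag and in-container mount paths are assumptions here, so defer to the installation docs linked above:

```bash
# Mount your kubeconfig and kraken config into the container and run it.
podman run --name=kraken --net=host \
    -v <path_to_kubeconfig>:/home/krkn/.kube/config:Z \
    -v <path_to_kraken_config>:/home/krkn/kraken/config/config.yaml:Z \
    quay.io/redhat-chaos/krkn:latest
```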
### Run Custom Kraken Image
Refer to [instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.
### Kraken as a KubeApp
To run containerized Kraken as a Kubernetes/OpenShift Job, follow these steps:
1. Configure the [config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) file according to your requirements.
2. Create a namespace under which you want to run the kraken pod using `kubectl create ns <namespace>`.
3. Switch to `<namespace>` namespace:
- In Kubernetes, use `kubectl config set-context --current --namespace=<namespace>`
- In OpenShift, use `oc project <namespace>`
4. Create a ConfigMap named kube-config using `kubectl create configmap kube-config --from-file=<path_to_kubeconfig>`
5. Create a ConfigMap named kraken-config using `kubectl create configmap kraken-config --from-file=<path_to_kraken_config>`
6. Create a ConfigMap named scenarios-config using `kubectl create configmap scenarios-config --from-file=<path_to_scenarios_folder>`
7. Create a service account to run the kraken pod using `kubectl create serviceaccount useroot`.
8. In OpenShift, grant the service account the required privileges by executing `oc adm policy add-scc-to-user privileged -z useroot`.
9. Create a Job using `kubectl apply -f kraken.yml` (a minimal example manifest is sketched after the note below) and monitor the status using `oc get jobs` and `oc get pods`.
NOTE: Running Kraken inside the cluster it targets is not recommended, since the pod running Kraken might itself get disrupted.
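A minimal sketch of what the `kraken.yml` Job referenced in step 9 might look like, assuming the ConfigMap and service account names from the steps above; the image tag and in-container mount paths are assumptions and may need adjusting for your image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kraken
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: useroot       # created in step 7
      restartPolicy: Never
      containers:
        - name: kraken
          image: quay.io/redhat-chaos/krkn:latest
          securityContext:
            privileged: true            # matches the SCC granted in step 8
          volumeMounts:
            - name: kube-config
              mountPath: /home/krkn/.kube           # kubeconfig (assumed path)
            - name: kraken-config
              mountPath: /home/krkn/kraken/config   # config.yaml (assumed path)
            - name: scenarios-config
              mountPath: /home/krkn/kraken/scenarios  # scenario files (assumed path)
      volumes:
        - name: kube-config
          configMap:
            name: kube-config           # created in step 4
        - name: kraken-config
          configMap:
            name: kraken-config         # created in step 5
        - name: scenarios-config
          configMap:
            name: scenarios-config      # created in step 6
```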
-1. Git clone the Kraken repository using `git clone https://github.com/openshift-scale/kraken.git`.
+1. Git clone the Kraken repository using `git clone https://github.com/redhat-chaos/krkn.git`.
2. Modify the python code and yaml files to address your needs.
3. Execute `podman build -t <new_image_name>:latest .` in the containers directory within kraken to build an image from a Dockerfile.
4. Execute `podman run --detach --name <container_name> <new_image_name>:latest` to start a container based on your new image.
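Concretely, with an illustrative image and container name substituted into steps 3 and 4, the flow might look like this:

```bash
# Build from the containers directory of the cloned repo (step 3).
cd krkn/containers
podman build -t my-krkn:latest .

# Start a detached container from the freshly built image (step 4).
podman run --detach --name my-krkn-run my-krkn:latest
```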
# Building the Kraken image on IBM Power (ppc64le)
-1. Git clone the Kraken repository using `git clone https://github.com/cloud-bulldozer/kraken.git` on an IBM Power Systems server.
+1. Git clone the Kraken repository using `git clone https://github.com/redhat-chaos/krkn.git` on an IBM Power Systems server.
2. Modify the python code and yaml files to address your needs.
3. Execute `podman build -t <new_image_name>:latest -f Dockerfile-ppc64le` in the containers directory within kraken to build an image from the Dockerfile for Power.
4. Execute `podman run --detach --name <container_name> <new_image_name>:latest` to start a container based on your new image.
"description":"Authentication tuple to authenticate into health check URL",
"variable":"HEALTH_CHECK_AUTH",
"type":"string",
"default":"",
"required":"false"
},
{
"name":"health-check-bearer-token",
"short_description":"Health check bearer token",
"description":"Bearer token to authenticate into health check URL",
"variable":"HEALTH_CHECK_BEARER_TOKEN",
"type":"string",
"default":"",
"required":"false"
},
{
"name":"health-check-exit",
"short_description":"Health check exit on failure",
"description":"Exit on failure when health check URL is not able to connect",
"variable":"HEALTH_CHECK_EXIT_ON_FAILURE",
"type":"enum",
"allowed_values":"True,False",
"separator":",",
"default":"False",
"required":"false"
},
{
"name":"health-check-verify",
"short_description":"SSL Verification of health check url",
"description":"SSL Verification to authenticate into health check URL",
"variable":"HEALTH_CHECK_VERIFY",
"type":"enum",
"allowed_values":"True,False",
"separator":",",
"default":"False",
"required":"false"
},
{
"name":"kubevirt-check-interval",
"short_description":"Kube Virt check interval",
"description":"How often to check the kube virt check Vms ssh status",
"variable":"KUBE_VIRT_CHECK_INTERVAL",
"type":"number",
"default":"2",
"required":"false"
},
{
"name":"kubevirt-namespace",
"short_description":"KubeVirt namespace to check",
"description":"KubeVirt namespace to check the health of",
"variable":"KUBE_VIRT_NAMESPACE",
"type":"string",
"default":"",
"required":"false"
},
{
"name":"kubevirt-name",
"short_description":"KubeVirt regex names to watch",
"description":"KubeVirt regex names to check VMs",
"variable":"KUBE_VIRT_NAME",
"type":"string",
"default":"",
"required":"false"
},
{
"name":"kubevirt-only-failures",
"short_description":"KubeVirt checks only report if failure occurs",
"description":"KubeVirt checks only report if failure occurs",
"variable":"KUBE_VIRT_FAILURES",
"type":"enum",
"allowed_values":"True,False,true,false",
"separator":",",
"default":"False",
"required":"false"
},
{
"name":"kubevirt-disconnected",
"short_description":"KubeVirt checks in disconnected mode",
"description":"KubeVirt checks in disconnected mode, bypassing the clusters Api",
"variable":"KUBE_VIRT_DISCONNECTED",
"type":"enum",
"allowed_values":"True,False,true,false",
"separator":",",
"default":"False",
"required":"false"
},
{
"name":"kubevirt-ssh-node",
"short_description":"KubeVirt node to ssh from",
"description":"KubeVirt node to ssh from, should be available whole chaos run",
"variable":"KUBE_VIRT_SSH_NODE",
"type":"string",
"default":"",
"required":"false"
},
{
"name":"kubevirt-exit-on-failure",
"short_description":"KubeVirt fail if failed vms at end of run",
"description":"KubeVirt fails run if vms still have false status",
"variable":"KUBE_VIRT_EXIT_ON_FAIL",
"type":"enum",
"allowed_values":"True,False,true,false",
"separator":",",
"default":"False",
"required":"false"
},
{
"name":"kubevirt-node-node",
"short_description":"KubeVirt node to filter vms on",
"description":"Only track VMs in KubeVirt on given node name",
"variable":"KUBE_VIRT_NODE_NAME",
"type":"string",
"default":"",
"required":"false"
},
{
"name":"krkn-debug",
"short_description":"Krkn debug mode",
"description":"Enables debug mode for Krkn",
"variable":"KRKN_DEBUG",
"type":"enum",
"allowed_values":"True,False",
"separator":",",
"default":"False",
"required":"false"
}
]
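These parameters surface as environment variables when the scenario runs in a container. A hypothetical invocation wiring a few of the health-check and KubeVirt options above into a containerized run might look like the following; the image name, scenario tag, and kubeconfig mount path are assumptions:

```bash
# Values are illustrative; substitute your own scenario tag and settings.
podman run --rm --net=host \
  -v "$HOME/.kube/config:/home/krkn/.kube/config:Z" \
  -e HEALTH_CHECK_BEARER_TOKEN="$(oc whoami -t)" \
  -e HEALTH_CHECK_EXIT_ON_FAILURE="True" \
  -e HEALTH_CHECK_VERIFY="False" \
  -e KUBE_VIRT_CHECK_INTERVAL="5" \
  -e KUBE_VIRT_NAMESPACE="my-vms" \
  quay.io/redhat-chaos/krkn-hub:<scenario>
```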