mirror of
https://github.com/krkn-chaos/krkn.git
synced 2026-02-19 20:40:33 +00:00
Compare commits
1 Commits
feature/co
...
paigerube1
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
717cb72f79 |
3
.github/PULL_REQUEST_TEMPLATE.md
vendored
3
.github/PULL_REQUEST_TEMPLATE.md
vendored
@@ -23,7 +23,6 @@ If checked, a documentation PR must be created and merged in the [website reposi
|
||||
<-- Add the link to the corresponding documentation PR in the website repository -->
|
||||
|
||||
# Checklist before requesting a review
|
||||
[ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See [contributing guidelines](https://krkn-chaos.dev/docs/contribution-guidelines/)
|
||||
See [testing your changes](https://krkn-chaos.dev/docs/developers-guide/testing-changes/) and run on any Kubernetes or OpenShift cluster to validate your changes
|
||||
- [ ] I have performed a self-review of my code by running krkn and specific scenario
|
||||
- [ ] If it is a core feature, I have added thorough unit tests with above 80% coverage
|
||||
@@ -44,4 +43,4 @@ OR
|
||||
python -m coverage run -a -m unittest discover -s tests -v
|
||||
...
|
||||
<---insert test results output--->
|
||||
```
|
||||
```
|
||||
52
.github/workflows/stale.yml
vendored
52
.github/workflows/stale.yml
vendored
@@ -1,52 +0,0 @@
|
||||
name: Manage Stale Issues and Pull Requests
|
||||
|
||||
on:
|
||||
schedule:
|
||||
# Run daily at 1:00 AM UTC
|
||||
- cron: '0 1 * * *'
|
||||
workflow_dispatch:
|
||||
|
||||
permissions:
|
||||
issues: write
|
||||
pull-requests: write
|
||||
|
||||
jobs:
|
||||
stale:
|
||||
name: Mark and Close Stale Issues and PRs
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Mark and close stale issues and PRs
|
||||
uses: actions/stale@v9
|
||||
with:
|
||||
days-before-issue-stale: 60
|
||||
days-before-issue-close: 14
|
||||
stale-issue-label: 'stale'
|
||||
stale-issue-message: |
|
||||
This issue has been automatically marked as stale because it has not had any activity in the last 60 days.
|
||||
It will be closed in 14 days if no further activity occurs.
|
||||
If this issue is still relevant, please leave a comment or remove the stale label.
|
||||
Thank you for your contributions to krkn!
|
||||
close-issue-message: |
|
||||
This issue has been automatically closed due to inactivity.
|
||||
If you believe this issue is still relevant, please feel free to reopen it or create a new issue with updated information.
|
||||
Thank you for your understanding!
|
||||
close-issue-reason: 'not_planned'
|
||||
|
||||
days-before-pr-stale: 90
|
||||
days-before-pr-close: 14
|
||||
stale-pr-label: 'stale'
|
||||
stale-pr-message: |
|
||||
This pull request has been automatically marked as stale because it has not had any activity in the last 90 days.
|
||||
It will be closed in 14 days if no further activity occurs.
|
||||
If this PR is still relevant, please rebase it, address any pending reviews, or leave a comment.
|
||||
Thank you for your contributions to krkn!
|
||||
close-pr-message: |
|
||||
This pull request has been automatically closed due to inactivity.
|
||||
If you believe this PR is still relevant, please feel free to reopen it or create a new pull request with updated changes.
|
||||
Thank you for your understanding!
|
||||
|
||||
# Exempt labels
|
||||
exempt-issue-labels: 'bug,enhancement,good first issue'
|
||||
exempt-pr-labels: 'pending discussions,hold'
|
||||
|
||||
remove-stale-when-updated: true
|
||||
7
.github/workflows/tests.yml
vendored
7
.github/workflows/tests.yml
vendored
@@ -32,14 +32,13 @@ jobs:
|
||||
- name: Install Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.11'
|
||||
python-version: '3.9'
|
||||
architecture: 'x64'
|
||||
- name: Install environment
|
||||
run: |
|
||||
sudo apt-get install build-essential python3-dev
|
||||
pip install --upgrade pip
|
||||
pip install -r requirements.txt
|
||||
pip install coverage
|
||||
|
||||
- name: Deploy test workloads
|
||||
run: |
|
||||
@@ -96,7 +95,7 @@ jobs:
|
||||
echo "test_node" >> ./CI/tests/functional_tests
|
||||
echo "test_pod" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_error" >> ./CI/tests/functional_tests
|
||||
echo "test_service_hijacking" >> ./CI/tests/functional_tests
|
||||
echo "test_service_hijacking" > ./CI/tests/functional_tests
|
||||
echo "test_pod_network_filter" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_server" >> ./CI/tests/functional_tests
|
||||
echo "test_time" >> ./CI/tests/functional_tests
|
||||
@@ -182,7 +181,7 @@ jobs:
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.11'
|
||||
python-version: 3.9
|
||||
- name: Copy badge on GitHub Page Repo
|
||||
env:
|
||||
COLOR: yellow
|
||||
|
||||
273
CLAUDE.md
273
CLAUDE.md
@@ -1,273 +0,0 @@
|
||||
# CLAUDE.md - Krkn Chaos Engineering Framework
|
||||
|
||||
## Project Overview
|
||||
|
||||
Krkn (Kraken) is a chaos engineering tool for Kubernetes/OpenShift clusters. It injects deliberate failures to validate cluster resilience. Plugin-based architecture with multi-cloud support (AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack).
|
||||
|
||||
## Repository Structure
|
||||
|
||||
```
|
||||
krkn/
|
||||
├── krkn/
|
||||
│ ├── scenario_plugins/ # Chaos scenario plugins (pod, node, network, hogs, etc.)
|
||||
│ ├── utils/ # Utility functions
|
||||
│ ├── rollback/ # Rollback management
|
||||
│ ├── prometheus/ # Prometheus integration
|
||||
│ └── cerberus/ # Health monitoring
|
||||
├── tests/ # Unit tests (unittest framework)
|
||||
├── scenarios/ # Example scenario configs (openshift/, kube/, kind/)
|
||||
├── config/ # Configuration files
|
||||
└── CI/ # CI/CD test scripts
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Setup (ALWAYS use virtual environment)
|
||||
python3 -m venv venv
|
||||
source venv/bin/activate
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Run Krkn
|
||||
python run_kraken.py --config config/config.yaml
|
||||
|
||||
# Note: Scenarios are specified in config.yaml under kraken.chaos_scenarios
|
||||
# There is no --scenario flag; edit config/config.yaml to select scenarios
|
||||
|
||||
# Run tests
|
||||
python -m unittest discover -s tests -v
|
||||
python -m coverage run -a -m unittest discover -s tests -v
|
||||
```
|
||||
|
||||
## Critical Requirements
|
||||
|
||||
### Python Environment
|
||||
- **Python 3.9+** required
|
||||
- **NEVER install packages globally** - always use virtual environment
|
||||
- **CRITICAL**: `docker` must be <7.0 and `requests` must be <2.32 (Unix socket compatibility)
|
||||
|
||||
### Key Dependencies
|
||||
- **krkn-lib** (5.1.13): Core library for Kubernetes/OpenShift operations
|
||||
- **kubernetes** (34.1.0): Kubernetes Python client
|
||||
- **docker** (<7.0), **requests** (<2.32): DO NOT upgrade without verifying compatibility
|
||||
- Cloud SDKs: boto3 (AWS), azure-mgmt-* (Azure), google-cloud-compute (GCP), ibm_vpc (IBM), pyVmomi (VMware)
|
||||
|
||||
## Plugin Architecture (CRITICAL)
|
||||
|
||||
**Strictly enforced naming conventions:**
|
||||
|
||||
### Naming Rules
|
||||
- **Module files**: Must end with `_scenario_plugin.py` and use snake_case
|
||||
- Example: `pod_disruption_scenario_plugin.py`
|
||||
- **Class names**: Must be CamelCase and end with `ScenarioPlugin`
|
||||
- Example: `PodDisruptionScenarioPlugin`
|
||||
- Must match module filename (snake_case ↔ CamelCase)
|
||||
- **Directory structure**: Plugin dirs CANNOT contain "scenario" or "plugin"
|
||||
- Location: `krkn/scenario_plugins/<plugin_name>/`
|
||||
|
||||
### Plugin Implementation
|
||||
Every plugin MUST:
|
||||
1. Extend `AbstractScenarioPlugin`
|
||||
2. Implement `run()` method
|
||||
3. Implement `get_scenario_types()` method
|
||||
|
||||
```python
|
||||
from krkn.scenario_plugins import AbstractScenarioPlugin
|
||||
|
||||
class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
def run(self, config, scenarios_list, kubeconfig_path, wait_duration):
|
||||
pass
|
||||
|
||||
def get_scenario_types(self):
|
||||
return ["pod_scenarios", "pod_outage"]
|
||||
```
|
||||
|
||||
### Creating a New Plugin
|
||||
1. Create directory: `krkn/scenario_plugins/<plugin_name>/`
|
||||
2. Create module: `<plugin_name>_scenario_plugin.py`
|
||||
3. Create class: `<PluginName>ScenarioPlugin` extending `AbstractScenarioPlugin`
|
||||
4. Implement `run()` and `get_scenario_types()`
|
||||
5. Create unit test: `tests/test_<plugin_name>_scenario_plugin.py`
|
||||
6. Add example scenario: `scenarios/<platform>/<scenario>.yaml`
|
||||
|
||||
**DO NOT**: Violate naming conventions (factory will reject), include "scenario"/"plugin" in directory names, create plugins without tests.
|
||||
|
||||
## Testing
|
||||
|
||||
### Unit Tests
|
||||
```bash
|
||||
# Run all tests
|
||||
python -m unittest discover -s tests -v
|
||||
|
||||
# Specific test
|
||||
python -m unittest tests.test_pod_disruption_scenario_plugin
|
||||
|
||||
# With coverage
|
||||
python -m coverage run -a -m unittest discover -s tests -v
|
||||
python -m coverage html
|
||||
```
|
||||
|
||||
**Test requirements:**
|
||||
- Naming: `test_<module>_scenario_plugin.py`
|
||||
- Mock external dependencies (Kubernetes API, cloud providers)
|
||||
- Test success, failure, and edge cases
|
||||
- Keep tests isolated and independent
|
||||
|
||||
### Functional Tests
|
||||
Located in `CI/tests/`. Can be run locally on a kind cluster with Prometheus and Elasticsearch set up.
|
||||
|
||||
**Setup for local testing:**
|
||||
1. Deploy Prometheus and Elasticsearch on your kind cluster:
|
||||
- Prometheus setup: https://krkn-chaos.dev/docs/developers-guide/testing-changes/#prometheus
|
||||
- Elasticsearch setup: https://krkn-chaos.dev/docs/developers-guide/testing-changes/#elasticsearch
|
||||
|
||||
2. Or disable monitoring features in `config/config.yaml`:
|
||||
```yaml
|
||||
performance_monitoring:
|
||||
enable_alerts: False
|
||||
enable_metrics: False
|
||||
check_critical_alerts: False
|
||||
```
|
||||
|
||||
**Note:** Functional tests run automatically in CI with full monitoring enabled.
|
||||
|
||||
## Cloud Provider Implementations
|
||||
|
||||
Node chaos scenarios are cloud-specific. Each in `krkn/scenario_plugins/node_actions/<provider>_node_scenarios.py`:
|
||||
- AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack, Bare Metal
|
||||
|
||||
Implement: stop, start, reboot, terminate instances.
|
||||
|
||||
**When modifying**: Maintain consistency with other providers, handle API errors, add logging, update tests.
|
||||
|
||||
### Adding Cloud Provider Support
|
||||
1. Create: `krkn/scenario_plugins/node_actions/<provider>_node_scenarios.py`
|
||||
2. Extend: `abstract_node_scenarios.AbstractNodeScenarios`
|
||||
3. Implement: `stop_instances`, `start_instances`, `reboot_instances`, `terminate_instances`
|
||||
4. Add SDK to `requirements.txt`
|
||||
5. Create unit test with mocked SDK
|
||||
6. Add example scenario: `scenarios/openshift/<provider>_node_scenarios.yml`
|
||||
|
||||
## Configuration
|
||||
|
||||
**Main config**: `config/config.yaml`
|
||||
- `kraken`: Core settings
|
||||
- `cerberus`: Health monitoring
|
||||
- `performance_monitoring`: Prometheus
|
||||
- `elastic`: Elasticsearch telemetry
|
||||
|
||||
**Scenario configs**: `scenarios/` directory
|
||||
```yaml
|
||||
- config:
|
||||
scenario_type: <type> # Must match plugin's get_scenario_types()
|
||||
```
|
||||
|
||||
## Code Style
|
||||
|
||||
- **Import order**: Standard library, third-party, local imports
|
||||
- **Naming**: snake_case (functions/variables), CamelCase (classes)
|
||||
- **Logging**: Use Python's `logging` module
|
||||
- **Error handling**: Return appropriate exit codes
|
||||
- **Docstrings**: Required for public functions/classes
|
||||
|
||||
## Exit Codes
|
||||
|
||||
Krkn uses specific exit codes to communicate execution status:
|
||||
|
||||
- `0`: Success - all scenarios passed, no critical alerts
|
||||
- `1`: Scenario failure - one or more scenarios failed
|
||||
- `2`: Critical alerts fired during execution
|
||||
- `3+`: Health check failure (Cerberus monitoring detected issues)
|
||||
|
||||
**When implementing scenarios:**
|
||||
- Return `0` on success
|
||||
- Return `1` on scenario-specific failures
|
||||
- Propagate health check failures appropriately
|
||||
- Log exit code reasons clearly
|
||||
|
||||
## Container Support
|
||||
|
||||
Krkn can run inside a container. See `containers/` directory.
|
||||
|
||||
**Building custom image:**
|
||||
```bash
|
||||
cd containers
|
||||
./compile_dockerfile.sh # Generates Dockerfile from template
|
||||
docker build -t krkn:latest .
|
||||
```
|
||||
|
||||
**Running containerized:**
|
||||
```bash
|
||||
docker run -v ~/.kube:/root/.kube:Z \
|
||||
-v $(pwd)/config:/config:Z \
|
||||
-v $(pwd)/scenarios:/scenarios:Z \
|
||||
krkn:latest
|
||||
```
|
||||
|
||||
## Git Workflow
|
||||
|
||||
- **NEVER commit directly to main**
|
||||
- **NEVER use `--force` without approval**
|
||||
- **ALWAYS create feature branches**: `git checkout -b feature/description`
|
||||
- **ALWAYS run tests before pushing**
|
||||
|
||||
**Conventional commits**: `feat:`, `fix:`, `test:`, `docs:`, `refactor:`
|
||||
|
||||
```bash
|
||||
git checkout main && git pull origin main
|
||||
git checkout -b feature/your-feature-name
|
||||
# Make changes, write tests
|
||||
python -m unittest discover -s tests -v
|
||||
git add <specific-files>
|
||||
git commit -m "feat: description"
|
||||
git push -u origin feature/your-feature-name
|
||||
```
|
||||
|
||||
## Environment Variables
|
||||
|
||||
- `KUBECONFIG`: Path to kubeconfig
|
||||
- `AWS_*`, `AZURE_*`, `GOOGLE_APPLICATION_CREDENTIALS`: Cloud credentials
|
||||
- `PROMETHEUS_URL`, `ELASTIC_URL`, `ELASTIC_PASSWORD`: Monitoring config
|
||||
|
||||
**NEVER commit credentials or API keys.**
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. Missing virtual environment - always activate venv
|
||||
2. Running functional tests without cluster setup
|
||||
3. Ignoring exit codes
|
||||
4. Modifying krkn-lib directly (it's a separate package)
|
||||
5. Upgrading docker/requests beyond version constraints
|
||||
|
||||
## Before Writing Code
|
||||
|
||||
1. Check for existing implementations
|
||||
2. Review existing plugins as examples
|
||||
3. Maintain consistency with cloud provider patterns
|
||||
4. Plan rollback logic
|
||||
5. Write tests alongside code
|
||||
6. Update documentation
|
||||
|
||||
## When Adding Dependencies
|
||||
|
||||
1. Check if functionality exists in krkn-lib or current dependencies
|
||||
2. Verify compatibility with existing versions
|
||||
3. Pin specific versions in `requirements.txt`
|
||||
4. Check for security vulnerabilities
|
||||
5. Test thoroughly for conflicts
|
||||
|
||||
## Common Development Tasks
|
||||
|
||||
### Modifying Existing Plugin
|
||||
1. Read plugin code and corresponding test
|
||||
2. Make changes
|
||||
3. Update/add unit tests
|
||||
4. Run: `python -m unittest tests.test_<plugin>_scenario_plugin`
|
||||
|
||||
### Writing Unit Tests
|
||||
1. Create: `tests/test_<module>_scenario_plugin.py`
|
||||
2. Import `unittest` and plugin class
|
||||
3. Mock external dependencies
|
||||
4. Test success, failure, and edge cases
|
||||
5. Run: `python -m unittest tests.test_<module>_scenario_plugin`
|
||||
|
||||
@@ -41,7 +41,7 @@ ENV KUBECONFIG /home/krkn/.kube/config
|
||||
|
||||
# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
|
||||
RUN dnf update && dnf install -y --setopt=install_weak_deps=False \
|
||||
git python3.11 jq yq gettext wget which ipmitool openssh-server &&\
|
||||
git python39 jq yq gettext wget which ipmitool openssh-server &&\
|
||||
dnf clean all
|
||||
|
||||
# copy oc client binary from oc-build image
|
||||
@@ -63,15 +63,15 @@ RUN if [ -n "$PR_NUMBER" ]; then git fetch origin pull/${PR_NUMBER}/head:pr-${PR
|
||||
# if it is a TAG trigger checkout the tag
|
||||
RUN if [ -n "$TAG" ]; then git checkout "$TAG";fi
|
||||
|
||||
RUN python3.11 -m ensurepip --upgrade --default-pip
|
||||
RUN python3.11 -m pip install --upgrade pip setuptools==78.1.1
|
||||
RUN python3.9 -m ensurepip --upgrade --default-pip
|
||||
RUN python3.9 -m pip install --upgrade pip setuptools==78.1.1
|
||||
|
||||
# removes the the vulnerable versions of setuptools and pip
|
||||
RUN rm -rf "$(pip cache dir)"
|
||||
RUN rm -rf /tmp/*
|
||||
RUN rm -rf /usr/local/lib/python3.11/ensurepip/_bundled
|
||||
RUN pip3.11 install -r requirements.txt
|
||||
RUN pip3.11 install jsonschema
|
||||
RUN rm -rf /usr/local/lib/python3.9/ensurepip/_bundled
|
||||
RUN pip3.9 install -r requirements.txt
|
||||
RUN pip3.9 install jsonschema
|
||||
|
||||
LABEL krknctl.title.global="Krkn Base Image"
|
||||
LABEL krknctl.description.global="This is the krkn base image."
|
||||
|
||||
@@ -146,7 +146,7 @@ class AbstractScenarioPlugin(ABC):
|
||||
if scenario_telemetry.exit_status != 0:
|
||||
failed_scenarios.append(scenario_config)
|
||||
scenario_telemetries.append(scenario_telemetry)
|
||||
logging.info(f"waiting {wait_duration} before running the next scenario")
|
||||
logging.info(f"wating {wait_duration} before running the next scenario")
|
||||
time.sleep(wait_duration)
|
||||
return failed_scenarios, scenario_telemetries
|
||||
|
||||
|
||||
@@ -53,7 +53,7 @@ class HogsScenarioPlugin(AbstractScenarioPlugin):
|
||||
raise Exception("no available nodes to schedule workload")
|
||||
|
||||
if not has_selector:
|
||||
available_nodes = [available_nodes[random.randint(0, len(available_nodes) - 1)]]
|
||||
available_nodes = [available_nodes[random.randint(0, len(available_nodes))]]
|
||||
|
||||
if scenario_config.number_of_nodes and len(available_nodes) > scenario_config.number_of_nodes:
|
||||
available_nodes = random.sample(available_nodes, scenario_config.number_of_nodes)
|
||||
|
||||
@@ -76,7 +76,7 @@ class abstract_node_scenarios:
|
||||
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
|
||||
|
||||
logging.info("The kubelet of the node %s has been stopped" % (node))
|
||||
logging.info("stop_kubelet_scenario has been successfully injected!")
|
||||
logging.info("stop_kubelet_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to stop the kubelet of the node. Encountered following "
|
||||
@@ -108,7 +108,7 @@ class abstract_node_scenarios:
|
||||
)
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli,affected_node)
|
||||
logging.info("The kubelet of the node %s has been restarted" % (node))
|
||||
logging.info("restart_kubelet_scenario has been successfully injected!")
|
||||
logging.info("restart_kubelet_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to restart the kubelet of the node. Encountered following "
|
||||
@@ -128,7 +128,7 @@ class abstract_node_scenarios:
|
||||
"oc debug node/" + node + " -- chroot /host "
|
||||
"dd if=/dev/urandom of=/proc/sysrq-trigger"
|
||||
)
|
||||
logging.info("node_crash_scenario has been successfully injected!")
|
||||
logging.info("node_crash_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to crash the node. Encountered following exception: %s. "
|
||||
|
||||
@@ -379,7 +379,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been terminated" % (instance_id)
|
||||
)
|
||||
logging.info("node_termination_scenario has been successfully injected!")
|
||||
logging.info("node_termination_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to terminate node instance. Encountered following exception:"
|
||||
@@ -408,7 +408,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been rebooted" % (instance_id)
|
||||
)
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -229,7 +229,7 @@ class bm_node_scenarios(abstract_node_scenarios):
|
||||
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
|
||||
logging.info("Node with bmc address: %s has been rebooted" % (bmc_addr))
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -237,7 +237,7 @@ class docker_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with container ID: %s has been terminated" % (container_id)
|
||||
)
|
||||
logging.info("node_termination_scenario has been successfully injected!")
|
||||
logging.info("node_termination_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to terminate node instance. Encountered following exception:"
|
||||
@@ -264,7 +264,7 @@ class docker_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with container ID: %s has been rebooted" % (container_id)
|
||||
)
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -309,7 +309,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been terminated" % instance_id
|
||||
)
|
||||
logging.info("node_termination_scenario has been successfully injected!")
|
||||
logging.info("node_termination_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to terminate node instance. Encountered following exception:"
|
||||
@@ -341,7 +341,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been rebooted" % instance_id
|
||||
)
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -184,7 +184,7 @@ class openstack_node_scenarios(abstract_node_scenarios):
|
||||
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
|
||||
logging.info("Node with instance name: %s has been rebooted" % (node))
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
@@ -249,7 +249,7 @@ class openstack_node_scenarios(abstract_node_scenarios):
|
||||
node_ip.strip(), service, ssh_private_key, timeout
|
||||
)
|
||||
logging.info("Service status checked on %s" % (node_ip))
|
||||
logging.info("Check service status is successfully injected!")
|
||||
logging.info("Check service status is successfuly injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to check service status. Encountered following exception:"
|
||||
|
||||
@@ -43,7 +43,7 @@ class TimeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
cerberus.publish_kraken_status(
|
||||
krkn_config, not_reset, start_time, end_time
|
||||
)
|
||||
except (RuntimeError, Exception) as e:
|
||||
except (RuntimeError, Exception):
|
||||
logging.error(
|
||||
f"TimeActionsScenarioPlugin scenario {scenario} failed with exception: {e}"
|
||||
)
|
||||
|
||||
@@ -140,7 +140,7 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
network_association_ids[0], acl_id
|
||||
)
|
||||
|
||||
# capture the original_acl_id, created_acl_id and
|
||||
# capture the orginal_acl_id, created_acl_id and
|
||||
# new association_id to use during the recovery
|
||||
ids[new_association_id] = original_acl_id
|
||||
|
||||
@@ -156,7 +156,7 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
new_association_id, original_acl_id
|
||||
)
|
||||
logging.info(
|
||||
"Waiting for 60 seconds to make sure " "the changes are in place"
|
||||
"Wating for 60 seconds to make sure " "the changes are in place"
|
||||
)
|
||||
time.sleep(60)
|
||||
|
||||
|
||||
@@ -1,71 +0,0 @@
|
||||
import logging
|
||||
import threading
|
||||
from datetime import datetime, timezone
|
||||
from krkn.utils.ErrorLog import ErrorLog
|
||||
|
||||
|
||||
class ErrorCollectionHandler(logging.Handler):
|
||||
"""
|
||||
Custom logging handler that captures ERROR and CRITICAL level logs
|
||||
in structured format for telemetry collection.
|
||||
|
||||
Stores logs in memory as ErrorLog objects for later retrieval.
|
||||
Thread-safe for concurrent logging operations.
|
||||
"""
|
||||
|
||||
def __init__(self, level=logging.ERROR):
|
||||
"""
|
||||
Initialize the error collection handler.
|
||||
|
||||
Args:
|
||||
level: Minimum log level to capture (default: ERROR)
|
||||
"""
|
||||
super().__init__(level)
|
||||
self.error_logs: list[ErrorLog] = []
|
||||
self._lock = threading.Lock()
|
||||
|
||||
def emit(self, record: logging.LogRecord):
|
||||
"""
|
||||
Capture ERROR and CRITICAL logs and store as ErrorLog objects.
|
||||
|
||||
Args:
|
||||
record: LogRecord from Python logging framework
|
||||
"""
|
||||
try:
|
||||
# Only capture ERROR (40) and CRITICAL (50) levels
|
||||
if record.levelno < logging.ERROR:
|
||||
return
|
||||
|
||||
# Format timestamp as ISO 8601 UTC
|
||||
timestamp = datetime.fromtimestamp(
|
||||
record.created, tz=timezone.utc
|
||||
).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
|
||||
|
||||
# Create ErrorLog object
|
||||
error_log = ErrorLog(
|
||||
timestamp=timestamp,
|
||||
message=record.getMessage()
|
||||
)
|
||||
|
||||
# Thread-safe append
|
||||
with self._lock:
|
||||
self.error_logs.append(error_log)
|
||||
|
||||
except Exception:
|
||||
# Handler should never raise exceptions (logging best practice)
|
||||
self.handleError(record)
|
||||
|
||||
def get_error_logs(self) -> list[dict]:
|
||||
"""
|
||||
Retrieve all collected error logs as list of dictionaries.
|
||||
|
||||
Returns:
|
||||
List of error log dictionaries with timestamp and message
|
||||
"""
|
||||
with self._lock:
|
||||
return [log.to_dict() for log in self.error_logs]
|
||||
|
||||
def clear(self):
|
||||
"""Clear all collected error logs (useful for testing)"""
|
||||
with self._lock:
|
||||
self.error_logs.clear()
|
||||
@@ -1,18 +0,0 @@
|
||||
from dataclasses import dataclass, asdict
|
||||
|
||||
|
||||
@dataclass
|
||||
class ErrorLog:
|
||||
"""
|
||||
Represents a single error log entry for telemetry collection.
|
||||
|
||||
Attributes:
|
||||
timestamp: ISO 8601 formatted timestamp (UTC)
|
||||
message: Full error message text
|
||||
"""
|
||||
timestamp: str
|
||||
message: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization"""
|
||||
return asdict(self)
|
||||
@@ -1,4 +1,2 @@
|
||||
from .TeeLogHandler import TeeLogHandler
|
||||
from .ErrorLog import ErrorLog
|
||||
from .ErrorCollectionHandler import ErrorCollectionHandler
|
||||
from .functions import *
|
||||
|
||||
@@ -6,6 +6,7 @@ azure-identity==1.16.1
|
||||
azure-keyvault==4.2.0
|
||||
azure-mgmt-compute==30.5.0
|
||||
azure-mgmt-network==27.0.0
|
||||
itsdangerous==2.0.1
|
||||
coverage==7.6.12
|
||||
datetime==5.4
|
||||
docker>=6.0,<7.0 # docker 7.0+ has breaking changes with Unix sockets
|
||||
@@ -15,7 +16,7 @@ google-cloud-compute==1.22.0
|
||||
ibm_cloud_sdk_core==3.18.0
|
||||
ibm_vpc==0.20.0
|
||||
jinja2==3.1.6
|
||||
krkn-lib==6.0.1
|
||||
krkn-lib==5.1.13
|
||||
lxml==5.1.0
|
||||
kubernetes==34.1.0
|
||||
numpy==1.26.4
|
||||
@@ -32,8 +33,9 @@ requests-unixsocket>=0.4.0 # Required for Docker Unix socket support
|
||||
service_identity==24.1.0
|
||||
PyYAML==6.0.1
|
||||
setuptools==78.1.1
|
||||
wheel>=0.44.0
|
||||
zope.interface==6.1
|
||||
werkzeug==3.1.4
|
||||
wheel==0.42.0
|
||||
zope.interface==5.4.0
|
||||
|
||||
git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.0.0
|
||||
cryptography>=42.0.4 # not directly required, pinned by Snyk to avoid a vulnerability
|
||||
|
||||
@@ -27,7 +27,7 @@ from krkn_lib.models.telemetry import ChaosRunTelemetry
|
||||
from krkn_lib.utils import SafeLogger
|
||||
from krkn_lib.utils.functions import get_yaml_item_value, get_junit_test_case
|
||||
|
||||
from krkn.utils import TeeLogHandler, ErrorCollectionHandler
|
||||
from krkn.utils import TeeLogHandler
|
||||
from krkn.utils.HealthChecker import HealthChecker
|
||||
from krkn.utils.VirtChecker import VirtChecker
|
||||
from krkn.scenario_plugins.scenario_plugin_factory import (
|
||||
@@ -425,22 +425,16 @@ def main(options, command: Optional[str]) -> int:
|
||||
logging.info("collecting Kubernetes cluster metadata....")
|
||||
telemetry_k8s.collect_cluster_metadata(chaos_telemetry)
|
||||
|
||||
# Collect error logs from handler
|
||||
error_logs = error_collection_handler.get_error_logs()
|
||||
if error_logs:
|
||||
logging.info(f"Collected {len(error_logs)} error logs for telemetry")
|
||||
chaos_telemetry.error_logs = error_logs
|
||||
else:
|
||||
logging.info("No error logs collected during chaos run")
|
||||
chaos_telemetry.error_logs = []
|
||||
|
||||
telemetry_json = chaos_telemetry.to_json()
|
||||
decoded_chaos_run_telemetry = ChaosRunTelemetry(json.loads(telemetry_json))
|
||||
chaos_output.telemetry = decoded_chaos_run_telemetry
|
||||
logging.info(f"Chaos data:\n{chaos_output.to_json()}")
|
||||
if enable_elastic:
|
||||
elastic_telemetry = ElasticChaosRunTelemetry(
|
||||
chaos_run_telemetry=decoded_chaos_run_telemetry
|
||||
)
|
||||
result = elastic_search.push_telemetry(
|
||||
decoded_chaos_run_telemetry, elastic_telemetry_index
|
||||
elastic_telemetry, elastic_telemetry_index
|
||||
)
|
||||
if result == -1:
|
||||
safe_logger.error(
|
||||
@@ -652,13 +646,10 @@ if __name__ == "__main__":
|
||||
# If no command or regular execution, continue with existing logic
|
||||
report_file = options.output
|
||||
tee_handler = TeeLogHandler()
|
||||
error_collection_handler = ErrorCollectionHandler(level=logging.ERROR)
|
||||
|
||||
handlers = [
|
||||
logging.FileHandler(report_file, mode="w"),
|
||||
logging.StreamHandler(),
|
||||
tee_handler,
|
||||
error_collection_handler,
|
||||
]
|
||||
|
||||
logging.basicConfig(
|
||||
|
||||
@@ -43,15 +43,14 @@ class TestGCP(unittest.TestCase):
|
||||
def setUp(self):
|
||||
"""Set up test fixtures"""
|
||||
# Mock google.auth before creating GCP instance
|
||||
self.auth_patcher = patch('krkn.scenario_plugins.node_actions.gcp_node_scenarios.google.auth.default')
|
||||
self.compute_patcher = patch('krkn.scenario_plugins.node_actions.gcp_node_scenarios.compute_v1.InstancesClient')
|
||||
self.auth_patcher = patch('google.auth.default')
|
||||
self.compute_patcher = patch('google.cloud.compute_v1.InstancesClient')
|
||||
|
||||
self.mock_auth = self.auth_patcher.start()
|
||||
self.mock_compute_client = self.compute_patcher.start()
|
||||
|
||||
# Configure auth mock to return credentials and project_id
|
||||
mock_credentials = MagicMock()
|
||||
self.mock_auth.return_value = (mock_credentials, 'test-project-123')
|
||||
self.mock_auth.return_value = (MagicMock(), 'test-project-123')
|
||||
|
||||
# Create GCP instance with mocked dependencies
|
||||
self.gcp = GCP()
|
||||
@@ -68,7 +67,7 @@ class TestGCP(unittest.TestCase):
|
||||
|
||||
def test_gcp_init_failure(self):
|
||||
"""Test GCP class initialization failure"""
|
||||
with patch('krkn.scenario_plugins.node_actions.gcp_node_scenarios.google.auth.default', side_effect=Exception("Auth error")):
|
||||
with patch('google.auth.default', side_effect=Exception("Auth error")):
|
||||
with self.assertRaises(Exception):
|
||||
GCP()
|
||||
|
||||
|
||||
@@ -1,31 +0,0 @@
|
||||
# ⚠️ DEPRECATED
|
||||
|
||||
This directory is **no longer actively maintained** and will not accept new changes.
|
||||
|
||||
## Migration Notice
|
||||
|
||||
All development efforts have been moved to:
|
||||
|
||||
**[github.com/krkn-chaos/krkn-ai](https://github.com/krkn-chaos/krkn-ai)**
|
||||
|
||||
## What This Means
|
||||
|
||||
- ❌ No new features will be added here
|
||||
- ❌ Bug fixes will not be accepted
|
||||
- ❌ Pull requests will be closed and redirected
|
||||
- ℹ️ Existing code remains for historical reference only
|
||||
|
||||
## Next Steps
|
||||
|
||||
If you're looking to:
|
||||
- **Use** chaos engineering AI features → Visit [krkn-chaos/krkn-ai](https://github.com/krkn-chaos/krkn-ai)
|
||||
- **Contribute** improvements → Submit to [krkn-chaos/krkn-ai](https://github.com/krkn-chaos/krkn-ai)
|
||||
- **Report issues** → Open issues at [krkn-chaos/krkn-ai](https://github.com/krkn-chaos/krkn-ai/issues)
|
||||
|
||||
## Questions?
|
||||
|
||||
Please visit the new repository for documentation, examples, and community support.
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** January 2026
|
||||
@@ -1,17 +1,3 @@
|
||||
# ⚠️ DEPRECATED - This project has moved
|
||||
|
||||
> **All development has moved to [github.com/krkn-chaos/krkn-ai](https://github.com/krkn-chaos/krkn-ai)**
|
||||
>
|
||||
> This directory is no longer maintained. Please visit the new repository for:
|
||||
> - Latest features and updates
|
||||
> - Active development and support
|
||||
> - Bug fixes and improvements
|
||||
> - Documentation and examples
|
||||
>
|
||||
> See [../README.md](../README.md) for more information.
|
||||
|
||||
---
|
||||
|
||||
# aichaos
|
||||
Enhancing Chaos Engineering with AI-assisted fault injection for better resiliency and non-functional testing.
|
||||
|
||||
|
||||
@@ -3,9 +3,8 @@ pandas
|
||||
notebook
|
||||
jupyterlab
|
||||
jupyter
|
||||
seaborn==0.13.2
|
||||
seaborn
|
||||
requests
|
||||
wheel
|
||||
Flask==2.2.5
|
||||
Flask==2.1.0
|
||||
flasgger==0.9.5
|
||||
pillow==10.3.0
|
||||
@@ -1,17 +1,3 @@
|
||||
# ⚠️ DEPRECATED - This project has moved
|
||||
|
||||
> **All development has moved to [github.com/krkn-chaos/krkn-ai](https://github.com/krkn-chaos/krkn-ai)**
|
||||
>
|
||||
> This directory is no longer maintained. Please visit the new repository for:
|
||||
> - Latest features and updates
|
||||
> - Active development and support
|
||||
> - Bug fixes and improvements
|
||||
> - Documentation and examples
|
||||
>
|
||||
> See [../README.md](../README.md) for more information.
|
||||
|
||||
---
|
||||
|
||||
# Chaos Recommendation Tool
|
||||
|
||||
This tool, designed for Redhat Kraken, operates through the command line and offers recommendations for chaos testing. It suggests probable chaos test cases that can disrupt application services by analyzing their behavior and assessing their susceptibility to specific fault types.
|
||||
|
||||
Reference in New Issue
Block a user