mirror of
https://github.com/krkn-chaos/krkn.git
synced 2026-02-19 20:40:33 +00:00
Compare commits
88 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
8dad2a3996 | ||
|
|
cebc60f5a8 | ||
|
|
2065443622 | ||
|
|
b6ef7fa052 | ||
|
|
4f305e78aa | ||
|
|
b17e933134 | ||
|
|
beea484597 | ||
|
|
0222b0f161 | ||
|
|
f7e674d5ad | ||
|
|
7aea12ce6c | ||
|
|
625e1e90cf | ||
|
|
a9f1ce8f1b | ||
|
|
66e364e293 | ||
|
|
898ce76648 | ||
|
|
4a0f4e7cab | ||
|
|
819191866d | ||
|
|
37ca4bbce7 | ||
|
|
b9dd4e40d3 | ||
|
|
3fd249bb88 | ||
|
|
773107245c | ||
|
|
05bc201528 | ||
|
|
9a316550e1 | ||
|
|
9c261e2599 | ||
|
|
0cc82dc65d | ||
|
|
269e21e9eb | ||
|
|
d0dbe3354a | ||
|
|
4a0686daf3 | ||
|
|
822bebac0c | ||
|
|
a13150b0f5 | ||
|
|
0443637fe1 | ||
|
|
36585630f2 | ||
|
|
1401724312 | ||
|
|
fa204a515c | ||
|
|
b3a5fc2d53 | ||
|
|
05600b62b3 | ||
|
|
126599e02c | ||
|
|
b3d6a19d24 | ||
|
|
65100f26a7 | ||
|
|
10b6e4663e | ||
|
|
ce52183a26 | ||
|
|
e9ab3b47b3 | ||
|
|
3e14fe07b7 | ||
|
|
d9271a4bcc | ||
|
|
850930631e | ||
|
|
15eee80c55 | ||
|
|
ff3c4f5313 | ||
|
|
4c74df301f | ||
|
|
b60b66de43 | ||
|
|
2458022248 | ||
|
|
18385cba2b | ||
|
|
e7fa6bdebc | ||
|
|
c3f6b1a7ff | ||
|
|
f2ba8b85af | ||
|
|
ba3fdea403 | ||
|
|
42d18a8e04 | ||
|
|
4b3617bd8a | ||
|
|
eb7a1e243c | ||
|
|
197ce43f9a | ||
|
|
eecdeed73c | ||
|
|
ef606d0f17 | ||
|
|
9981c26304 | ||
|
|
4ebfc5dde5 | ||
|
|
4527d073c6 | ||
|
|
93d6967331 | ||
|
|
b462c46b28 | ||
|
|
ab4ae85896 | ||
|
|
6acd6f9bd3 | ||
|
|
787759a591 | ||
|
|
957cb355be | ||
|
|
35609484d4 | ||
|
|
959337eb63 | ||
|
|
f4bdbff9dc | ||
|
|
954202cab7 | ||
|
|
a373dcf453 | ||
|
|
d0c604a516 | ||
|
|
82582f5bc3 | ||
|
|
37f0f1eb8b | ||
|
|
d2eab21f95 | ||
|
|
d84910299a | ||
|
|
48f19c0a0e | ||
|
|
eb86885bcd | ||
|
|
967fd14bd7 | ||
|
|
5cefe80286 | ||
|
|
9ee76ce337 | ||
|
|
fd3e7ee2c8 | ||
|
|
c85c435b5d | ||
|
|
d5284ace25 | ||
|
|
c3098ec80b |
4
.coveragerc
Normal file
4
.coveragerc
Normal file
@@ -0,0 +1,4 @@
|
||||
[run]
|
||||
omit =
|
||||
tests/*
|
||||
krkn/tests/**
|
||||
40
.github/PULL_REQUEST_TEMPLATE.md
vendored
40
.github/PULL_REQUEST_TEMPLATE.md
vendored
@@ -1,27 +1,47 @@
|
||||
## Type of change
|
||||
# Type of change
|
||||
|
||||
- [ ] Refactor
|
||||
- [ ] New feature
|
||||
- [ ] Bug fix
|
||||
- [ ] Optimization
|
||||
|
||||
## Description
|
||||
<!-- Provide a brief description of the changes made in this PR. -->
|
||||
# Description
|
||||
<-- Provide a brief description of the changes made in this PR. -->
|
||||
|
||||
## Related Tickets & Documents
|
||||
If no related issue, please create one and start the converasation on wants of
|
||||
|
||||
- Related Issue #
|
||||
- Closes #
|
||||
- Related Issue #:
|
||||
- Closes #:
|
||||
|
||||
## Documentation
|
||||
# Documentation
|
||||
- [ ] **Is documentation needed for this update?**
|
||||
|
||||
If checked, a documentation PR must be created and merged in the [website repository](https://github.com/krkn-chaos/website/).
|
||||
|
||||
## Related Documentation PR (if applicable)
|
||||
<!-- Add the link to the corresponding documentation PR in the website repository -->
|
||||
<-- Add the link to the corresponding documentation PR in the website repository -->
|
||||
|
||||
## Checklist before requesting a review
|
||||
# Checklist before requesting a review
|
||||
[ ] Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See [contributing guidelines](https://krkn-chaos.dev/docs/contribution-guidelines/)
|
||||
See [testing your changes](https://krkn-chaos.dev/docs/developers-guide/testing-changes/) and run on any Kubernetes or OpenShift cluster to validate your changes
|
||||
- [ ] I have performed a self-review of my code by running krkn and specific scenario
|
||||
- [ ] If it is a core feature, I have added thorough unit tests with above 80% coverage
|
||||
|
||||
- [ ] I have performed a self-review of my code.
|
||||
- [ ] If it is a core feature, I have added thorough tests.
|
||||
*REQUIRED*:
|
||||
Description of combination of tests performed and output of run
|
||||
|
||||
```bash
|
||||
python run_kraken.py
|
||||
...
|
||||
<---insert test results output--->
|
||||
```
|
||||
|
||||
OR
|
||||
|
||||
|
||||
```bash
|
||||
python -m coverage run -a -m unittest discover -s tests -v
|
||||
...
|
||||
<---insert test results output--->
|
||||
```
|
||||
|
||||
52
.github/workflows/stale.yml
vendored
Normal file
52
.github/workflows/stale.yml
vendored
Normal file
@@ -0,0 +1,52 @@
|
||||
name: Manage Stale Issues and Pull Requests
|
||||
|
||||
on:
|
||||
schedule:
|
||||
# Run daily at 1:00 AM UTC
|
||||
- cron: '0 1 * * *'
|
||||
workflow_dispatch:
|
||||
|
||||
permissions:
|
||||
issues: write
|
||||
pull-requests: write
|
||||
|
||||
jobs:
|
||||
stale:
|
||||
name: Mark and Close Stale Issues and PRs
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Mark and close stale issues and PRs
|
||||
uses: actions/stale@v9
|
||||
with:
|
||||
days-before-issue-stale: 60
|
||||
days-before-issue-close: 14
|
||||
stale-issue-label: 'stale'
|
||||
stale-issue-message: |
|
||||
This issue has been automatically marked as stale because it has not had any activity in the last 60 days.
|
||||
It will be closed in 14 days if no further activity occurs.
|
||||
If this issue is still relevant, please leave a comment or remove the stale label.
|
||||
Thank you for your contributions to krkn!
|
||||
close-issue-message: |
|
||||
This issue has been automatically closed due to inactivity.
|
||||
If you believe this issue is still relevant, please feel free to reopen it or create a new issue with updated information.
|
||||
Thank you for your understanding!
|
||||
close-issue-reason: 'not_planned'
|
||||
|
||||
days-before-pr-stale: 90
|
||||
days-before-pr-close: 14
|
||||
stale-pr-label: 'stale'
|
||||
stale-pr-message: |
|
||||
This pull request has been automatically marked as stale because it has not had any activity in the last 90 days.
|
||||
It will be closed in 14 days if no further activity occurs.
|
||||
If this PR is still relevant, please rebase it, address any pending reviews, or leave a comment.
|
||||
Thank you for your contributions to krkn!
|
||||
close-pr-message: |
|
||||
This pull request has been automatically closed due to inactivity.
|
||||
If you believe this PR is still relevant, please feel free to reopen it or create a new pull request with updated changes.
|
||||
Thank you for your understanding!
|
||||
|
||||
# Exempt labels
|
||||
exempt-issue-labels: 'bug,enhancement,good first issue'
|
||||
exempt-pr-labels: 'pending discussions,hold'
|
||||
|
||||
remove-stale-when-updated: true
|
||||
99
.github/workflows/tests.yml
vendored
99
.github/workflows/tests.yml
vendored
@@ -32,21 +32,33 @@ jobs:
|
||||
- name: Install Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: '3.9'
|
||||
python-version: '3.11'
|
||||
architecture: 'x64'
|
||||
- name: Install environment
|
||||
run: |
|
||||
sudo apt-get install build-essential python3-dev
|
||||
pip install --upgrade pip
|
||||
pip install -r requirements.txt
|
||||
pip install coverage
|
||||
|
||||
- name: Deploy test workloads
|
||||
run: |
|
||||
es_pod_name=$(kubectl get pods -l "app=elasticsearch-master" -o name)
|
||||
echo "POD_NAME: $es_pod_name"
|
||||
kubectl --namespace default port-forward $es_pod_name 9200 &
|
||||
prom_name=$(kubectl get pods -n monitoring -l "app.kubernetes.io/name=prometheus" -o name)
|
||||
kubectl --namespace monitoring port-forward $prom_name 9090 &
|
||||
# es_pod_name=$(kubectl get pods -l "app=elasticsearch-master" -o name)
|
||||
# echo "POD_NAME: $es_pod_name"
|
||||
# kubectl --namespace default port-forward $es_pod_name 9200 &
|
||||
# prom_name=$(kubectl get pods -n monitoring -l "app.kubernetes.io/name=prometheus" -o name)
|
||||
# kubectl --namespace monitoring port-forward $prom_name 9090 &
|
||||
|
||||
# Wait for Elasticsearch to be ready
|
||||
echo "Waiting for Elasticsearch to be ready..."
|
||||
for i in {1..30}; do
|
||||
if curl -k -s -u elastic:$ELASTIC_PASSWORD https://localhost:9200/_cluster/health > /dev/null 2>&1; then
|
||||
echo "Elasticsearch is ready!"
|
||||
break
|
||||
fi
|
||||
echo "Attempt $i: Elasticsearch not ready yet, waiting..."
|
||||
sleep 2
|
||||
done
|
||||
kubectl apply -f CI/templates/outage_pod.yaml
|
||||
kubectl wait --for=condition=ready pod -l scenario=outage --timeout=300s
|
||||
kubectl apply -f CI/templates/container_scenario_pod.yaml
|
||||
@@ -56,39 +68,43 @@ jobs:
|
||||
kubectl wait --for=condition=ready pod -l scenario=time-skew --timeout=300s
|
||||
kubectl apply -f CI/templates/service_hijacking.yaml
|
||||
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=proxy" --timeout=300s
|
||||
kubectl apply -f CI/legacy/scenarios/volume_scenario.yaml
|
||||
kubectl wait --for=condition=ready pod kraken-test-pod -n kraken --timeout=300s
|
||||
- name: Get Kind nodes
|
||||
run: |
|
||||
kubectl get nodes --show-labels=true
|
||||
# Pull request only steps
|
||||
- name: Run unit tests
|
||||
if: github.event_name == 'pull_request'
|
||||
run: python -m coverage run -a -m unittest discover -s tests -v
|
||||
|
||||
- name: Setup Pull Request Functional Tests
|
||||
if: |
|
||||
github.event_name == 'pull_request'
|
||||
- name: Setup Functional Tests
|
||||
run: |
|
||||
yq -i '.kraken.port="8081"' CI/config/common_test_config.yaml
|
||||
yq -i '.kraken.signal_address="0.0.0.0"' CI/config/common_test_config.yaml
|
||||
yq -i '.kraken.performance_monitoring="localhost:9090"' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.elastic_port=9200' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.elastic_url="https://localhost"' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.enable_elastic=True' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.enable_elastic=False' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.password="${{env.ELASTIC_PASSWORD}}"' CI/config/common_test_config.yaml
|
||||
yq -i '.performance_monitoring.prometheus_url="http://localhost:9090"' CI/config/common_test_config.yaml
|
||||
echo "test_service_hijacking" > ./CI/tests/functional_tests
|
||||
echo "test_app_outages" >> ./CI/tests/functional_tests
|
||||
echo "test_container" >> ./CI/tests/functional_tests
|
||||
echo "test_pod" >> ./CI/tests/functional_tests
|
||||
echo "test_customapp_pod" >> ./CI/tests/functional_tests
|
||||
echo "test_namespace" >> ./CI/tests/functional_tests
|
||||
echo "test_net_chaos" >> ./CI/tests/functional_tests
|
||||
echo "test_time" >> ./CI/tests/functional_tests
|
||||
echo "test_app_outages" > ./CI/tests/functional_tests
|
||||
echo "test_container" >> ./CI/tests/functional_tests
|
||||
echo "test_cpu_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_memory_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_customapp_pod" >> ./CI/tests/functional_tests
|
||||
echo "test_io_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_network_filter" >> ./CI/tests/functional_tests
|
||||
echo "test_memory_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_namespace" >> ./CI/tests/functional_tests
|
||||
echo "test_net_chaos" >> ./CI/tests/functional_tests
|
||||
echo "test_node" >> ./CI/tests/functional_tests
|
||||
|
||||
echo "test_service_hijacking" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_network_filter" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_server" >> ./CI/tests/functional_tests
|
||||
echo "test_time" >> ./CI/tests/functional_tests
|
||||
echo "test_node_network_chaos" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_network_chaos" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_error" >> ./CI/tests/functional_tests
|
||||
echo "test_pod" >> ./CI/tests/functional_tests
|
||||
# echo "test_pvc" >> ./CI/tests/functional_tests
|
||||
|
||||
|
||||
# Push on main only steps + all other functional to collect coverage
|
||||
# for the badge
|
||||
@@ -102,30 +118,9 @@ jobs:
|
||||
- name: Setup Post Merge Request Functional Tests
|
||||
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
|
||||
run: |
|
||||
yq -i '.kraken.port="8081"' CI/config/common_test_config.yaml
|
||||
yq -i '.kraken.signal_address="0.0.0.0"' CI/config/common_test_config.yaml
|
||||
yq -i '.kraken.performance_monitoring="localhost:9090"' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.enable_elastic=True' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.password="${{env.ELASTIC_PASSWORD}}"' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.elastic_port=9200' CI/config/common_test_config.yaml
|
||||
yq -i '.elastic.elastic_url="https://localhost"' CI/config/common_test_config.yaml
|
||||
yq -i '.performance_monitoring.prometheus_url="http://localhost:9090"' CI/config/common_test_config.yaml
|
||||
yq -i '.telemetry.username="${{secrets.TELEMETRY_USERNAME}}"' CI/config/common_test_config.yaml
|
||||
yq -i '.telemetry.password="${{secrets.TELEMETRY_PASSWORD}}"' CI/config/common_test_config.yaml
|
||||
echo "test_telemetry" > ./CI/tests/functional_tests
|
||||
echo "test_service_hijacking" >> ./CI/tests/functional_tests
|
||||
echo "test_app_outages" >> ./CI/tests/functional_tests
|
||||
echo "test_container" >> ./CI/tests/functional_tests
|
||||
echo "test_pod" >> ./CI/tests/functional_tests
|
||||
echo "test_customapp_pod" >> ./CI/tests/functional_tests
|
||||
echo "test_namespace" >> ./CI/tests/functional_tests
|
||||
echo "test_net_chaos" >> ./CI/tests/functional_tests
|
||||
echo "test_time" >> ./CI/tests/functional_tests
|
||||
echo "test_cpu_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_memory_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_io_hog" >> ./CI/tests/functional_tests
|
||||
echo "test_pod_network_filter" >> ./CI/tests/functional_tests
|
||||
|
||||
echo "test_telemetry" >> ./CI/tests/functional_tests
|
||||
# Final common steps
|
||||
- name: Run Functional tests
|
||||
env:
|
||||
@@ -135,38 +130,38 @@ jobs:
|
||||
cat ./CI/results.markdown >> $GITHUB_STEP_SUMMARY
|
||||
echo >> $GITHUB_STEP_SUMMARY
|
||||
- name: Upload CI logs
|
||||
if: ${{ success() || failure() }}
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: ci-logs
|
||||
path: CI/out
|
||||
if-no-files-found: error
|
||||
- name: Collect coverage report
|
||||
if: ${{ success() || failure() }}
|
||||
if: ${{ always() }}
|
||||
run: |
|
||||
python -m coverage html
|
||||
python -m coverage json
|
||||
- name: Publish coverage report to job summary
|
||||
if: ${{ success() || failure() }}
|
||||
if: ${{ always() }}
|
||||
run: |
|
||||
pip install html2text
|
||||
html2text --ignore-images --ignore-links -b 0 htmlcov/index.html >> $GITHUB_STEP_SUMMARY
|
||||
- name: Upload coverage data
|
||||
if: ${{ success() || failure() }}
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: coverage
|
||||
path: htmlcov
|
||||
if-no-files-found: error
|
||||
- name: Upload json coverage
|
||||
if: ${{ success() || failure() }}
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: coverage.json
|
||||
path: coverage.json
|
||||
if-no-files-found: error
|
||||
- name: Check CI results
|
||||
if: ${{ success() || failure() }}
|
||||
if: ${{ always() }}
|
||||
run: "! grep Fail CI/results.markdown"
|
||||
|
||||
badge:
|
||||
@@ -191,7 +186,7 @@ jobs:
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: 3.9
|
||||
python-version: '3.11'
|
||||
- name: Copy badge on GitHub Page Repo
|
||||
env:
|
||||
COLOR: yellow
|
||||
|
||||
@@ -2,6 +2,10 @@ kraken:
|
||||
distribution: kubernetes # Distribution can be kubernetes or openshift.
|
||||
kubeconfig_path: ~/.kube/config # Path to kubeconfig.
|
||||
exit_on_failure: False # Exit when a post action scenario fails.
|
||||
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
|
||||
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
|
||||
signal_address: 0.0.0.0 # Signal listening address
|
||||
port: 8081 # Signal port
|
||||
auto_rollback: True # Enable auto rollback for scenarios.
|
||||
rollback_versions_directory: /tmp/kraken-rollback # Directory to store rollback version files.
|
||||
chaos_scenarios: # List of policies/chaos scenarios to load.
|
||||
@@ -38,7 +42,7 @@ telemetry:
|
||||
prometheus_backup: True # enables/disables prometheus data collection
|
||||
full_prometheus_backup: False # if is set to False only the /prometheus/wal folder will be downloaded.
|
||||
backup_threads: 5 # number of telemetry download/upload threads
|
||||
archive_path: /tmp # local path where the archive files will be temporarly stored
|
||||
archive_path: /tmp # local path where the archive files will be temporarily stored
|
||||
max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
|
||||
run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
|
||||
archive_size: 10000 # the size of the prometheus data archive size in KB. The lower the size of archive is
|
||||
|
||||
@@ -45,15 +45,45 @@ metadata:
|
||||
name: kraken-test-pod
|
||||
namespace: kraken
|
||||
spec:
|
||||
securityContext:
|
||||
fsGroup: 1001
|
||||
# initContainer to fix permissions on the mounted volume
|
||||
initContainers:
|
||||
- name: fix-permissions
|
||||
image: 'quay.io/centos7/httpd-24-centos7:centos7'
|
||||
command:
|
||||
- sh
|
||||
- -c
|
||||
- |
|
||||
echo "Setting up permissions for /home/kraken..."
|
||||
# Create the directory if it doesn't exist
|
||||
mkdir -p /home/kraken
|
||||
# Set ownership to user 1001 and group 1001
|
||||
chown -R 1001:1001 /home/kraken
|
||||
# Set permissions to allow read/write
|
||||
chmod -R 755 /home/kraken
|
||||
rm -rf /home/kraken/*
|
||||
echo "Permissions fixed. Current state:"
|
||||
ls -la /home/kraken
|
||||
volumeMounts:
|
||||
- mountPath: "/home/kraken"
|
||||
name: kraken-test-pv
|
||||
securityContext:
|
||||
runAsUser: 0 # Run as root to fix permissions
|
||||
volumes:
|
||||
- name: kraken-test-pv
|
||||
persistentVolumeClaim:
|
||||
claimName: kraken-test-pvc
|
||||
containers:
|
||||
- name: kraken-test-container
|
||||
image: 'quay.io/centos7/httpd-24-centos7:latest'
|
||||
volumeMounts:
|
||||
- mountPath: "/home/krake-dir/"
|
||||
name: kraken-test-pv
|
||||
image: 'quay.io/centos7/httpd-24-centos7:centos7'
|
||||
securityContext:
|
||||
privileged: true
|
||||
runAsUser: 1001
|
||||
runAsNonRoot: true
|
||||
allowPrivilegeEscalation: false
|
||||
capabilities:
|
||||
drop:
|
||||
- ALL
|
||||
volumeMounts:
|
||||
- mountPath: "/home/kraken"
|
||||
name: kraken-test-pv
|
||||
|
||||
@@ -19,6 +19,7 @@ function functional_test_app_outage {
|
||||
kubectl get pods
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/app_outage.yaml
|
||||
cat $scenario_file
|
||||
cat CI/config/app_outage.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/app_outage.yaml
|
||||
echo "App outage scenario test: Success"
|
||||
}
|
||||
|
||||
@@ -16,8 +16,10 @@ function functional_test_container_crash {
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/container_config.yaml
|
||||
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/container_config.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/container_config.yaml -d True
|
||||
echo "Container scenario test: Success"
|
||||
|
||||
kubectl get pods -n kube-system -l component=etcd
|
||||
}
|
||||
|
||||
functional_test_container_crash
|
||||
|
||||
@@ -11,7 +11,7 @@ function functional_test_customapp_pod_node_selector {
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/customapp_pod_config.yaml
|
||||
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/customapp_pod_config.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/customapp_pod_config.yaml -d True
|
||||
echo "Pod disruption with node_label_selector test: Success"
|
||||
}
|
||||
|
||||
|
||||
18
CI/tests/test_node.sh
Executable file
18
CI/tests/test_node.sh
Executable file
@@ -0,0 +1,18 @@
|
||||
uset -xeEo pipefail
|
||||
|
||||
source CI/tests/common.sh
|
||||
|
||||
trap error ERR
|
||||
trap finish EXIT
|
||||
|
||||
function functional_test_node_stop_start {
|
||||
export scenario_type="node_scenarios"
|
||||
export scenario_file="scenarios/kind/node_scenarios_example.yml"
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/node_config.yaml
|
||||
cat CI/config/node_config.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/node_config.yaml
|
||||
echo "Node Stop/Start scenario test: Success"
|
||||
}
|
||||
|
||||
functional_test_node_stop_start
|
||||
165
CI/tests/test_node_network_chaos.sh
Executable file
165
CI/tests/test_node_network_chaos.sh
Executable file
@@ -0,0 +1,165 @@
|
||||
set -xeEo pipefail
|
||||
|
||||
source CI/tests/common.sh
|
||||
|
||||
trap error ERR
|
||||
trap finish EXIT
|
||||
|
||||
function functional_test_node_network_chaos {
|
||||
echo "Starting node network chaos functional test"
|
||||
|
||||
# Get a worker node
|
||||
get_node
|
||||
export TARGET_NODE=$(echo $WORKER_NODE | awk '{print $1}')
|
||||
echo "Target node: $TARGET_NODE"
|
||||
|
||||
# Deploy nginx workload on the target node
|
||||
echo "Deploying nginx workload on $TARGET_NODE..."
|
||||
kubectl create deployment nginx-node-net-chaos --image=nginx:latest
|
||||
|
||||
# Add node selector to ensure pod runs on target node
|
||||
kubectl patch deployment nginx-node-net-chaos -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"'$TARGET_NODE'"}}}}}'
|
||||
|
||||
# Expose service
|
||||
kubectl expose deployment nginx-node-net-chaos --port=80 --target-port=80 --name=nginx-node-net-chaos-svc
|
||||
|
||||
# Wait for nginx to be ready
|
||||
echo "Waiting for nginx pod to be ready on $TARGET_NODE..."
|
||||
kubectl wait --for=condition=ready pod -l app=nginx-node-net-chaos --timeout=120s
|
||||
|
||||
# Verify pod is on correct node
|
||||
export POD_NAME=$(kubectl get pods -l app=nginx-node-net-chaos -o jsonpath='{.items[0].metadata.name}')
|
||||
export POD_NODE=$(kubectl get pod $POD_NAME -o jsonpath='{.spec.nodeName}')
|
||||
echo "Pod $POD_NAME is running on node $POD_NODE"
|
||||
|
||||
if [ "$POD_NODE" != "$TARGET_NODE" ]; then
|
||||
echo "ERROR: Pod is not on target node (expected $TARGET_NODE, got $POD_NODE)"
|
||||
kubectl get pods -l app=nginx-node-net-chaos -o wide
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Setup port-forward to access nginx
|
||||
echo "Setting up port-forward to nginx service..."
|
||||
kubectl port-forward service/nginx-node-net-chaos-svc 8091:80 &
|
||||
PORT_FORWARD_PID=$!
|
||||
sleep 3 # Give port-forward time to start
|
||||
|
||||
# Test baseline connectivity
|
||||
echo "Testing baseline connectivity..."
|
||||
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:8091 || echo "000")
|
||||
if [ "$response" != "200" ]; then
|
||||
echo "ERROR: Nginx not responding correctly (got $response, expected 200)"
|
||||
kubectl get pods -l app=nginx-node-net-chaos
|
||||
kubectl describe pod $POD_NAME
|
||||
exit 1
|
||||
fi
|
||||
echo "Baseline test passed: nginx responding with 200"
|
||||
|
||||
# Measure baseline latency
|
||||
echo "Measuring baseline latency..."
|
||||
baseline_start=$(date +%s%3N)
|
||||
curl -s http://localhost:8091 > /dev/null || true
|
||||
baseline_end=$(date +%s%3N)
|
||||
baseline_latency=$((baseline_end - baseline_start))
|
||||
echo "Baseline latency: ${baseline_latency}ms"
|
||||
|
||||
# Configure node network chaos scenario
|
||||
echo "Configuring node network chaos scenario..."
|
||||
yq -i '.[0].config.target="'$TARGET_NODE'"' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.namespace="default"' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.test_duration=20' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.latency="200ms"' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.loss=15' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.bandwidth="10mbit"' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.ingress=true' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.egress=true' scenarios/kube/node-network-chaos.yml
|
||||
yq -i '.[0].config.force=false' scenarios/kube/node-network-chaos.yml
|
||||
yq -i 'del(.[0].config.interfaces)' scenarios/kube/node-network-chaos.yml
|
||||
|
||||
# Prepare krkn config
|
||||
export scenario_type="network_chaos_ng_scenarios"
|
||||
export scenario_file="scenarios/kube/node-network-chaos.yml"
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/node_network_chaos_config.yaml
|
||||
|
||||
# Run krkn in background
|
||||
echo "Starting krkn with node network chaos scenario..."
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/node_network_chaos_config.yaml &
|
||||
KRKN_PID=$!
|
||||
echo "Krkn started with PID: $KRKN_PID"
|
||||
|
||||
# Wait for chaos to start (give it time to inject chaos)
|
||||
echo "Waiting for chaos injection to begin..."
|
||||
sleep 10
|
||||
|
||||
# Test during chaos - check for increased latency or packet loss effects
|
||||
echo "Testing network behavior during chaos..."
|
||||
chaos_test_count=0
|
||||
chaos_success=0
|
||||
|
||||
for i in {1..5}; do
|
||||
chaos_test_count=$((chaos_test_count + 1))
|
||||
chaos_start=$(date +%s%3N)
|
||||
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 http://localhost:8091 || echo "000")
|
||||
chaos_end=$(date +%s%3N)
|
||||
chaos_latency=$((chaos_end - chaos_start))
|
||||
|
||||
echo "Attempt $i: HTTP $response, latency: ${chaos_latency}ms"
|
||||
|
||||
# We expect either increased latency or some failures due to packet loss
|
||||
if [ "$response" == "200" ] || [ "$response" == "000" ]; then
|
||||
chaos_success=$((chaos_success + 1))
|
||||
fi
|
||||
|
||||
sleep 2
|
||||
done
|
||||
|
||||
echo "Chaos test results: $chaos_success/$chaos_test_count requests processed"
|
||||
|
||||
# Verify node-level chaos affects pod
|
||||
echo "Verifying node-level chaos affects pod on $TARGET_NODE..."
|
||||
# The node chaos should affect all pods on the node
|
||||
|
||||
# Wait for krkn to complete
|
||||
echo "Waiting for krkn to complete..."
|
||||
wait $KRKN_PID || true
|
||||
echo "Krkn completed"
|
||||
|
||||
# Wait a bit for cleanup
|
||||
sleep 5
|
||||
|
||||
# Verify recovery - nginx should respond normally again
|
||||
echo "Verifying service recovery..."
|
||||
recovery_attempts=0
|
||||
max_recovery_attempts=10
|
||||
|
||||
while [ $recovery_attempts -lt $max_recovery_attempts ]; do
|
||||
recovery_attempts=$((recovery_attempts + 1))
|
||||
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:8091 || echo "000")
|
||||
|
||||
if [ "$response" == "200" ]; then
|
||||
echo "Recovery verified: nginx responding normally (attempt $recovery_attempts)"
|
||||
break
|
||||
fi
|
||||
|
||||
echo "Recovery attempt $recovery_attempts/$max_recovery_attempts: got $response, retrying..."
|
||||
sleep 3
|
||||
done
|
||||
|
||||
if [ "$response" != "200" ]; then
|
||||
echo "ERROR: Service did not recover after chaos (got $response)"
|
||||
kubectl get pods -l app=nginx-node-net-chaos
|
||||
kubectl describe pod $POD_NAME
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Cleanup
|
||||
echo "Cleaning up test resources..."
|
||||
kill $PORT_FORWARD_PID 2>/dev/null || true
|
||||
kubectl delete deployment nginx-node-net-chaos --ignore-not-found=true
|
||||
kubectl delete service nginx-node-net-chaos-svc --ignore-not-found=true
|
||||
|
||||
echo "Node network chaos test: Success"
|
||||
}
|
||||
|
||||
functional_test_node_network_chaos
|
||||
@@ -7,12 +7,15 @@ trap finish EXIT
|
||||
|
||||
function functional_test_pod_crash {
|
||||
export scenario_type="pod_disruption_scenarios"
|
||||
export scenario_file="scenarios/kind/pod_etcd.yml"
|
||||
export scenario_file="scenarios/kind/pod_path_provisioner.yml"
|
||||
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
|
||||
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml
|
||||
echo "Pod disruption scenario test: Success"
|
||||
date
|
||||
kubectl get pods -n local-path-storage -l app=local-path-provisioner -o yaml
|
||||
}
|
||||
|
||||
functional_test_pod_crash
|
||||
|
||||
31
CI/tests/test_pod_error.sh
Executable file
31
CI/tests/test_pod_error.sh
Executable file
@@ -0,0 +1,31 @@
|
||||
|
||||
|
||||
source CI/tests/common.sh
|
||||
|
||||
trap error ERR
|
||||
trap finish EXIT
|
||||
|
||||
function functional_test_pod_error {
|
||||
export scenario_type="pod_disruption_scenarios"
|
||||
export scenario_file="scenarios/kind/pod_etcd.yml"
|
||||
export post_config=""
|
||||
# this test will check if krkn exits with an error when too many pods are targeted
|
||||
yq -i '.[0].config.kill=5' scenarios/kind/pod_etcd.yml
|
||||
yq -i '.[0].config.krkn_pod_recovery_time=1' scenarios/kind/pod_etcd.yml
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
|
||||
cat CI/config/pod_config.yaml
|
||||
|
||||
cat scenarios/kind/pod_etcd.yml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml
|
||||
|
||||
ret=$?
|
||||
echo "\n\nret $ret"
|
||||
if [[ $ret -ge 1 ]]; then
|
||||
echo "Pod disruption error scenario test: Success"
|
||||
else
|
||||
echo "Pod disruption error scenario test: Failure"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
functional_test_pod_error
|
||||
143
CI/tests/test_pod_network_chaos.sh
Executable file
143
CI/tests/test_pod_network_chaos.sh
Executable file
@@ -0,0 +1,143 @@
|
||||
set -xeEo pipefail
|
||||
|
||||
source CI/tests/common.sh
|
||||
|
||||
trap error ERR
|
||||
trap finish EXIT
|
||||
|
||||
function functional_test_pod_network_chaos {
|
||||
echo "Starting pod network chaos functional test"
|
||||
|
||||
# Deploy nginx workload
|
||||
echo "Deploying nginx workload..."
|
||||
kubectl create deployment nginx-pod-net-chaos --image=nginx:latest
|
||||
kubectl expose deployment nginx-pod-net-chaos --port=80 --target-port=80 --name=nginx-pod-net-chaos-svc
|
||||
|
||||
# Wait for nginx to be ready
|
||||
echo "Waiting for nginx pod to be ready..."
|
||||
kubectl wait --for=condition=ready pod -l app=nginx-pod-net-chaos --timeout=120s
|
||||
|
||||
# Get pod name
|
||||
export POD_NAME=$(kubectl get pods -l app=nginx-pod-net-chaos -o jsonpath='{.items[0].metadata.name}')
|
||||
echo "Target pod: $POD_NAME"
|
||||
|
||||
# Setup port-forward to access nginx
|
||||
echo "Setting up port-forward to nginx service..."
|
||||
kubectl port-forward service/nginx-pod-net-chaos-svc 8090:80 &
|
||||
PORT_FORWARD_PID=$!
|
||||
sleep 3 # Give port-forward time to start
|
||||
|
||||
# Test baseline connectivity
|
||||
echo "Testing baseline connectivity..."
|
||||
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:8090 || echo "000")
|
||||
if [ "$response" != "200" ]; then
|
||||
echo "ERROR: Nginx not responding correctly (got $response, expected 200)"
|
||||
kubectl get pods -l app=nginx-pod-net-chaos
|
||||
kubectl describe pod $POD_NAME
|
||||
exit 1
|
||||
fi
|
||||
echo "Baseline test passed: nginx responding with 200"
|
||||
|
||||
# Measure baseline latency
|
||||
echo "Measuring baseline latency..."
|
||||
baseline_start=$(date +%s%3N)
|
||||
curl -s http://localhost:8090 > /dev/null || true
|
||||
baseline_end=$(date +%s%3N)
|
||||
baseline_latency=$((baseline_end - baseline_start))
|
||||
echo "Baseline latency: ${baseline_latency}ms"
|
||||
|
||||
# Configure pod network chaos scenario
|
||||
echo "Configuring pod network chaos scenario..."
|
||||
yq -i '.[0].config.target="'$POD_NAME'"' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.namespace="default"' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.test_duration=20' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.latency="200ms"' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.loss=15' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.bandwidth="10mbit"' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.ingress=true' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i '.[0].config.egress=true' scenarios/kube/pod-network-chaos.yml
|
||||
yq -i 'del(.[0].config.interfaces)' scenarios/kube/pod-network-chaos.yml
|
||||
|
||||
# Prepare krkn config
|
||||
export scenario_type="network_chaos_ng_scenarios"
|
||||
export scenario_file="scenarios/kube/pod-network-chaos.yml"
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/pod_network_chaos_config.yaml
|
||||
|
||||
# Run krkn in background
|
||||
echo "Starting krkn with pod network chaos scenario..."
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/pod_network_chaos_config.yaml &
|
||||
KRKN_PID=$!
|
||||
echo "Krkn started with PID: $KRKN_PID"
|
||||
|
||||
# Wait for chaos to start (give it time to inject chaos)
|
||||
echo "Waiting for chaos injection to begin..."
|
||||
sleep 10
|
||||
|
||||
# Test during chaos - check for increased latency or packet loss effects
|
||||
echo "Testing network behavior during chaos..."
|
||||
chaos_test_count=0
|
||||
chaos_success=0
|
||||
|
||||
for i in {1..5}; do
|
||||
chaos_test_count=$((chaos_test_count + 1))
|
||||
chaos_start=$(date +%s%3N)
|
||||
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 http://localhost:8090 || echo "000")
|
||||
chaos_end=$(date +%s%3N)
|
||||
chaos_latency=$((chaos_end - chaos_start))
|
||||
|
||||
echo "Attempt $i: HTTP $response, latency: ${chaos_latency}ms"
|
||||
|
||||
# We expect either increased latency or some failures due to packet loss
|
||||
if [ "$response" == "200" ] || [ "$response" == "000" ]; then
|
||||
chaos_success=$((chaos_success + 1))
|
||||
fi
|
||||
|
||||
sleep 2
|
||||
done
|
||||
|
||||
echo "Chaos test results: $chaos_success/$chaos_test_count requests processed"
|
||||
|
||||
# Wait for krkn to complete
|
||||
echo "Waiting for krkn to complete..."
|
||||
wait $KRKN_PID || true
|
||||
echo "Krkn completed"
|
||||
|
||||
# Wait a bit for cleanup
|
||||
sleep 5
|
||||
|
||||
# Verify recovery - nginx should respond normally again
|
||||
echo "Verifying service recovery..."
|
||||
recovery_attempts=0
|
||||
max_recovery_attempts=10
|
||||
|
||||
while [ $recovery_attempts -lt $max_recovery_attempts ]; do
|
||||
recovery_attempts=$((recovery_attempts + 1))
|
||||
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:8090 || echo "000")
|
||||
|
||||
if [ "$response" == "200" ]; then
|
||||
echo "Recovery verified: nginx responding normally (attempt $recovery_attempts)"
|
||||
break
|
||||
fi
|
||||
|
||||
echo "Recovery attempt $recovery_attempts/$max_recovery_attempts: got $response, retrying..."
|
||||
sleep 3
|
||||
done
|
||||
|
||||
if [ "$response" != "200" ]; then
|
||||
echo "ERROR: Service did not recover after chaos (got $response)"
|
||||
kubectl get pods -l app=nginx-pod-net-chaos
|
||||
kubectl describe pod $POD_NAME
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Cleanup
|
||||
echo "Cleaning up test resources..."
|
||||
kill $PORT_FORWARD_PID 2>/dev/null || true
|
||||
kubectl delete deployment nginx-pod-net-chaos --ignore-not-found=true
|
||||
kubectl delete service nginx-pod-net-chaos-svc --ignore-not-found=true
|
||||
|
||||
echo "Pod network chaos test: Success"
|
||||
}
|
||||
|
||||
functional_test_pod_network_chaos
|
||||
35
CI/tests/test_pod_server.sh
Executable file
35
CI/tests/test_pod_server.sh
Executable file
@@ -0,0 +1,35 @@
|
||||
set -xeEo pipefail
|
||||
|
||||
source CI/tests/common.sh
|
||||
|
||||
trap error ERR
|
||||
trap finish EXIT
|
||||
|
||||
function functional_test_pod_server {
|
||||
export scenario_type="pod_disruption_scenarios"
|
||||
export scenario_file="scenarios/kind/pod_etcd.yml"
|
||||
export post_config=""
|
||||
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
|
||||
yq -i '.[0].config.kill=1' scenarios/kind/pod_etcd.yml
|
||||
|
||||
yq -i '.tunings.daemon_mode=True' CI/config/pod_config.yaml
|
||||
cat CI/config/pod_config.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml &
|
||||
sleep 15
|
||||
curl -X POST http:/0.0.0.0:8081/STOP
|
||||
|
||||
wait
|
||||
|
||||
yq -i '.kraken.signal_state="PAUSE"' CI/config/pod_config.yaml
|
||||
yq -i '.tunings.daemon_mode=False' CI/config/pod_config.yaml
|
||||
cat CI/config/pod_config.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml &
|
||||
sleep 5
|
||||
curl -X POST http:/0.0.0.0:8081/RUN
|
||||
wait
|
||||
|
||||
echo "Pod disruption with server scenario test: Success"
|
||||
}
|
||||
|
||||
functional_test_pod_server
|
||||
18
CI/tests/test_pvc.sh
Executable file
18
CI/tests/test_pvc.sh
Executable file
@@ -0,0 +1,18 @@
|
||||
set -xeEo pipefail
|
||||
|
||||
source CI/tests/common.sh
|
||||
|
||||
trap error ERR
|
||||
trap finish EXIT
|
||||
|
||||
function functional_test_pvc_fill {
|
||||
export scenario_type="pvc_scenarios"
|
||||
export scenario_file="scenarios/kind/pvc_scenario.yaml"
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/pvc_config.yaml
|
||||
cat CI/config/pvc_config.yaml
|
||||
python3 -m coverage run -a run_kraken.py -c CI/config/pvc_config.yaml --debug True
|
||||
echo "PVC Fill scenario test: Success"
|
||||
}
|
||||
|
||||
functional_test_pvc_fill
|
||||
@@ -18,9 +18,8 @@ function functional_test_telemetry {
|
||||
yq -i '.performance_monitoring.prometheus_url="http://localhost:9090"' CI/config/common_test_config.yaml
|
||||
yq -i '.telemetry.run_tag=env(RUN_TAG)' CI/config/common_test_config.yaml
|
||||
|
||||
export scenario_type="hog_scenarios"
|
||||
|
||||
export scenario_file="scenarios/kube/cpu-hog.yml"
|
||||
export scenario_type="pod_disruption_scenarios"
|
||||
export scenario_file="scenarios/kind/pod_etcd.yml"
|
||||
|
||||
export post_config=""
|
||||
envsubst < CI/config/common_test_config.yaml > CI/config/telemetry.yaml
|
||||
|
||||
273
CLAUDE.md
Normal file
273
CLAUDE.md
Normal file
@@ -0,0 +1,273 @@
|
||||
# CLAUDE.md - Krkn Chaos Engineering Framework
|
||||
|
||||
## Project Overview
|
||||
|
||||
Krkn (Kraken) is a chaos engineering tool for Kubernetes/OpenShift clusters. It injects deliberate failures to validate cluster resilience. Plugin-based architecture with multi-cloud support (AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack).
|
||||
|
||||
## Repository Structure
|
||||
|
||||
```
|
||||
krkn/
|
||||
├── krkn/
|
||||
│ ├── scenario_plugins/ # Chaos scenario plugins (pod, node, network, hogs, etc.)
|
||||
│ ├── utils/ # Utility functions
|
||||
│ ├── rollback/ # Rollback management
|
||||
│ ├── prometheus/ # Prometheus integration
|
||||
│ └── cerberus/ # Health monitoring
|
||||
├── tests/ # Unit tests (unittest framework)
|
||||
├── scenarios/ # Example scenario configs (openshift/, kube/, kind/)
|
||||
├── config/ # Configuration files
|
||||
└── CI/ # CI/CD test scripts
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Setup (ALWAYS use virtual environment)
|
||||
python3 -m venv venv
|
||||
source venv/bin/activate
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Run Krkn
|
||||
python run_kraken.py --config config/config.yaml
|
||||
|
||||
# Note: Scenarios are specified in config.yaml under kraken.chaos_scenarios
|
||||
# There is no --scenario flag; edit config/config.yaml to select scenarios
|
||||
|
||||
# Run tests
|
||||
python -m unittest discover -s tests -v
|
||||
python -m coverage run -a -m unittest discover -s tests -v
|
||||
```
|
||||
|
||||
## Critical Requirements
|
||||
|
||||
### Python Environment
|
||||
- **Python 3.9+** required
|
||||
- **NEVER install packages globally** - always use virtual environment
|
||||
- **CRITICAL**: `docker` must be <7.0 and `requests` must be <2.32 (Unix socket compatibility)
|
||||
|
||||
### Key Dependencies
|
||||
- **krkn-lib** (5.1.13): Core library for Kubernetes/OpenShift operations
|
||||
- **kubernetes** (34.1.0): Kubernetes Python client
|
||||
- **docker** (<7.0), **requests** (<2.32): DO NOT upgrade without verifying compatibility
|
||||
- Cloud SDKs: boto3 (AWS), azure-mgmt-* (Azure), google-cloud-compute (GCP), ibm_vpc (IBM), pyVmomi (VMware)
|
||||
|
||||
## Plugin Architecture (CRITICAL)
|
||||
|
||||
**Strictly enforced naming conventions:**
|
||||
|
||||
### Naming Rules
|
||||
- **Module files**: Must end with `_scenario_plugin.py` and use snake_case
|
||||
- Example: `pod_disruption_scenario_plugin.py`
|
||||
- **Class names**: Must be CamelCase and end with `ScenarioPlugin`
|
||||
- Example: `PodDisruptionScenarioPlugin`
|
||||
- Must match module filename (snake_case ↔ CamelCase)
|
||||
- **Directory structure**: Plugin dirs CANNOT contain "scenario" or "plugin"
|
||||
- Location: `krkn/scenario_plugins/<plugin_name>/`
|
||||
|
||||
### Plugin Implementation
|
||||
Every plugin MUST:
|
||||
1. Extend `AbstractScenarioPlugin`
|
||||
2. Implement `run()` method
|
||||
3. Implement `get_scenario_types()` method
|
||||
|
||||
```python
|
||||
from krkn.scenario_plugins import AbstractScenarioPlugin
|
||||
|
||||
class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
def run(self, config, scenarios_list, kubeconfig_path, wait_duration):
|
||||
pass
|
||||
|
||||
def get_scenario_types(self):
|
||||
return ["pod_scenarios", "pod_outage"]
|
||||
```
|
||||
|
||||
### Creating a New Plugin
|
||||
1. Create directory: `krkn/scenario_plugins/<plugin_name>/`
|
||||
2. Create module: `<plugin_name>_scenario_plugin.py`
|
||||
3. Create class: `<PluginName>ScenarioPlugin` extending `AbstractScenarioPlugin`
|
||||
4. Implement `run()` and `get_scenario_types()`
|
||||
5. Create unit test: `tests/test_<plugin_name>_scenario_plugin.py`
|
||||
6. Add example scenario: `scenarios/<platform>/<scenario>.yaml`
|
||||
|
||||
**DO NOT**: Violate naming conventions (factory will reject), include "scenario"/"plugin" in directory names, create plugins without tests.
|
||||
|
||||
## Testing
|
||||
|
||||
### Unit Tests
|
||||
```bash
|
||||
# Run all tests
|
||||
python -m unittest discover -s tests -v
|
||||
|
||||
# Specific test
|
||||
python -m unittest tests.test_pod_disruption_scenario_plugin
|
||||
|
||||
# With coverage
|
||||
python -m coverage run -a -m unittest discover -s tests -v
|
||||
python -m coverage html
|
||||
```
|
||||
|
||||
**Test requirements:**
|
||||
- Naming: `test_<module>_scenario_plugin.py`
|
||||
- Mock external dependencies (Kubernetes API, cloud providers)
|
||||
- Test success, failure, and edge cases
|
||||
- Keep tests isolated and independent
|
||||
|
||||
### Functional Tests
|
||||
Located in `CI/tests/`. Can be run locally on a kind cluster with Prometheus and Elasticsearch set up.
|
||||
|
||||
**Setup for local testing:**
|
||||
1. Deploy Prometheus and Elasticsearch on your kind cluster:
|
||||
- Prometheus setup: https://krkn-chaos.dev/docs/developers-guide/testing-changes/#prometheus
|
||||
- Elasticsearch setup: https://krkn-chaos.dev/docs/developers-guide/testing-changes/#elasticsearch
|
||||
|
||||
2. Or disable monitoring features in `config/config.yaml`:
|
||||
```yaml
|
||||
performance_monitoring:
|
||||
enable_alerts: False
|
||||
enable_metrics: False
|
||||
check_critical_alerts: False
|
||||
```
|
||||
|
||||
**Note:** Functional tests run automatically in CI with full monitoring enabled.
|
||||
|
||||
## Cloud Provider Implementations
|
||||
|
||||
Node chaos scenarios are cloud-specific. Each in `krkn/scenario_plugins/node_actions/<provider>_node_scenarios.py`:
|
||||
- AWS, Azure, GCP, IBM Cloud, VMware, Alibaba, OpenStack, Bare Metal
|
||||
|
||||
Implement: stop, start, reboot, terminate instances.
|
||||
|
||||
**When modifying**: Maintain consistency with other providers, handle API errors, add logging, update tests.
|
||||
|
||||
### Adding Cloud Provider Support
|
||||
1. Create: `krkn/scenario_plugins/node_actions/<provider>_node_scenarios.py`
|
||||
2. Extend: `abstract_node_scenarios.AbstractNodeScenarios`
|
||||
3. Implement: `stop_instances`, `start_instances`, `reboot_instances`, `terminate_instances`
|
||||
4. Add SDK to `requirements.txt`
|
||||
5. Create unit test with mocked SDK
|
||||
6. Add example scenario: `scenarios/openshift/<provider>_node_scenarios.yml`
|
||||
|
||||
## Configuration
|
||||
|
||||
**Main config**: `config/config.yaml`
|
||||
- `kraken`: Core settings
|
||||
- `cerberus`: Health monitoring
|
||||
- `performance_monitoring`: Prometheus
|
||||
- `elastic`: Elasticsearch telemetry
|
||||
|
||||
**Scenario configs**: `scenarios/` directory
|
||||
```yaml
|
||||
- config:
|
||||
scenario_type: <type> # Must match plugin's get_scenario_types()
|
||||
```
|
||||
|
||||
## Code Style
|
||||
|
||||
- **Import order**: Standard library, third-party, local imports
|
||||
- **Naming**: snake_case (functions/variables), CamelCase (classes)
|
||||
- **Logging**: Use Python's `logging` module
|
||||
- **Error handling**: Return appropriate exit codes
|
||||
- **Docstrings**: Required for public functions/classes
|
||||
|
||||
## Exit Codes
|
||||
|
||||
Krkn uses specific exit codes to communicate execution status:
|
||||
|
||||
- `0`: Success - all scenarios passed, no critical alerts
|
||||
- `1`: Scenario failure - one or more scenarios failed
|
||||
- `2`: Critical alerts fired during execution
|
||||
- `3+`: Health check failure (Cerberus monitoring detected issues)
|
||||
|
||||
**When implementing scenarios:**
|
||||
- Return `0` on success
|
||||
- Return `1` on scenario-specific failures
|
||||
- Propagate health check failures appropriately
|
||||
- Log exit code reasons clearly
|
||||
|
||||
## Container Support
|
||||
|
||||
Krkn can run inside a container. See `containers/` directory.
|
||||
|
||||
**Building custom image:**
|
||||
```bash
|
||||
cd containers
|
||||
./compile_dockerfile.sh # Generates Dockerfile from template
|
||||
docker build -t krkn:latest .
|
||||
```
|
||||
|
||||
**Running containerized:**
|
||||
```bash
|
||||
docker run -v ~/.kube:/root/.kube:Z \
|
||||
-v $(pwd)/config:/config:Z \
|
||||
-v $(pwd)/scenarios:/scenarios:Z \
|
||||
krkn:latest
|
||||
```
|
||||
|
||||
## Git Workflow
|
||||
|
||||
- **NEVER commit directly to main**
|
||||
- **NEVER use `--force` without approval**
|
||||
- **ALWAYS create feature branches**: `git checkout -b feature/description`
|
||||
- **ALWAYS run tests before pushing**
|
||||
|
||||
**Conventional commits**: `feat:`, `fix:`, `test:`, `docs:`, `refactor:`
|
||||
|
||||
```bash
|
||||
git checkout main && git pull origin main
|
||||
git checkout -b feature/your-feature-name
|
||||
# Make changes, write tests
|
||||
python -m unittest discover -s tests -v
|
||||
git add <specific-files>
|
||||
git commit -m "feat: description"
|
||||
git push -u origin feature/your-feature-name
|
||||
```
|
||||
|
||||
## Environment Variables
|
||||
|
||||
- `KUBECONFIG`: Path to kubeconfig
|
||||
- `AWS_*`, `AZURE_*`, `GOOGLE_APPLICATION_CREDENTIALS`: Cloud credentials
|
||||
- `PROMETHEUS_URL`, `ELASTIC_URL`, `ELASTIC_PASSWORD`: Monitoring config
|
||||
|
||||
**NEVER commit credentials or API keys.**
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. Missing virtual environment - always activate venv
|
||||
2. Running functional tests without cluster setup
|
||||
3. Ignoring exit codes
|
||||
4. Modifying krkn-lib directly (it's a separate package)
|
||||
5. Upgrading docker/requests beyond version constraints
|
||||
|
||||
## Before Writing Code
|
||||
|
||||
1. Check for existing implementations
|
||||
2. Review existing plugins as examples
|
||||
3. Maintain consistency with cloud provider patterns
|
||||
4. Plan rollback logic
|
||||
5. Write tests alongside code
|
||||
6. Update documentation
|
||||
|
||||
## When Adding Dependencies
|
||||
|
||||
1. Check if functionality exists in krkn-lib or current dependencies
|
||||
2. Verify compatibility with existing versions
|
||||
3. Pin specific versions in `requirements.txt`
|
||||
4. Check for security vulnerabilities
|
||||
5. Test thoroughly for conflicts
|
||||
|
||||
## Common Development Tasks
|
||||
|
||||
### Modifying Existing Plugin
|
||||
1. Read plugin code and corresponding test
|
||||
2. Make changes
|
||||
3. Update/add unit tests
|
||||
4. Run: `python -m unittest tests.test_<plugin>_scenario_plugin`
|
||||
|
||||
### Writing Unit Tests
|
||||
1. Create: `tests/test_<module>_scenario_plugin.py`
|
||||
2. Import `unittest` and plugin class
|
||||
3. Mock external dependencies
|
||||
4. Test success, failure, and edge cases
|
||||
5. Run: `python -m unittest tests.test_<module>_scenario_plugin`
|
||||
|
||||
@@ -26,7 +26,7 @@ Here is an excerpt:
|
||||
## Maintainer Levels
|
||||
|
||||
### Contributor
|
||||
Contributors contributor to the community. Anyone can become a contributor by participating in discussions, reporting bugs, or contributing code or documentation.
|
||||
Contributors contribute to the community. Anyone can become a contributor by participating in discussions, reporting bugs, or contributing code or documentation.
|
||||
|
||||
#### Responsibilities:
|
||||
|
||||
@@ -80,4 +80,4 @@ Represent the project in the broader open-source community.
|
||||
|
||||
|
||||
# Credits
|
||||
Sections of this documents have been borrowed from [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md)
|
||||
Sections of this document have been borrowed from [Kubernetes governance](https://github.com/kubernetes/community/blob/master/governance.md)
|
||||
@@ -16,5 +16,5 @@ Following are a list of enhancements that we are planning to work on adding supp
|
||||
- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)
|
||||
- [x] [AI Chat bot to help get started with Krkn and commands](https://github.com/krkn-chaos/krkn-lightspeed)
|
||||
- [ ] [Ability to roll back cluster to original state if chaos fails](https://github.com/krkn-chaos/krkn/issues/804)
|
||||
- [ ] Add recovery time metrics to each scenario for each better regression analysis
|
||||
- [ ] Add recovery time metrics to each scenario for better regression analysis
|
||||
- [ ] [Add resiliency scoring to chaos scenarios ran on cluster](https://github.com/krkn-chaos/krkn/issues/125)
|
||||
@@ -40,4 +40,4 @@ The security team currently consists of the [Maintainers of Krkn](https://github
|
||||
|
||||
## Process and Supported Releases
|
||||
|
||||
The Krkn security team will investigate and provide a fix in a timely mannner depending on the severity. The fix will be included in the new release of Krkn and details will be included in the release notes.
|
||||
The Krkn security team will investigate and provide a fix in a timely manner depending on the severity. The fix will be included in the new release of Krkn and details will be included in the release notes.
|
||||
|
||||
@@ -39,7 +39,7 @@ cerberus:
|
||||
Sunday:
|
||||
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
|
||||
|
||||
custom_checks: # Relative paths of files conataining additional user defined checks
|
||||
custom_checks: # Relative paths of files containing additional user defined checks
|
||||
|
||||
tunings:
|
||||
timeout: 3 # Number of seconds before requests fail
|
||||
|
||||
@@ -50,13 +50,15 @@ kraken:
|
||||
- network_chaos_ng_scenarios:
|
||||
- scenarios/kube/pod-network-filter.yml
|
||||
- scenarios/kube/node-network-filter.yml
|
||||
- scenarios/kube/node-network-chaos.yml
|
||||
- scenarios/kube/pod-network-chaos.yml
|
||||
- kubevirt_vm_outage:
|
||||
- scenarios/kubevirt/kubevirt-vm-outage.yaml
|
||||
|
||||
cerberus:
|
||||
cerberus_enabled: False # Enable it when cerberus is previously installed
|
||||
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
|
||||
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
|
||||
performance_monitoring:
|
||||
prometheus_url: '' # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
@@ -93,7 +95,7 @@ telemetry:
|
||||
prometheus_pod_name: "" # name of the prometheus pod (if distribution is kubernetes)
|
||||
full_prometheus_backup: False # if is set to False only the /prometheus/wal folder will be downloaded.
|
||||
backup_threads: 5 # number of telemetry download/upload threads
|
||||
archive_path: /tmp # local path where the archive files will be temporarly stored
|
||||
archive_path: /tmp # local path where the archive files will be temporarily stored
|
||||
max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
|
||||
run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
|
||||
archive_size: 500000
|
||||
@@ -126,4 +128,6 @@ kubevirt_checks: # Utilizing virt che
|
||||
name: # Regex Name style of VMI's to watch, optional, will watch all VMI names in the namespace if left blank
|
||||
only_failures: False # Boolean of whether to show all VMI's failures and successful ssh connection (False), or only failure status' (True)
|
||||
disconnected: False # Boolean of how to try to connect to the VMIs; if True will use the ip_address to try ssh from within a node, if false will use the name and uses virtctl to try to connect; Default is False
|
||||
ssh_node: "" # If set, will be a backup way to ssh to a node. Will want to set to a node that isn't targeted in chaos
|
||||
ssh_node: "" # If set, will be a backup way to ssh to a node. Will want to set to a node that isn't targeted in chaos
|
||||
node_names: ""
|
||||
exit_on_failure: # If value is True and VMI's are failing post chaos returns failure, values can be True/False
|
||||
@@ -13,7 +13,7 @@ kraken:
|
||||
cerberus:
|
||||
cerberus_enabled: False # Enable it when cerberus is previously installed
|
||||
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
|
||||
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
|
||||
performance_monitoring:
|
||||
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
@@ -32,7 +32,7 @@ tunings:
|
||||
|
||||
telemetry:
|
||||
enabled: False # enable/disables the telemetry collection feature
|
||||
archive_path: /tmp # local path where the archive files will be temporarly stored
|
||||
archive_path: /tmp # local path where the archive files will be temporarily stored
|
||||
events_backup: False # enables/disables cluster events collection
|
||||
logs_backup: False
|
||||
|
||||
|
||||
@@ -14,7 +14,7 @@ kraken:
|
||||
cerberus:
|
||||
cerberus_enabled: False # Enable it when cerberus is previously installed
|
||||
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
|
||||
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
|
||||
performance_monitoring:
|
||||
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
|
||||
|
||||
@@ -35,7 +35,7 @@ kraken:
|
||||
cerberus:
|
||||
cerberus_enabled: True # Enable it when cerberus is previously installed
|
||||
cerberus_url: http://0.0.0.0:8080 # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
|
||||
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
check_application_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
|
||||
|
||||
performance_monitoring:
|
||||
deploy_dashboards: True # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
|
||||
@@ -61,7 +61,7 @@ telemetry:
|
||||
prometheus_backup: True # enables/disables prometheus data collection
|
||||
full_prometheus_backup: False # if is set to False only the /prometheus/wal folder will be downloaded.
|
||||
backup_threads: 5 # number of telemetry download/upload threads
|
||||
archive_path: /tmp # local path where the archive files will be temporarly stored
|
||||
archive_path: /tmp # local path where the archive files will be temporarily stored
|
||||
max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
|
||||
run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
|
||||
archive_size: 500000 # the size of the prometheus data archive size in KB. The lower the size of archive is
|
||||
|
||||
@@ -1,22 +1,35 @@
|
||||
# oc build
|
||||
FROM golang:1.23.1 AS oc-build
|
||||
FROM golang:1.24.9 AS oc-build
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends libkrb5-dev
|
||||
WORKDIR /tmp
|
||||
# oc build
|
||||
RUN git clone --branch release-4.18 https://github.com/openshift/oc.git
|
||||
WORKDIR /tmp/oc
|
||||
RUN go mod edit -go 1.23.1 &&\
|
||||
go get github.com/moby/buildkit@v0.12.5 &&\
|
||||
go get github.com/containerd/containerd@v1.7.11&&\
|
||||
go get github.com/docker/docker@v25.0.6&&\
|
||||
go get github.com/opencontainers/runc@v1.1.14&&\
|
||||
go get github.com/go-git/go-git/v5@v5.13.0&&\
|
||||
go get golang.org/x/net@v0.38.0&&\
|
||||
go get github.com/containerd/containerd@v1.7.27&&\
|
||||
go get golang.org/x/oauth2@v0.27.0&&\
|
||||
go get golang.org/x/crypto@v0.35.0&&\
|
||||
RUN go mod edit -go 1.24.9 &&\
|
||||
go mod edit -require github.com/moby/buildkit@v0.12.5 &&\
|
||||
go mod edit -require github.com/containerd/containerd@v1.7.29&&\
|
||||
go mod edit -require github.com/docker/docker@v27.5.1+incompatible&&\
|
||||
go mod edit -require github.com/opencontainers/runc@v1.2.8&&\
|
||||
go mod edit -require github.com/go-git/go-git/v5@v5.13.0&&\
|
||||
go mod edit -require github.com/opencontainers/selinux@v1.13.0&&\
|
||||
go mod edit -require github.com/ulikunitz/xz@v0.5.15&&\
|
||||
go mod edit -require golang.org/x/net@v0.38.0&&\
|
||||
go mod edit -require github.com/containerd/containerd@v1.7.27&&\
|
||||
go mod edit -require golang.org/x/oauth2@v0.27.0&&\
|
||||
go mod edit -require golang.org/x/crypto@v0.35.0&&\
|
||||
go mod edit -replace github.com/containerd/containerd@v1.7.27=github.com/containerd/containerd@v1.7.29&&\
|
||||
go mod tidy && go mod vendor
|
||||
|
||||
RUN make GO_REQUIRED_MIN_VERSION:= oc
|
||||
|
||||
# virtctl build
|
||||
WORKDIR /tmp
|
||||
RUN git clone https://github.com/kubevirt/kubevirt.git
|
||||
WORKDIR /tmp/kubevirt
|
||||
RUN go mod edit -go 1.24.9 &&\
|
||||
go work use &&\
|
||||
go build -o virtctl ./cmd/virtctl/
|
||||
|
||||
FROM fedora:40
|
||||
ARG PR_NUMBER
|
||||
ARG TAG
|
||||
@@ -28,16 +41,12 @@ ENV KUBECONFIG /home/krkn/.kube/config
|
||||
|
||||
# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
|
||||
RUN dnf update && dnf install -y --setopt=install_weak_deps=False \
|
||||
git python39 jq yq gettext wget which ipmitool openssh-server &&\
|
||||
git python3.11 jq yq gettext wget which ipmitool openssh-server &&\
|
||||
dnf clean all
|
||||
|
||||
# Virtctl
|
||||
RUN export VERSION=$(curl https://storage.googleapis.com/kubevirt-prow/release/kubevirt/kubevirt/stable.txt) && \
|
||||
wget https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/virtctl-${VERSION}-linux-amd64 && \
|
||||
chmod +x virtctl-${VERSION}-linux-amd64 && sudo mv virtctl-${VERSION}-linux-amd64 /usr/local/bin/virtctl
|
||||
|
||||
# copy oc client binary from oc-build image
|
||||
COPY --from=oc-build /tmp/oc/oc /usr/bin/oc
|
||||
COPY --from=oc-build /tmp/kubevirt/virtctl /usr/bin/virtctl
|
||||
|
||||
# krkn build
|
||||
RUN git clone https://github.com/krkn-chaos/krkn.git /home/krkn/kraken && \
|
||||
@@ -54,10 +63,15 @@ RUN if [ -n "$PR_NUMBER" ]; then git fetch origin pull/${PR_NUMBER}/head:pr-${PR
|
||||
# if it is a TAG trigger checkout the tag
|
||||
RUN if [ -n "$TAG" ]; then git checkout "$TAG";fi
|
||||
|
||||
RUN python3.9 -m ensurepip --upgrade --default-pip
|
||||
RUN python3.9 -m pip install --upgrade pip setuptools==78.1.1
|
||||
RUN pip3.9 install -r requirements.txt
|
||||
RUN pip3.9 install jsonschema
|
||||
RUN python3.11 -m ensurepip --upgrade --default-pip
|
||||
RUN python3.11 -m pip install --upgrade pip setuptools==78.1.1
|
||||
|
||||
# removes the the vulnerable versions of setuptools and pip
|
||||
RUN rm -rf "$(pip cache dir)"
|
||||
RUN rm -rf /tmp/*
|
||||
RUN rm -rf /usr/local/lib/python3.11/ensurepip/_bundled
|
||||
RUN pip3.11 install -r requirements.txt
|
||||
RUN pip3.11 install jsonschema
|
||||
|
||||
LABEL krknctl.title.global="Krkn Base Image"
|
||||
LABEL krknctl.description.global="This is the krkn base image."
|
||||
|
||||
@@ -85,6 +85,24 @@
|
||||
"default": "False",
|
||||
"required": "false"
|
||||
},
|
||||
{
|
||||
"name": "prometheus-url",
|
||||
"short_description": "Prometheus url",
|
||||
"description": "Prometheus url for when running on kuberenetes",
|
||||
"variable": "PROMETHEUS_URL",
|
||||
"type": "string",
|
||||
"default": "",
|
||||
"required": "false"
|
||||
},
|
||||
{
|
||||
"name": "prometheus-token",
|
||||
"short_description": "Prometheus bearer token",
|
||||
"description": "Prometheus bearer token for prometheus url authentication",
|
||||
"variable": "PROMETHEUS_TOKEN",
|
||||
"type": "string",
|
||||
"default": "",
|
||||
"required": "false"
|
||||
},
|
||||
{
|
||||
"name": "uuid",
|
||||
"short_description": "Sets krkn run uuid",
|
||||
@@ -501,6 +519,26 @@
|
||||
"default": "",
|
||||
"required": "false"
|
||||
},
|
||||
{
|
||||
"name": "kubevirt-exit-on-failure",
|
||||
"short_description": "KubeVirt fail if failed vms at end of run",
|
||||
"description": "KubeVirt fails run if vms still have false status",
|
||||
"variable": "KUBE_VIRT_EXIT_ON_FAIL",
|
||||
"type": "enum",
|
||||
"allowed_values": "True,False,true,false",
|
||||
"separator": ",",
|
||||
"default": "False",
|
||||
"required": "false"
|
||||
},
|
||||
{
|
||||
"name": "kubevirt-node-node",
|
||||
"short_description": "KubeVirt node to filter vms on",
|
||||
"description": "Only track VMs in KubeVirt on given node name",
|
||||
"variable": "KUBE_VIRT_NODE_NAME",
|
||||
"type": "string",
|
||||
"default": "",
|
||||
"required": "false"
|
||||
},
|
||||
{
|
||||
"name": "krkn-debug",
|
||||
"short_description": "Krkn debug mode",
|
||||
|
||||
@@ -3,10 +3,16 @@ apiVersion: kind.x-k8s.io/v1alpha4
|
||||
nodes:
|
||||
- role: control-plane
|
||||
extraPortMappings:
|
||||
- containerPort: 30000
|
||||
hostPort: 9090
|
||||
- containerPort: 32766
|
||||
hostPort: 9200
|
||||
- containerPort: 30036
|
||||
hostPort: 8888
|
||||
- containerPort: 30037
|
||||
hostPort: 8889
|
||||
- containerPort: 30080
|
||||
hostPort: 30080
|
||||
- role: control-plane
|
||||
- role: control-plane
|
||||
- role: worker
|
||||
|
||||
@@ -14,7 +14,7 @@ def get_status(config, start_time, end_time):
|
||||
if config["cerberus"]["cerberus_enabled"]:
|
||||
cerberus_url = config["cerberus"]["cerberus_url"]
|
||||
check_application_routes = \
|
||||
config["cerberus"]["check_applicaton_routes"]
|
||||
config["cerberus"]["check_application_routes"]
|
||||
if not cerberus_url:
|
||||
logging.error(
|
||||
"url where Cerberus publishes True/False signal "
|
||||
|
||||
@@ -15,7 +15,7 @@ def invoke(command, timeout=None):
|
||||
|
||||
|
||||
# Invokes a given command and returns the stdout
|
||||
def invoke_no_exit(command, timeout=None):
|
||||
def invoke_no_exit(command, timeout=15):
|
||||
output = ""
|
||||
try:
|
||||
output = subprocess.check_output(command, shell=True, universal_newlines=True, timeout=timeout, stderr=subprocess.DEVNULL)
|
||||
|
||||
@@ -214,7 +214,7 @@ def metrics(
|
||||
end_time=datetime.datetime.fromtimestamp(end_time), granularity=30
|
||||
)
|
||||
else:
|
||||
logging.info('didnt match keys')
|
||||
logging.info("didn't match keys")
|
||||
continue
|
||||
|
||||
for returned_metric in metrics_result:
|
||||
|
||||
@@ -3,7 +3,7 @@ import logging
|
||||
from typing import Optional, TYPE_CHECKING
|
||||
|
||||
from krkn.rollback.config import RollbackConfig
|
||||
from krkn.rollback.handler import execute_rollback_version_files, cleanup_rollback_version_files
|
||||
from krkn.rollback.handler import execute_rollback_version_files
|
||||
|
||||
|
||||
|
||||
@@ -96,24 +96,16 @@ def execute_rollback(telemetry_ocp: "KrknTelemetryOpenshift", run_uuid: Optional
|
||||
:return: Exit code (0 for success, 1 for error)
|
||||
"""
|
||||
logging.info("Executing rollback version files")
|
||||
|
||||
if not run_uuid:
|
||||
logging.error("run_uuid is required for execute-rollback command")
|
||||
return 1
|
||||
|
||||
if not scenario_type:
|
||||
logging.warning("scenario_type is not specified, executing all scenarios in rollback directory")
|
||||
|
||||
logging.info(f"Executing rollback for run_uuid={run_uuid or '*'}, scenario_type={scenario_type or '*'}")
|
||||
|
||||
try:
|
||||
# Execute rollback version files
|
||||
logging.info(f"Executing rollback for run_uuid={run_uuid}, scenario_type={scenario_type or '*'}")
|
||||
execute_rollback_version_files(telemetry_ocp, run_uuid, scenario_type)
|
||||
|
||||
# If execution was successful, cleanup the version files
|
||||
logging.info("Rollback execution completed successfully, cleaning up version files")
|
||||
cleanup_rollback_version_files(run_uuid, scenario_type)
|
||||
|
||||
logging.info("Rollback execution and cleanup completed successfully")
|
||||
execute_rollback_version_files(
|
||||
telemetry_ocp,
|
||||
run_uuid,
|
||||
scenario_type,
|
||||
ignore_auto_rollback_config=True
|
||||
)
|
||||
return 0
|
||||
|
||||
except Exception as e:
|
||||
|
||||
@@ -108,7 +108,76 @@ class RollbackConfig(metaclass=SingletonMeta):
|
||||
return f"{cls().versions_directory}/{rollback_context}"
|
||||
|
||||
@classmethod
|
||||
def search_rollback_version_files(cls, run_uuid: str, scenario_type: str | None = None) -> list[str]:
|
||||
def is_rollback_version_file_format(cls, file_name: str, expected_scenario_type: str | None = None) -> bool:
|
||||
"""
|
||||
Validate the format of a rollback version file name.
|
||||
|
||||
Expected format: <scenario_type>_<timestamp>_<hash_suffix>.py
|
||||
where:
|
||||
- scenario_type: string (can include underscores)
|
||||
- timestamp: integer (nanoseconds since epoch)
|
||||
- hash_suffix: alphanumeric string (length 8)
|
||||
- .py: file extension
|
||||
|
||||
:param file_name: The name of the file to validate.
|
||||
:param expected_scenario_type: The expected scenario type (if any) to validate against.
|
||||
:return: True if the file name matches the expected format, False otherwise.
|
||||
"""
|
||||
if not file_name.endswith(".py"):
|
||||
return False
|
||||
|
||||
parts = file_name.split("_")
|
||||
if len(parts) < 3:
|
||||
return False
|
||||
|
||||
scenario_type = "_".join(parts[:-2])
|
||||
timestamp_str = parts[-2]
|
||||
hash_suffix_with_ext = parts[-1]
|
||||
hash_suffix = hash_suffix_with_ext[:-3]
|
||||
|
||||
if expected_scenario_type and scenario_type != expected_scenario_type:
|
||||
return False
|
||||
|
||||
if not timestamp_str.isdigit():
|
||||
return False
|
||||
|
||||
if len(hash_suffix) != 8 or not hash_suffix.isalnum():
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
@classmethod
|
||||
def is_rollback_context_directory_format(cls, directory_name: str, expected_run_uuid: str | None = None) -> bool:
|
||||
"""
|
||||
Validate the format of a rollback context directory name.
|
||||
|
||||
Expected format: <timestamp>-<run_uuid>
|
||||
where:
|
||||
- timestamp: integer (nanoseconds since epoch)
|
||||
- run_uuid: alphanumeric string
|
||||
|
||||
:param directory_name: The name of the directory to validate.
|
||||
:param expected_run_uuid: The expected run UUID (if any) to validate against.
|
||||
:return: True if the directory name matches the expected format, False otherwise.
|
||||
"""
|
||||
parts = directory_name.split("-", 1)
|
||||
if len(parts) != 2:
|
||||
return False
|
||||
|
||||
timestamp_str, run_uuid = parts
|
||||
|
||||
# Validate timestamp is numeric
|
||||
if not timestamp_str.isdigit():
|
||||
return False
|
||||
|
||||
# Validate run_uuid
|
||||
if expected_run_uuid and expected_run_uuid != run_uuid:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
@classmethod
|
||||
def search_rollback_version_files(cls, run_uuid: str | None = None, scenario_type: str | None = None) -> list[str]:
|
||||
"""
|
||||
Search for rollback version files based on run_uuid and scenario_type.
|
||||
|
||||
@@ -123,34 +192,35 @@ class RollbackConfig(metaclass=SingletonMeta):
|
||||
if not os.path.exists(cls().versions_directory):
|
||||
return []
|
||||
|
||||
rollback_context_directories = [
|
||||
dirname for dirname in os.listdir(cls().versions_directory) if run_uuid in dirname
|
||||
]
|
||||
rollback_context_directories = []
|
||||
for dir in os.listdir(cls().versions_directory):
|
||||
if cls.is_rollback_context_directory_format(dir, run_uuid):
|
||||
rollback_context_directories.append(dir)
|
||||
else:
|
||||
logger.warning(f"Directory {dir} does not match expected pattern of <timestamp>-<run_uuid>")
|
||||
|
||||
if not rollback_context_directories:
|
||||
logger.warning(f"No rollback context directories found for run UUID {run_uuid}")
|
||||
return []
|
||||
|
||||
if len(rollback_context_directories) > 1:
|
||||
logger.warning(
|
||||
f"Expected one directory for run UUID {run_uuid}, found: {rollback_context_directories}"
|
||||
)
|
||||
|
||||
rollback_context_directory = rollback_context_directories[0]
|
||||
|
||||
version_files = []
|
||||
scenario_rollback_versions_directory = os.path.join(
|
||||
cls().versions_directory, rollback_context_directory
|
||||
)
|
||||
for file in os.listdir(scenario_rollback_versions_directory):
|
||||
# assert all files start with scenario_type and end with .py
|
||||
if file.endswith(".py") and (scenario_type is None or file.startswith(scenario_type)):
|
||||
version_files.append(
|
||||
os.path.join(scenario_rollback_versions_directory, file)
|
||||
)
|
||||
else:
|
||||
logger.warning(
|
||||
f"File {file} does not match expected pattern for scenario type {scenario_type}"
|
||||
)
|
||||
for rollback_context_dir in rollback_context_directories:
|
||||
rollback_context_dir = os.path.join(cls().versions_directory, rollback_context_dir)
|
||||
|
||||
for file in os.listdir(rollback_context_dir):
|
||||
# Skip known non-rollback files/directories
|
||||
if file == "__pycache__" or file.endswith(".executed"):
|
||||
continue
|
||||
|
||||
if cls.is_rollback_version_file_format(file, scenario_type):
|
||||
version_files.append(
|
||||
os.path.join(rollback_context_dir, file)
|
||||
)
|
||||
else:
|
||||
logger.warning(
|
||||
f"File {file} does not match expected pattern of <{scenario_type or '*'}>_<timestamp>_<hash_suffix>.py"
|
||||
)
|
||||
return version_files
|
||||
|
||||
@dataclass(frozen=True)
|
||||
|
||||
@@ -117,23 +117,32 @@ def _parse_rollback_module(version_file_path: str) -> tuple[RollbackCallable, Ro
|
||||
return rollback_callable, rollback_content
|
||||
|
||||
|
||||
def execute_rollback_version_files(telemetry_ocp: "KrknTelemetryOpenshift", run_uuid: str, scenario_type: str | None = None):
|
||||
def execute_rollback_version_files(
|
||||
telemetry_ocp: "KrknTelemetryOpenshift",
|
||||
run_uuid: str | None = None,
|
||||
scenario_type: str | None = None,
|
||||
ignore_auto_rollback_config: bool = False
|
||||
):
|
||||
"""
|
||||
Execute rollback version files for the given run_uuid and scenario_type.
|
||||
This function is called when a signal is received to perform rollback operations.
|
||||
|
||||
:param run_uuid: Unique identifier for the run.
|
||||
:param scenario_type: Type of the scenario being rolled back.
|
||||
:param ignore_auto_rollback_config: Flag to ignore auto rollback configuration. Will be set to True for manual execute-rollback calls.
|
||||
"""
|
||||
|
||||
if not ignore_auto_rollback_config and RollbackConfig().auto is False:
|
||||
logger.warning(f"Auto rollback is disabled, skipping execution for run_uuid={run_uuid or '*'}, scenario_type={scenario_type or '*'}")
|
||||
return
|
||||
|
||||
# Get the rollback versions directory
|
||||
version_files = RollbackConfig.search_rollback_version_files(run_uuid, scenario_type)
|
||||
if not version_files:
|
||||
logger.warning(f"Skip execution for run_uuid={run_uuid}, scenario_type={scenario_type or '*'}")
|
||||
logger.warning(f"Skip execution for run_uuid={run_uuid or '*'}, scenario_type={scenario_type or '*'}")
|
||||
return
|
||||
|
||||
# Execute all version files in the directory
|
||||
logger.info(f"Executing rollback version files for run_uuid={run_uuid}, scenario_type={scenario_type or '*'}")
|
||||
logger.info(f"Executing rollback version files for run_uuid={run_uuid or '*'}, scenario_type={scenario_type or '*'}")
|
||||
for version_file in version_files:
|
||||
try:
|
||||
logger.info(f"Executing rollback version file: {version_file}")
|
||||
@@ -144,28 +153,37 @@ def execute_rollback_version_files(telemetry_ocp: "KrknTelemetryOpenshift", run_
|
||||
logger.info('Executing rollback callable...')
|
||||
rollback_callable(rollback_content, telemetry_ocp)
|
||||
logger.info('Rollback completed.')
|
||||
|
||||
logger.info(f"Executed {version_file} successfully.")
|
||||
success = True
|
||||
except Exception as e:
|
||||
success = False
|
||||
logger.error(f"Failed to execute rollback version file {version_file}: {e}")
|
||||
raise
|
||||
|
||||
# Rename the version file with .executed suffix if successful
|
||||
if success:
|
||||
try:
|
||||
executed_file = f"{version_file}.executed"
|
||||
os.rename(version_file, executed_file)
|
||||
logger.info(f"Renamed {version_file} to {executed_file} successfully.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to rename rollback version file {version_file}: {e}")
|
||||
raise
|
||||
|
||||
def cleanup_rollback_version_files(run_uuid: str, scenario_type: str):
|
||||
"""
|
||||
Cleanup rollback version files for the given run_uuid and scenario_type.
|
||||
This function is called to remove the rollback version files after execution.
|
||||
This function is called to remove the rollback version files after successful scenario execution in run_scenarios.
|
||||
|
||||
:param run_uuid: Unique identifier for the run.
|
||||
:param scenario_type: Type of the scenario being rolled back.
|
||||
"""
|
||||
|
||||
|
||||
# Get the rollback versions directory
|
||||
version_files = RollbackConfig.search_rollback_version_files(run_uuid, scenario_type)
|
||||
if not version_files:
|
||||
logger.warning(f"Skip cleanup for run_uuid={run_uuid}, scenario_type={scenario_type or '*'}")
|
||||
return
|
||||
|
||||
|
||||
# Remove all version files in the directory
|
||||
logger.info(f"Cleaning up rollback version files for run_uuid={run_uuid}, scenario_type={scenario_type}")
|
||||
for version_file in version_files:
|
||||
@@ -176,7 +194,6 @@ def cleanup_rollback_version_files(run_uuid: str, scenario_type: str):
|
||||
logger.error(f"Failed to remove rollback version file {version_file}: {e}")
|
||||
raise
|
||||
|
||||
|
||||
class RollbackHandler:
|
||||
def __init__(
|
||||
self,
|
||||
|
||||
@@ -115,14 +115,15 @@ class AbstractScenarioPlugin(ABC):
|
||||
)
|
||||
return_value = 1
|
||||
|
||||
# execute rollback files based on the return value
|
||||
if return_value != 0:
|
||||
if return_value == 0:
|
||||
cleanup_rollback_version_files(
|
||||
run_uuid, scenario_telemetry.scenario_type
|
||||
)
|
||||
else:
|
||||
# execute rollback files based on the return value
|
||||
execute_rollback_version_files(
|
||||
telemetry, run_uuid, scenario_telemetry.scenario_type
|
||||
)
|
||||
cleanup_rollback_version_files(
|
||||
run_uuid, scenario_telemetry.scenario_type
|
||||
)
|
||||
scenario_telemetry.exit_status = return_value
|
||||
scenario_telemetry.end_timestamp = time.time()
|
||||
utils.collect_and_put_ocp_logs(
|
||||
@@ -145,7 +146,7 @@ class AbstractScenarioPlugin(ABC):
|
||||
if scenario_telemetry.exit_status != 0:
|
||||
failed_scenarios.append(scenario_config)
|
||||
scenario_telemetries.append(scenario_telemetry)
|
||||
logging.info(f"wating {wait_duration} before running the next scenario")
|
||||
logging.info(f"waiting {wait_duration} before running the next scenario")
|
||||
time.sleep(wait_duration)
|
||||
return failed_scenarios, scenario_telemetries
|
||||
|
||||
|
||||
@@ -34,6 +34,21 @@ class ApplicationOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
)
|
||||
namespace = get_yaml_item_value(scenario_config, "namespace", "")
|
||||
duration = get_yaml_item_value(scenario_config, "duration", 60)
|
||||
exclude_label = get_yaml_item_value(
|
||||
scenario_config, "exclude_label", None
|
||||
)
|
||||
match_expressions = self._build_exclude_expressions(exclude_label)
|
||||
if match_expressions:
|
||||
# Log the format being used for better clarity
|
||||
format_type = "dict" if isinstance(exclude_label, dict) else "string"
|
||||
logging.info(
|
||||
"Excluding pods with labels (%s format): %s",
|
||||
format_type,
|
||||
", ".join(
|
||||
f"{expr['key']} NOT IN {expr['values']}"
|
||||
for expr in match_expressions
|
||||
),
|
||||
)
|
||||
|
||||
start_time = int(time.time())
|
||||
policy_name = f"krkn-deny-{get_random_string(5)}"
|
||||
@@ -43,18 +58,30 @@ class ApplicationOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: """
|
||||
+ policy_name
|
||||
+ """
|
||||
name: {{ policy_name }}
|
||||
spec:
|
||||
podSelector:
|
||||
matchLabels: {{ pod_selector }}
|
||||
{% if match_expressions %}
|
||||
matchExpressions:
|
||||
{% for expression in match_expressions %}
|
||||
- key: {{ expression["key"] }}
|
||||
operator: NotIn
|
||||
values:
|
||||
{% for value in expression["values"] %}
|
||||
- {{ value }}
|
||||
{% endfor %}
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
policyTypes: {{ traffic_type }}
|
||||
"""
|
||||
)
|
||||
t = Template(network_policy_template)
|
||||
rendered_spec = t.render(
|
||||
pod_selector=pod_selector, traffic_type=traffic_type
|
||||
pod_selector=pod_selector,
|
||||
traffic_type=traffic_type,
|
||||
match_expressions=match_expressions,
|
||||
policy_name=policy_name,
|
||||
)
|
||||
yaml_spec = yaml.safe_load(rendered_spec)
|
||||
# Block the traffic by creating network policy
|
||||
@@ -122,3 +149,63 @@ class ApplicationOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
|
||||
def get_scenario_types(self) -> list[str]:
|
||||
return ["application_outages_scenarios"]
|
||||
|
||||
@staticmethod
|
||||
def _build_exclude_expressions(exclude_label) -> list[dict]:
|
||||
"""
|
||||
Build match expressions for NetworkPolicy from exclude_label.
|
||||
|
||||
Supports multiple formats:
|
||||
- Dict format (preferred, similar to pod_selector): {key1: value1, key2: [value2, value3]}
|
||||
Example: {tier: "gold", env: ["prod", "staging"]}
|
||||
- String format: "key1=value1,key2=value2" or "key1=value1|value2"
|
||||
Example: "tier=gold,env=prod" or "tier=gold|platinum"
|
||||
- List format (list of strings): ["key1=value1", "key2=value2"]
|
||||
Example: ["tier=gold", "env=prod"]
|
||||
Note: List elements must be strings in "key=value" format.
|
||||
|
||||
:param exclude_label: Can be dict, string, list of strings, or None
|
||||
:return: List of match expression dictionaries
|
||||
"""
|
||||
expressions: list[dict] = []
|
||||
|
||||
if not exclude_label:
|
||||
return expressions
|
||||
|
||||
def _append_expr(key: str, values):
|
||||
if not key or values is None:
|
||||
return
|
||||
if not isinstance(values, list):
|
||||
values = [values]
|
||||
cleaned_values = [str(v).strip() for v in values if str(v).strip()]
|
||||
if cleaned_values:
|
||||
expressions.append({"key": key.strip(), "values": cleaned_values})
|
||||
|
||||
if isinstance(exclude_label, dict):
|
||||
for k, v in exclude_label.items():
|
||||
_append_expr(str(k), v)
|
||||
return expressions
|
||||
|
||||
if isinstance(exclude_label, list):
|
||||
selectors = exclude_label
|
||||
else:
|
||||
selectors = [sel.strip() for sel in str(exclude_label).split(",")]
|
||||
|
||||
for selector in selectors:
|
||||
if not selector:
|
||||
continue
|
||||
if "=" not in selector:
|
||||
logging.warning(
|
||||
"exclude_label entry '%s' is invalid, expected key=value format",
|
||||
selector,
|
||||
)
|
||||
continue
|
||||
key, value = selector.split("=", 1)
|
||||
value_items = (
|
||||
[item.strip() for item in value.split("|") if item.strip()]
|
||||
if value
|
||||
else []
|
||||
)
|
||||
_append_expr(key, value_items or value)
|
||||
|
||||
return expressions
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
import logging
|
||||
import random
|
||||
import time
|
||||
import traceback
|
||||
from asyncio import Future
|
||||
import yaml
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
@@ -41,6 +42,7 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
|
||||
logging.info("ContainerScenarioPlugin failed with unrecovered containers")
|
||||
return 1
|
||||
except (RuntimeError, Exception) as e:
|
||||
logging.error("Stack trace:\n%s", traceback.format_exc())
|
||||
logging.error("ContainerScenarioPlugin exiting due to Exception %s" % e)
|
||||
return 1
|
||||
else:
|
||||
@@ -50,7 +52,6 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
|
||||
return ["container_scenarios"]
|
||||
|
||||
def start_monitoring(self, kill_scenario: dict, lib_telemetry: KrknTelemetryOpenshift) -> Future:
|
||||
|
||||
namespace_pattern = f"^{kill_scenario['namespace']}$"
|
||||
label_selector = kill_scenario["label_selector"]
|
||||
recovery_time = kill_scenario["expected_recovery_time"]
|
||||
@@ -70,6 +71,7 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
|
||||
container_name = get_yaml_item_value(cont_scenario, "container_name", "")
|
||||
kill_action = get_yaml_item_value(cont_scenario, "action", 1)
|
||||
kill_count = get_yaml_item_value(cont_scenario, "count", 1)
|
||||
exclude_label = get_yaml_item_value(cont_scenario, "exclude_label", "")
|
||||
if not isinstance(kill_action, int):
|
||||
logging.error(
|
||||
"Please make sure the action parameter defined in the "
|
||||
@@ -91,7 +93,19 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
|
||||
pods = kubecli.get_all_pods(label_selector)
|
||||
else:
|
||||
# Only returns pod names
|
||||
pods = kubecli.list_pods(namespace, label_selector)
|
||||
# Use list_pods with exclude_label parameter to exclude pods
|
||||
if exclude_label:
|
||||
logging.info(
|
||||
"Using exclude_label '%s' to exclude pods from container scenario %s in namespace %s",
|
||||
exclude_label,
|
||||
scenario_name,
|
||||
namespace,
|
||||
)
|
||||
pods = kubecli.list_pods(
|
||||
namespace=namespace,
|
||||
label_selector=label_selector,
|
||||
exclude_label=exclude_label if exclude_label else None
|
||||
)
|
||||
else:
|
||||
if namespace == "*":
|
||||
logging.error(
|
||||
@@ -102,6 +116,7 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
|
||||
# sys.exit(1)
|
||||
raise RuntimeError()
|
||||
pods = pod_names
|
||||
|
||||
# get container and pod name
|
||||
container_pod_list = []
|
||||
for pod in pods:
|
||||
@@ -218,4 +233,5 @@ class ContainerScenarioPlugin(AbstractScenarioPlugin):
|
||||
timer += 5
|
||||
logging.info("Waiting 5 seconds for containers to become ready")
|
||||
time.sleep(5)
|
||||
|
||||
return killed_container_list
|
||||
|
||||
@@ -53,7 +53,7 @@ class HogsScenarioPlugin(AbstractScenarioPlugin):
|
||||
raise Exception("no available nodes to schedule workload")
|
||||
|
||||
if not has_selector:
|
||||
available_nodes = [available_nodes[random.randint(0, len(available_nodes))]]
|
||||
available_nodes = [available_nodes[random.randint(0, len(available_nodes) - 1)]]
|
||||
|
||||
if scenario_config.number_of_nodes and len(available_nodes) > scenario_config.number_of_nodes:
|
||||
available_nodes = random.sample(available_nodes, scenario_config.number_of_nodes)
|
||||
|
||||
@@ -25,6 +25,7 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
super().__init__(scenario_type)
|
||||
self.k8s_client = None
|
||||
self.original_vmi = None
|
||||
self.vmis_list = []
|
||||
|
||||
# Scenario type is handled directly in execute_scenario
|
||||
def get_scenario_types(self) -> list[str]:
|
||||
@@ -54,7 +55,8 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
pods_status.merge(single_pods_status)
|
||||
|
||||
scenario_telemetry.affected_pods = pods_status
|
||||
|
||||
if len(scenario_telemetry.affected_pods.unrecovered) > 0:
|
||||
return 1
|
||||
return 0
|
||||
except Exception as e:
|
||||
logging.error(f"KubeVirt VM Outage scenario failed: {e}")
|
||||
@@ -106,20 +108,20 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
:return: The VMI object if found, None otherwise
|
||||
"""
|
||||
try:
|
||||
vmis = self.custom_object_client.list_namespaced_custom_object(
|
||||
group="kubevirt.io",
|
||||
version="v1",
|
||||
namespace=namespace,
|
||||
plural="virtualmachineinstances",
|
||||
)
|
||||
namespaces = self.k8s_client.list_namespaces_by_regex(namespace)
|
||||
for namespace in namespaces:
|
||||
vmis = self.custom_object_client.list_namespaced_custom_object(
|
||||
group="kubevirt.io",
|
||||
version="v1",
|
||||
namespace=namespace,
|
||||
plural="virtualmachineinstances",
|
||||
)
|
||||
|
||||
vmi_list = []
|
||||
for vmi in vmis.get("items"):
|
||||
vmi_name = vmi.get("metadata",{}).get("name")
|
||||
match = re.match(regex_name, vmi_name)
|
||||
if match:
|
||||
vmi_list.append(vmi)
|
||||
return vmi_list
|
||||
for vmi in vmis.get("items"):
|
||||
vmi_name = vmi.get("metadata",{}).get("name")
|
||||
match = re.match(regex_name, vmi_name)
|
||||
if match:
|
||||
self.vmis_list.append(vmi)
|
||||
except ApiException as e:
|
||||
if e.status == 404:
|
||||
logging.warning(f"VMI {regex_name} not found in namespace {namespace}")
|
||||
@@ -152,21 +154,22 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
logging.error("vm_name parameter is required")
|
||||
return 1
|
||||
self.pods_status = PodsStatus()
|
||||
vmis_list = self.get_vmis(vm_name,namespace)
|
||||
self.get_vmis(vm_name,namespace)
|
||||
for _ in range(kill_count):
|
||||
|
||||
rand_int = random.randint(0, len(vmis_list) - 1)
|
||||
vmi = vmis_list[rand_int]
|
||||
rand_int = random.randint(0, len(self.vmis_list) - 1)
|
||||
vmi = self.vmis_list[rand_int]
|
||||
|
||||
logging.info(f"Starting KubeVirt VM outage scenario for VM: {vm_name} in namespace: {namespace}")
|
||||
vmi_name = vmi.get("metadata").get("name")
|
||||
if not self.validate_environment(vmi_name, namespace):
|
||||
vmi_namespace = vmi.get("metadata").get("namespace")
|
||||
if not self.validate_environment(vmi_name, vmi_namespace):
|
||||
return 1
|
||||
|
||||
vmi = self.get_vmi(vmi_name, namespace)
|
||||
vmi = self.get_vmi(vmi_name, vmi_namespace)
|
||||
self.affected_pod = AffectedPod(
|
||||
pod_name=vmi_name,
|
||||
namespace=namespace,
|
||||
namespace=vmi_namespace,
|
||||
)
|
||||
if not vmi:
|
||||
logging.error(f"VMI {vm_name} not found in namespace {namespace}")
|
||||
@@ -174,12 +177,12 @@ class KubevirtVmOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
|
||||
self.original_vmi = vmi
|
||||
logging.info(f"Captured initial state of VMI: {vm_name}")
|
||||
result = self.delete_vmi(vmi_name, namespace, disable_auto_restart)
|
||||
result = self.delete_vmi(vmi_name, vmi_namespace, disable_auto_restart)
|
||||
if result != 0:
|
||||
self.pods_status.unrecovered.append(self.affected_pod)
|
||||
continue
|
||||
|
||||
result = self.wait_for_running(vmi_name,namespace, timeout)
|
||||
result = self.wait_for_running(vmi_name,vmi_namespace, timeout)
|
||||
if result != 0:
|
||||
self.pods_status.unrecovered.append(self.affected_pod)
|
||||
continue
|
||||
|
||||
@@ -27,7 +27,7 @@ def get_status(config, start_time, end_time):
|
||||
application_routes_status = True
|
||||
if config["cerberus"]["cerberus_enabled"]:
|
||||
cerberus_url = config["cerberus"]["cerberus_url"]
|
||||
check_application_routes = config["cerberus"]["check_applicaton_routes"]
|
||||
check_application_routes = config["cerberus"]["check_application_routes"]
|
||||
if not cerberus_url:
|
||||
logging.error("url where Cerberus publishes True/False signal is not provided.")
|
||||
sys.exit(1)
|
||||
|
||||
@@ -27,7 +27,7 @@ def get_status(config, start_time, end_time):
|
||||
application_routes_status = True
|
||||
if config["cerberus"]["cerberus_enabled"]:
|
||||
cerberus_url = config["cerberus"]["cerberus_url"]
|
||||
check_application_routes = config["cerberus"]["check_applicaton_routes"]
|
||||
check_application_routes = config["cerberus"]["check_application_routes"]
|
||||
if not cerberus_url:
|
||||
logging.error(
|
||||
"url where Cerberus publishes True/False signal is not provided.")
|
||||
|
||||
@@ -36,7 +36,7 @@ def get_test_pods(
|
||||
- pods matching the label on which network policy
|
||||
need to be applied
|
||||
|
||||
namepsace (string)
|
||||
namespace (string)
|
||||
- namespace in which the pod is present
|
||||
|
||||
kubecli (KrknKubernetes)
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
import re
|
||||
from dataclasses import dataclass
|
||||
from enum import Enum
|
||||
from typing import TypeVar, Optional
|
||||
|
||||
|
||||
class NetworkChaosScenarioType(Enum):
|
||||
@@ -9,16 +11,21 @@ class NetworkChaosScenarioType(Enum):
|
||||
|
||||
@dataclass
|
||||
class BaseNetworkChaosConfig:
|
||||
supported_execution = ["serial", "parallel"]
|
||||
id: str
|
||||
image: str
|
||||
wait_duration: int
|
||||
test_duration: int
|
||||
label_selector: str
|
||||
service_account: str
|
||||
taints: list[str]
|
||||
namespace: str
|
||||
instance_count: int
|
||||
execution: str
|
||||
namespace: str
|
||||
taints: list[str]
|
||||
supported_execution = ["serial", "parallel"]
|
||||
interfaces: list[str]
|
||||
target: str
|
||||
ingress: bool
|
||||
egress: bool
|
||||
|
||||
def validate(self) -> list[str]:
|
||||
errors = []
|
||||
@@ -41,12 +48,7 @@ class BaseNetworkChaosConfig:
|
||||
|
||||
@dataclass
|
||||
class NetworkFilterConfig(BaseNetworkChaosConfig):
|
||||
ingress: bool
|
||||
egress: bool
|
||||
interfaces: list[str]
|
||||
target: str
|
||||
ports: list[int]
|
||||
image: str
|
||||
protocols: list[str]
|
||||
|
||||
def validate(self) -> list[str]:
|
||||
@@ -58,3 +60,30 @@ class NetworkFilterConfig(BaseNetworkChaosConfig):
|
||||
f"{self.protocols} contains not allowed protocols only tcp and udp is allowed"
|
||||
)
|
||||
return errors
|
||||
|
||||
|
||||
@dataclass
|
||||
class NetworkChaosConfig(BaseNetworkChaosConfig):
|
||||
latency: Optional[str] = None
|
||||
loss: Optional[str] = None
|
||||
bandwidth: Optional[str] = None
|
||||
force: Optional[bool] = None
|
||||
|
||||
def validate(self) -> list[str]:
|
||||
errors = super().validate()
|
||||
latency_regex = re.compile(r"^(\d+)(us|ms|s)$")
|
||||
bandwidth_regex = re.compile(r"^(\d+)(bit|kbit|mbit|gbit|tbit)$")
|
||||
if self.latency:
|
||||
if not (latency_regex.match(self.latency)):
|
||||
errors.append(
|
||||
"latency must be a number followed by `us` (microseconds) or `ms` (milliseconds), or `s` (seconds)"
|
||||
)
|
||||
if self.bandwidth:
|
||||
if not (bandwidth_regex.match(self.bandwidth)):
|
||||
errors.append(
|
||||
"bandwidth must be a number followed by `bit` `kbit` or `mbit` or `tbit`"
|
||||
)
|
||||
if self.loss:
|
||||
if "%" in self.loss or not self.loss.isdigit():
|
||||
errors.append("loss must be a number followed without the `%` symbol")
|
||||
return errors
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
import abc
|
||||
import logging
|
||||
import queue
|
||||
from typing import Tuple
|
||||
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
@@ -27,7 +28,7 @@ class AbstractNetworkChaosModule(abc.ABC):
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig):
|
||||
def get_config(self) -> Tuple[NetworkChaosScenarioType, BaseNetworkChaosConfig]:
|
||||
"""
|
||||
returns the common subset of settings shared by all the scenarios `BaseNetworkChaosConfig` and the type of Network
|
||||
Chaos Scenario that is running (Pod Scenario or Node Scenario)
|
||||
@@ -41,6 +42,42 @@ class AbstractNetworkChaosModule(abc.ABC):
|
||||
|
||||
pass
|
||||
|
||||
def get_node_targets(self, config: BaseNetworkChaosConfig):
|
||||
if self.base_network_config.label_selector:
|
||||
return self.kubecli.get_lib_kubernetes().list_nodes(
|
||||
self.base_network_config.label_selector
|
||||
)
|
||||
else:
|
||||
if not config.target:
|
||||
raise Exception(
|
||||
"neither node selector nor node_name (target) specified, aborting."
|
||||
)
|
||||
node_info = self.kubecli.get_lib_kubernetes().list_nodes()
|
||||
if config.target not in node_info:
|
||||
raise Exception(f"node {config.target} not found, aborting")
|
||||
|
||||
return [config.target]
|
||||
|
||||
def get_pod_targets(self, config: BaseNetworkChaosConfig):
|
||||
if not config.namespace:
|
||||
raise Exception("namespace not specified, aborting")
|
||||
if self.base_network_config.label_selector:
|
||||
return self.kubecli.get_lib_kubernetes().list_pods(
|
||||
config.namespace, config.label_selector
|
||||
)
|
||||
else:
|
||||
if not config.target:
|
||||
raise Exception(
|
||||
"neither node selector nor node_name (target) specified, aborting."
|
||||
)
|
||||
if not self.kubecli.get_lib_kubernetes().check_if_pod_exists(
|
||||
config.target, config.namespace
|
||||
):
|
||||
raise Exception(
|
||||
f"pod {config.target} not found in namespace {config.namespace}"
|
||||
)
|
||||
return [config.target]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
base_network_config: BaseNetworkChaosConfig,
|
||||
|
||||
@@ -0,0 +1,156 @@
|
||||
import queue
|
||||
import time
|
||||
from typing import Tuple
|
||||
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn_lib.utils import get_random_string
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
NetworkChaosScenarioType,
|
||||
BaseNetworkChaosConfig,
|
||||
NetworkChaosConfig,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
|
||||
AbstractNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import (
|
||||
log_info,
|
||||
setup_network_chaos_ng_scenario,
|
||||
log_error,
|
||||
log_warning,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils_network_chaos import (
|
||||
common_set_limit_rules,
|
||||
common_delete_limit_rules,
|
||||
node_qdisc_is_simple,
|
||||
)
|
||||
|
||||
|
||||
class NodeNetworkChaosModule(AbstractNetworkChaosModule):
|
||||
|
||||
def __init__(self, config: NetworkChaosConfig, kubecli: KrknTelemetryOpenshift):
|
||||
super().__init__(config, kubecli)
|
||||
self.config = config
|
||||
|
||||
def run(self, target: str, error_queue: queue.Queue = None):
|
||||
parallel = False
|
||||
if error_queue:
|
||||
parallel = True
|
||||
try:
|
||||
network_chaos_pod_name = f"node-network-chaos-{get_random_string(5)}"
|
||||
container_name = f"fedora-container-{get_random_string(5)}"
|
||||
|
||||
log_info(
|
||||
f"creating workload to inject network chaos in node {target} network"
|
||||
f"latency:{str(self.config.latency) if self.config.latency else '0'}, "
|
||||
f"packet drop:{str(self.config.loss) if self.config.loss else '0'} "
|
||||
f"bandwidth restriction:{str(self.config.bandwidth) if self.config.bandwidth else '0'} ",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
_, interfaces = setup_network_chaos_ng_scenario(
|
||||
self.config,
|
||||
target,
|
||||
network_chaos_pod_name,
|
||||
container_name,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
target,
|
||||
parallel,
|
||||
True,
|
||||
)
|
||||
|
||||
if len(self.config.interfaces) == 0:
|
||||
if len(interfaces) == 0:
|
||||
log_error(
|
||||
"no network interface found in pod, impossible to execute the network chaos scenario",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
return
|
||||
log_info(
|
||||
f"detected network interfaces: {','.join(interfaces)}",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
else:
|
||||
interfaces = self.config.interfaces
|
||||
|
||||
log_info(
|
||||
f"targeting node {target}",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
complex_config_interfaces = []
|
||||
for interface in interfaces:
|
||||
is_simple = node_qdisc_is_simple(
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
network_chaos_pod_name,
|
||||
self.config.namespace,
|
||||
interface,
|
||||
)
|
||||
if not is_simple:
|
||||
complex_config_interfaces.append(interface)
|
||||
|
||||
if len(complex_config_interfaces) > 0 and not self.config.force:
|
||||
log_warning(
|
||||
f"node already has tc rules set for {','.join(complex_config_interfaces)}, this action might damage the cluster,"
|
||||
"if you want to continue set `force` to True in the node network "
|
||||
"chaos scenario config file and try again"
|
||||
)
|
||||
else:
|
||||
if len(complex_config_interfaces) > 0 and self.config.force:
|
||||
log_warning(
|
||||
f"you are forcing node network configuration override for {','.join(complex_config_interfaces)},"
|
||||
"this action might lead to unpredictable node behaviour, "
|
||||
"you're doing it in your own responsibility"
|
||||
"waiting 10 seconds before continuing"
|
||||
)
|
||||
time.sleep(10)
|
||||
common_set_limit_rules(
|
||||
self.config.egress,
|
||||
self.config.ingress,
|
||||
interfaces,
|
||||
self.config.bandwidth,
|
||||
self.config.latency,
|
||||
self.config.loss,
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
network_chaos_pod_name,
|
||||
self.config.namespace,
|
||||
None,
|
||||
)
|
||||
|
||||
time.sleep(self.config.test_duration)
|
||||
|
||||
log_info("removing tc rules", parallel, network_chaos_pod_name)
|
||||
|
||||
common_delete_limit_rules(
|
||||
self.config.egress,
|
||||
self.config.ingress,
|
||||
interfaces,
|
||||
network_chaos_pod_name,
|
||||
self.config.namespace,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
None,
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
self.kubecli.get_lib_kubernetes().delete_pod(
|
||||
network_chaos_pod_name, self.config.namespace
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
if error_queue is None:
|
||||
raise e
|
||||
else:
|
||||
error_queue.put(str(e))
|
||||
|
||||
def get_config(self) -> Tuple[NetworkChaosScenarioType, BaseNetworkChaosConfig]:
|
||||
return NetworkChaosScenarioType.Node, self.config
|
||||
|
||||
def get_targets(self) -> list[str]:
|
||||
return self.get_node_targets(self.config)
|
||||
@@ -1,5 +1,6 @@
|
||||
import queue
|
||||
import time
|
||||
from typing import Tuple
|
||||
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn_lib.utils import get_random_string
|
||||
@@ -11,14 +12,16 @@ from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
|
||||
AbstractNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import log_info
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import (
|
||||
log_info,
|
||||
deploy_network_chaos_ng_pod,
|
||||
get_pod_default_interface,
|
||||
)
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils_network_filter import (
|
||||
deploy_network_filter_pod,
|
||||
apply_network_rules,
|
||||
clean_network_rules,
|
||||
generate_rules,
|
||||
get_default_interface,
|
||||
)
|
||||
|
||||
|
||||
@@ -41,7 +44,7 @@ class NodeNetworkFilterModule(AbstractNetworkChaosModule):
|
||||
)
|
||||
|
||||
pod_name = f"node-filter-{get_random_string(5)}"
|
||||
deploy_network_filter_pod(
|
||||
deploy_network_chaos_ng_pod(
|
||||
self.config,
|
||||
target,
|
||||
pod_name,
|
||||
@@ -50,7 +53,7 @@ class NodeNetworkFilterModule(AbstractNetworkChaosModule):
|
||||
|
||||
if len(self.config.interfaces) == 0:
|
||||
interfaces = [
|
||||
get_default_interface(
|
||||
get_pod_default_interface(
|
||||
pod_name,
|
||||
self.config.namespace,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
@@ -108,21 +111,8 @@ class NodeNetworkFilterModule(AbstractNetworkChaosModule):
|
||||
super().__init__(config, kubecli)
|
||||
self.config = config
|
||||
|
||||
def get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig):
|
||||
def get_config(self) -> Tuple[NetworkChaosScenarioType, BaseNetworkChaosConfig]:
|
||||
return NetworkChaosScenarioType.Node, self.config
|
||||
|
||||
def get_targets(self) -> list[str]:
|
||||
if self.base_network_config.label_selector:
|
||||
return self.kubecli.get_lib_kubernetes().list_nodes(
|
||||
self.base_network_config.label_selector
|
||||
)
|
||||
else:
|
||||
if not self.config.target:
|
||||
raise Exception(
|
||||
"neither node selector nor node_name (target) specified, aborting."
|
||||
)
|
||||
node_info = self.kubecli.get_lib_kubernetes().list_nodes()
|
||||
if self.config.target not in node_info:
|
||||
raise Exception(f"node {self.config.target} not found, aborting")
|
||||
|
||||
return [self.config.target]
|
||||
return self.get_node_targets(self.config)
|
||||
|
||||
@@ -0,0 +1,159 @@
|
||||
import queue
|
||||
import time
|
||||
from typing import Tuple
|
||||
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn_lib.utils import get_random_string
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
NetworkChaosScenarioType,
|
||||
BaseNetworkChaosConfig,
|
||||
NetworkChaosConfig,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
|
||||
AbstractNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import (
|
||||
log_info,
|
||||
setup_network_chaos_ng_scenario,
|
||||
log_error,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils_network_chaos import (
|
||||
common_set_limit_rules,
|
||||
common_delete_limit_rules,
|
||||
)
|
||||
|
||||
|
||||
class PodNetworkChaosModule(AbstractNetworkChaosModule):
|
||||
|
||||
def __init__(self, config: NetworkChaosConfig, kubecli: KrknTelemetryOpenshift):
|
||||
super().__init__(config, kubecli)
|
||||
self.config = config
|
||||
|
||||
def run(self, target: str, error_queue: queue.Queue = None):
|
||||
parallel = False
|
||||
if error_queue:
|
||||
parallel = True
|
||||
try:
|
||||
network_chaos_pod_name = f"pod-network-chaos-{get_random_string(5)}"
|
||||
container_name = f"fedora-container-{get_random_string(5)}"
|
||||
pod_info = self.kubecli.get_lib_kubernetes().get_pod_info(
|
||||
target, self.config.namespace
|
||||
)
|
||||
|
||||
log_info(
|
||||
f"creating workload to inject network chaos in pod {target} network"
|
||||
f"latency:{str(self.config.latency) if self.config.latency else '0'}, "
|
||||
f"packet drop:{str(self.config.loss) if self.config.loss else '0'} "
|
||||
f"bandwidth restriction:{str(self.config.bandwidth) if self.config.bandwidth else '0'} ",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
if not pod_info:
|
||||
raise Exception(
|
||||
f"impossible to retrieve infos for pod {target} namespace {self.config.namespace}"
|
||||
)
|
||||
|
||||
container_ids, interfaces = setup_network_chaos_ng_scenario(
|
||||
self.config,
|
||||
pod_info.nodeName,
|
||||
network_chaos_pod_name,
|
||||
container_name,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
target,
|
||||
parallel,
|
||||
False,
|
||||
)
|
||||
|
||||
if len(self.config.interfaces) == 0:
|
||||
if len(interfaces) == 0:
|
||||
log_error(
|
||||
"no network interface found in pod, impossible to execute the network chaos scenario",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
return
|
||||
log_info(
|
||||
f"detected network interfaces: {','.join(interfaces)}",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
else:
|
||||
interfaces = self.config.interfaces
|
||||
|
||||
if len(container_ids) == 0:
|
||||
raise Exception(
|
||||
f"impossible to resolve container id for pod {target} namespace {self.config.namespace}"
|
||||
)
|
||||
|
||||
log_info(
|
||||
f"targeting container {container_ids[0]}",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
pids = self.kubecli.get_lib_kubernetes().get_pod_pids(
|
||||
base_pod_name=network_chaos_pod_name,
|
||||
base_pod_namespace=self.config.namespace,
|
||||
base_pod_container_name=container_name,
|
||||
pod_name=target,
|
||||
pod_namespace=self.config.namespace,
|
||||
pod_container_id=container_ids[0],
|
||||
)
|
||||
|
||||
if not pids:
|
||||
raise Exception(f"impossible to resolve pid for pod {target}")
|
||||
|
||||
log_info(
|
||||
f"resolved pids {pids} in node {pod_info.nodeName} for pod {target}",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
common_set_limit_rules(
|
||||
self.config.egress,
|
||||
self.config.ingress,
|
||||
interfaces,
|
||||
self.config.bandwidth,
|
||||
self.config.latency,
|
||||
self.config.loss,
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
network_chaos_pod_name,
|
||||
self.config.namespace,
|
||||
pids,
|
||||
)
|
||||
|
||||
time.sleep(self.config.test_duration)
|
||||
|
||||
log_info("removing tc rules", parallel, network_chaos_pod_name)
|
||||
|
||||
common_delete_limit_rules(
|
||||
self.config.egress,
|
||||
self.config.ingress,
|
||||
interfaces,
|
||||
network_chaos_pod_name,
|
||||
self.config.namespace,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
pids,
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
|
||||
self.kubecli.get_lib_kubernetes().delete_pod(
|
||||
network_chaos_pod_name, self.config.namespace
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
if error_queue is None:
|
||||
raise e
|
||||
else:
|
||||
error_queue.put(str(e))
|
||||
|
||||
def get_config(self) -> Tuple[NetworkChaosScenarioType, BaseNetworkChaosConfig]:
|
||||
return NetworkChaosScenarioType.Pod, self.config
|
||||
|
||||
def get_targets(self) -> list[str]:
|
||||
return self.get_pod_targets(self.config)
|
||||
@@ -1,6 +1,6 @@
|
||||
import logging
|
||||
import queue
|
||||
import time
|
||||
from typing import Tuple
|
||||
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn_lib.utils import get_random_string
|
||||
@@ -13,12 +13,17 @@ from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
|
||||
AbstractNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import log_info, log_error
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import (
|
||||
log_info,
|
||||
log_error,
|
||||
deploy_network_chaos_ng_pod,
|
||||
get_pod_default_interface,
|
||||
setup_network_chaos_ng_scenario,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils_network_filter import (
|
||||
deploy_network_filter_pod,
|
||||
generate_namespaced_rules,
|
||||
apply_network_rules,
|
||||
clean_network_rules_namespaced,
|
||||
generate_namespaced_rules,
|
||||
)
|
||||
|
||||
|
||||
@@ -50,22 +55,18 @@ class PodNetworkFilterModule(AbstractNetworkChaosModule):
|
||||
f"impossible to retrieve infos for pod {self.config.target} namespace {self.config.namespace}"
|
||||
)
|
||||
|
||||
deploy_network_filter_pod(
|
||||
container_ids, interfaces = setup_network_chaos_ng_scenario(
|
||||
self.config,
|
||||
pod_info.nodeName,
|
||||
pod_name,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
container_name,
|
||||
host_network=False,
|
||||
self.kubecli.get_lib_kubernetes(),
|
||||
target,
|
||||
parallel,
|
||||
False,
|
||||
)
|
||||
|
||||
if len(self.config.interfaces) == 0:
|
||||
interfaces = (
|
||||
self.kubecli.get_lib_kubernetes().list_pod_network_interfaces(
|
||||
target, self.config.namespace
|
||||
)
|
||||
)
|
||||
|
||||
if len(interfaces) == 0:
|
||||
log_error(
|
||||
"no network interface found in pod, impossible to execute the network filter scenario",
|
||||
@@ -157,26 +158,8 @@ class PodNetworkFilterModule(AbstractNetworkChaosModule):
|
||||
super().__init__(config, kubecli)
|
||||
self.config = config
|
||||
|
||||
def get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig):
|
||||
def get_config(self) -> Tuple[NetworkChaosScenarioType, BaseNetworkChaosConfig]:
|
||||
return NetworkChaosScenarioType.Pod, self.config
|
||||
|
||||
def get_targets(self) -> list[str]:
|
||||
if not self.config.namespace:
|
||||
raise Exception("namespace not specified, aborting")
|
||||
if self.base_network_config.label_selector:
|
||||
return self.kubecli.get_lib_kubernetes().list_pods(
|
||||
self.config.namespace, self.config.label_selector
|
||||
)
|
||||
else:
|
||||
if not self.config.target:
|
||||
raise Exception(
|
||||
"neither node selector nor node_name (target) specified, aborting."
|
||||
)
|
||||
if not self.kubecli.get_lib_kubernetes().check_if_pod_exists(
|
||||
self.config.target, self.config.namespace
|
||||
):
|
||||
raise Exception(
|
||||
f"pod {self.config.target} not found in namespace {self.config.namespace}"
|
||||
)
|
||||
|
||||
return [self.config.target]
|
||||
return self.get_pod_targets(self.config)
|
||||
|
||||
@@ -1,4 +1,15 @@
|
||||
import logging
|
||||
import os
|
||||
from typing import Tuple
|
||||
|
||||
import yaml
|
||||
from jinja2 import FileSystemLoader, Environment
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.models.k8s import Pod
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
BaseNetworkChaosConfig,
|
||||
)
|
||||
|
||||
|
||||
def log_info(message: str, parallel: bool = False, node_name: str = ""):
|
||||
@@ -29,3 +40,101 @@ def log_warning(message: str, parallel: bool = False, node_name: str = ""):
|
||||
logging.warning(f"[{node_name}]: {message}")
|
||||
else:
|
||||
logging.warning(message)
|
||||
|
||||
|
||||
def deploy_network_chaos_ng_pod(
|
||||
config: BaseNetworkChaosConfig,
|
||||
target_node: str,
|
||||
pod_name: str,
|
||||
kubecli: KrknKubernetes,
|
||||
container_name: str = "fedora",
|
||||
host_network: bool = True,
|
||||
):
|
||||
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
|
||||
env = Environment(loader=file_loader, autoescape=True)
|
||||
pod_template = env.get_template("templates/network-chaos.j2")
|
||||
tolerations = []
|
||||
|
||||
for taint in config.taints:
|
||||
key_value_part, effect = taint.split(":", 1)
|
||||
if "=" in key_value_part:
|
||||
key, value = key_value_part.split("=", 1)
|
||||
operator = "Equal"
|
||||
else:
|
||||
key = key_value_part
|
||||
value = None
|
||||
operator = "Exists"
|
||||
toleration = {
|
||||
"key": key,
|
||||
"operator": operator,
|
||||
"effect": effect,
|
||||
}
|
||||
if value is not None:
|
||||
toleration["value"] = value
|
||||
tolerations.append(toleration)
|
||||
|
||||
pod_body = yaml.safe_load(
|
||||
pod_template.render(
|
||||
pod_name=pod_name,
|
||||
namespace=config.namespace,
|
||||
host_network=host_network,
|
||||
target=target_node,
|
||||
container_name=container_name,
|
||||
workload_image=config.image,
|
||||
taints=tolerations,
|
||||
service_account=config.service_account,
|
||||
)
|
||||
)
|
||||
|
||||
kubecli.create_pod(pod_body, config.namespace, 300)
|
||||
|
||||
|
||||
def get_pod_default_interface(
|
||||
pod_name: str, namespace: str, kubecli: KrknKubernetes
|
||||
) -> str:
|
||||
cmd = "ip r | grep default | awk '/default/ {print $5}'"
|
||||
output = kubecli.exec_cmd_in_pod([cmd], pod_name, namespace)
|
||||
return output.replace("\n", "")
|
||||
|
||||
|
||||
def setup_network_chaos_ng_scenario(
|
||||
config: BaseNetworkChaosConfig,
|
||||
node_name: str,
|
||||
pod_name: str,
|
||||
container_name: str,
|
||||
kubecli: KrknKubernetes,
|
||||
target: str,
|
||||
parallel: bool,
|
||||
host_network: bool,
|
||||
) -> Tuple[list[str], list[str]]:
|
||||
|
||||
deploy_network_chaos_ng_pod(
|
||||
config,
|
||||
node_name,
|
||||
pod_name,
|
||||
kubecli,
|
||||
container_name,
|
||||
host_network=host_network,
|
||||
)
|
||||
|
||||
if len(config.interfaces) == 0:
|
||||
interfaces = [
|
||||
get_pod_default_interface(
|
||||
pod_name,
|
||||
config.namespace,
|
||||
kubecli,
|
||||
)
|
||||
]
|
||||
|
||||
log_info(f"detected default interface {interfaces[0]}", parallel, target)
|
||||
|
||||
else:
|
||||
interfaces = config.interfaces
|
||||
# if not host_network means that the target is a pod so container_ids need to be resolved
|
||||
# otherwise it's not needed
|
||||
if not host_network:
|
||||
container_ids = kubecli.get_container_ids(target, config.namespace)
|
||||
else:
|
||||
container_ids = []
|
||||
|
||||
return container_ids, interfaces
|
||||
|
||||
@@ -0,0 +1,263 @@
|
||||
import subprocess
|
||||
import logging
|
||||
from typing import Optional
|
||||
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.utils import (
|
||||
log_info,
|
||||
log_warning,
|
||||
log_error,
|
||||
)
|
||||
|
||||
ROOT_HANDLE = "100:"
|
||||
CLASS_ID = "100:1"
|
||||
NETEM_HANDLE = "101:"
|
||||
|
||||
|
||||
def run(cmd: list[str], check: bool = True) -> subprocess.CompletedProcess:
|
||||
return subprocess.run(cmd, check=check, text=True, capture_output=True)
|
||||
|
||||
|
||||
def tc_node(args: list[str]) -> subprocess.CompletedProcess:
|
||||
return run(["tc"] + args)
|
||||
|
||||
|
||||
def get_build_tc_tree_commands(devs: list[str]) -> list[str]:
|
||||
tree = []
|
||||
for dev in devs:
|
||||
tree.append(f"tc qdisc add dev {dev} root handle {ROOT_HANDLE} htb default 1")
|
||||
tree.append(
|
||||
f"tc class add dev {dev} parent {ROOT_HANDLE} classid {CLASS_ID} htb rate 1gbit",
|
||||
)
|
||||
tree.append(
|
||||
f"tc qdisc add dev {dev} parent {CLASS_ID} handle {NETEM_HANDLE} netem delay 0ms loss 0%",
|
||||
)
|
||||
|
||||
return tree
|
||||
|
||||
|
||||
def namespaced_tc_commands(pids: list[str], commands: list[str]) -> list[str]:
|
||||
return [
|
||||
f"nsenter --target {pid} --net -- {rule}" for pid in pids for rule in commands
|
||||
]
|
||||
|
||||
|
||||
def get_egress_shaping_comand(
|
||||
devices: list[str],
|
||||
rate_mbit: Optional[str],
|
||||
delay_ms: Optional[str],
|
||||
loss_pct: Optional[str],
|
||||
) -> list[str]:
|
||||
|
||||
rate_commands = []
|
||||
rate = f"{rate_mbit}mbit" if rate_mbit is not None else "1gbit"
|
||||
d = delay_ms if delay_ms is not None else 0
|
||||
l = loss_pct if loss_pct is not None else 0
|
||||
for dev in devices:
|
||||
rate_commands.append(
|
||||
f"tc class change dev {dev} parent {ROOT_HANDLE} classid {CLASS_ID} htb rate {rate}"
|
||||
)
|
||||
rate_commands.append(
|
||||
f"tc qdisc change dev {dev} parent {CLASS_ID} handle {NETEM_HANDLE} netem delay {d}ms loss {l}%"
|
||||
)
|
||||
return rate_commands
|
||||
|
||||
|
||||
def get_clear_egress_shaping_commands(devices: list[str]) -> list[str]:
|
||||
return [f"tc qdisc del dev {dev} root handle {ROOT_HANDLE}" for dev in devices]
|
||||
|
||||
|
||||
def get_ingress_shaping_commands(
|
||||
devs: list[str],
|
||||
rate_mbit: Optional[str],
|
||||
delay_ms: Optional[str],
|
||||
loss_pct: Optional[str],
|
||||
ifb_dev: str = "ifb0",
|
||||
) -> list[str]:
|
||||
|
||||
rate_commands = [
|
||||
f"modprobe ifb || true",
|
||||
f"ip link add {ifb_dev} type ifb || true",
|
||||
f"ip link set {ifb_dev} up || true",
|
||||
]
|
||||
|
||||
for dev in devs:
|
||||
rate_commands.append(f"tc qdisc add dev {dev} handle ffff: ingress || true")
|
||||
|
||||
rate_commands.append(
|
||||
f"tc filter add dev {dev} parent ffff: protocol all prio 1 "
|
||||
f"matchall action mirred egress redirect dev {ifb_dev} || true"
|
||||
)
|
||||
|
||||
rate_commands.append(
|
||||
f"tc qdisc add dev {ifb_dev} root handle {ROOT_HANDLE} htb default 1 || true"
|
||||
)
|
||||
rate_commands.append(
|
||||
f"tc class add dev {ifb_dev} parent {ROOT_HANDLE} classid {CLASS_ID} "
|
||||
f"htb rate {rate_mbit if rate_mbit else '1gbit'} || true"
|
||||
)
|
||||
rate_commands.append(
|
||||
f"tc qdisc add dev {ifb_dev} parent {CLASS_ID} handle {NETEM_HANDLE} "
|
||||
f"netem delay {delay_ms if delay_ms else '0ms'} "
|
||||
f"loss {loss_pct if loss_pct else '0'}% || true"
|
||||
)
|
||||
|
||||
return rate_commands
|
||||
|
||||
|
||||
def get_clear_ingress_shaping_commands(
|
||||
devs: list[str],
|
||||
ifb_dev: str = "ifb0",
|
||||
) -> list[str]:
|
||||
|
||||
cmds: list[str] = []
|
||||
for dev in devs:
|
||||
cmds.append(f"tc qdisc del dev {dev} ingress || true")
|
||||
|
||||
cmds.append(f"tc qdisc del dev {ifb_dev} root handle {ROOT_HANDLE} || true")
|
||||
|
||||
cmds.append(f"ip link set {ifb_dev} down || true")
|
||||
cmds.append(f"ip link del {ifb_dev} || true")
|
||||
|
||||
return cmds
|
||||
|
||||
|
||||
def node_qdisc_is_simple(
|
||||
kubecli: KrknKubernetes, pod_name, namespace: str, interface: str
|
||||
) -> bool:
|
||||
|
||||
result = kubecli.exec_cmd_in_pod(
|
||||
[f"tc qdisc show dev {interface}"], pod_name, namespace
|
||||
)
|
||||
lines = [l for l in result.splitlines() if l.strip()]
|
||||
if len(lines) != 1:
|
||||
return False
|
||||
|
||||
line = lines[0].lower()
|
||||
if "htb" in line or "netem" in line or "clsact" in line:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
|
||||
def common_set_limit_rules(
|
||||
egress: bool,
|
||||
ingress: bool,
|
||||
interfaces: list[str],
|
||||
bandwidth: str,
|
||||
latency: str,
|
||||
loss: str,
|
||||
parallel: bool,
|
||||
target: str,
|
||||
kubecli: KrknKubernetes,
|
||||
network_chaos_pod_name: str,
|
||||
namespace: str,
|
||||
pids: Optional[list[str]] = None,
|
||||
):
|
||||
if egress:
|
||||
build_tree_commands = get_build_tc_tree_commands(interfaces)
|
||||
if pids:
|
||||
build_tree_commands = namespaced_tc_commands(pids, build_tree_commands)
|
||||
egress_shaping_commands = get_egress_shaping_comand(
|
||||
interfaces,
|
||||
bandwidth,
|
||||
latency,
|
||||
loss,
|
||||
)
|
||||
if pids:
|
||||
egress_shaping_commands = namespaced_tc_commands(
|
||||
pids, egress_shaping_commands
|
||||
)
|
||||
error_counter = 0
|
||||
for rule in build_tree_commands:
|
||||
result = kubecli.exec_cmd_in_pod([rule], network_chaos_pod_name, namespace)
|
||||
if not result:
|
||||
log_info(f"created tc tree in pod: {rule}", parallel, target)
|
||||
else:
|
||||
error_counter += 1
|
||||
if len(build_tree_commands) == error_counter:
|
||||
log_error(
|
||||
"failed to apply egress shaping rules on cluster", parallel, target
|
||||
)
|
||||
|
||||
for rule in egress_shaping_commands:
|
||||
result = kubecli.exec_cmd_in_pod([rule], network_chaos_pod_name, namespace)
|
||||
if not result:
|
||||
log_info(f"applied egress shaping rules: {rule}", parallel, target)
|
||||
if ingress:
|
||||
ingress_shaping_commands = get_ingress_shaping_commands(
|
||||
interfaces,
|
||||
bandwidth,
|
||||
latency,
|
||||
loss,
|
||||
)
|
||||
if pids:
|
||||
ingress_shaping_commands = namespaced_tc_commands(
|
||||
pids, ingress_shaping_commands
|
||||
)
|
||||
error_counter = 0
|
||||
for rule in ingress_shaping_commands:
|
||||
|
||||
result = kubecli.exec_cmd_in_pod([rule], network_chaos_pod_name, namespace)
|
||||
if not result:
|
||||
log_info(
|
||||
f"applied ingress shaping rule: {rule}",
|
||||
parallel,
|
||||
network_chaos_pod_name,
|
||||
)
|
||||
else:
|
||||
error_counter += 1
|
||||
|
||||
if len(ingress_shaping_commands) == error_counter:
|
||||
log_error(
|
||||
"failed to apply ingress shaping rules on cluster", parallel, target
|
||||
)
|
||||
|
||||
|
||||
def common_delete_limit_rules(
|
||||
egress: bool,
|
||||
ingress: bool,
|
||||
interfaces: list[str],
|
||||
network_chaos_pod_name: str,
|
||||
network_chaos_namespace: str,
|
||||
kubecli: KrknKubernetes,
|
||||
pids: Optional[list[str]],
|
||||
parallel: bool,
|
||||
target: str,
|
||||
):
|
||||
if egress:
|
||||
clear_commands = get_clear_egress_shaping_commands(interfaces)
|
||||
if pids:
|
||||
clear_commands = namespaced_tc_commands(pids, clear_commands)
|
||||
error_counter = 0
|
||||
for rule in clear_commands:
|
||||
result = kubecli.exec_cmd_in_pod(
|
||||
[rule], network_chaos_pod_name, network_chaos_namespace
|
||||
)
|
||||
if not result:
|
||||
log_info(f"removed egress shaping rule : {rule}", parallel, target)
|
||||
else:
|
||||
error_counter += 1
|
||||
if len(clear_commands) == error_counter:
|
||||
log_error(
|
||||
"failed to remove egress shaping rules on cluster", parallel, target
|
||||
)
|
||||
|
||||
if ingress:
|
||||
clear_commands = get_clear_ingress_shaping_commands(interfaces)
|
||||
if pids:
|
||||
clear_commands = namespaced_tc_commands(pids, clear_commands)
|
||||
error_counter = 0
|
||||
for rule in clear_commands:
|
||||
result = kubecli.exec_cmd_in_pod(
|
||||
[rule], network_chaos_pod_name, network_chaos_namespace
|
||||
)
|
||||
if not result:
|
||||
log_info(f"removed ingress shaping rule: {rule}", parallel, target)
|
||||
else:
|
||||
error_counter += 1
|
||||
if len(clear_commands) == error_counter:
|
||||
log_error(
|
||||
"failed to remove ingress shaping rules on cluster", parallel, target
|
||||
)
|
||||
@@ -1,7 +1,5 @@
|
||||
import os
|
||||
from typing import Tuple
|
||||
|
||||
import yaml
|
||||
from jinja2 import FileSystemLoader, Environment
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import NetworkFilterConfig
|
||||
@@ -10,7 +8,7 @@ from krkn.scenario_plugins.network_chaos_ng.modules.utils import log_info
|
||||
|
||||
def generate_rules(
|
||||
interfaces: list[str], config: NetworkFilterConfig
|
||||
) -> (list[str], list[str]):
|
||||
) -> Tuple[list[str], list[str]]:
|
||||
input_rules = []
|
||||
output_rules = []
|
||||
for interface in interfaces:
|
||||
@@ -29,72 +27,6 @@ def generate_rules(
|
||||
return input_rules, output_rules
|
||||
|
||||
|
||||
def generate_namespaced_rules(
|
||||
interfaces: list[str], config: NetworkFilterConfig, pids: list[str]
|
||||
) -> (list[str], list[str]):
|
||||
namespaced_input_rules: list[str] = []
|
||||
namespaced_output_rules: list[str] = []
|
||||
input_rules, output_rules = generate_rules(interfaces, config)
|
||||
for pid in pids:
|
||||
ns_input_rules = [
|
||||
f"nsenter --target {pid} --net -- {rule}" for rule in input_rules
|
||||
]
|
||||
ns_output_rules = [
|
||||
f"nsenter --target {pid} --net -- {rule}" for rule in output_rules
|
||||
]
|
||||
namespaced_input_rules.extend(ns_input_rules)
|
||||
namespaced_output_rules.extend(ns_output_rules)
|
||||
|
||||
return namespaced_input_rules, namespaced_output_rules
|
||||
|
||||
|
||||
def deploy_network_filter_pod(
|
||||
config: NetworkFilterConfig,
|
||||
target_node: str,
|
||||
pod_name: str,
|
||||
kubecli: KrknKubernetes,
|
||||
container_name: str = "fedora",
|
||||
host_network: bool = True,
|
||||
):
|
||||
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
|
||||
env = Environment(loader=file_loader, autoescape=True)
|
||||
pod_template = env.get_template("templates/network-chaos.j2")
|
||||
tolerations = []
|
||||
|
||||
for taint in config.taints:
|
||||
key_value_part, effect = taint.split(":", 1)
|
||||
if "=" in key_value_part:
|
||||
key, value = key_value_part.split("=", 1)
|
||||
operator = "Equal"
|
||||
else:
|
||||
key = key_value_part
|
||||
value = None
|
||||
operator = "Exists"
|
||||
toleration = {
|
||||
"key": key,
|
||||
"operator": operator,
|
||||
"effect": effect,
|
||||
}
|
||||
if value is not None:
|
||||
toleration["value"] = value
|
||||
tolerations.append(toleration)
|
||||
|
||||
pod_body = yaml.safe_load(
|
||||
pod_template.render(
|
||||
pod_name=pod_name,
|
||||
namespace=config.namespace,
|
||||
host_network=host_network,
|
||||
target=target_node,
|
||||
container_name=container_name,
|
||||
workload_image=config.image,
|
||||
taints=tolerations,
|
||||
service_account=config.service_account,
|
||||
)
|
||||
)
|
||||
|
||||
kubecli.create_pod(pod_body, config.namespace, 300)
|
||||
|
||||
|
||||
def apply_network_rules(
|
||||
kubecli: KrknKubernetes,
|
||||
input_rules: list[str],
|
||||
@@ -153,9 +85,20 @@ def clean_network_rules_namespaced(
|
||||
)
|
||||
|
||||
|
||||
def get_default_interface(
|
||||
pod_name: str, namespace: str, kubecli: KrknKubernetes
|
||||
) -> str:
|
||||
cmd = "ip r | grep default | awk '/default/ {print $5}'"
|
||||
output = kubecli.exec_cmd_in_pod([cmd], pod_name, namespace)
|
||||
return output.replace("\n", "")
|
||||
def generate_namespaced_rules(
|
||||
interfaces: list[str], config: NetworkFilterConfig, pids: list[str]
|
||||
) -> Tuple[list[str], list[str]]:
|
||||
namespaced_input_rules: list[str] = []
|
||||
namespaced_output_rules: list[str] = []
|
||||
input_rules, output_rules = generate_rules(interfaces, config)
|
||||
for pid in pids:
|
||||
ns_input_rules = [
|
||||
f"nsenter --target {pid} --net -- {rule}" for rule in input_rules
|
||||
]
|
||||
ns_output_rules = [
|
||||
f"nsenter --target {pid} --net -- {rule}" for rule in output_rules
|
||||
]
|
||||
namespaced_input_rules.extend(ns_input_rules)
|
||||
namespaced_output_rules.extend(ns_output_rules)
|
||||
|
||||
return namespaced_input_rules, namespaced_output_rules
|
||||
|
||||
@@ -1,17 +1,31 @@
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import NetworkFilterConfig
|
||||
from krkn.scenario_plugins.network_chaos_ng.models import (
|
||||
NetworkFilterConfig,
|
||||
NetworkChaosConfig,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.abstract_network_chaos_module import (
|
||||
AbstractNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.node_network_chaos import (
|
||||
NodeNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.node_network_filter import (
|
||||
NodeNetworkFilterModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.pod_network_chaos import (
|
||||
PodNetworkChaosModule,
|
||||
)
|
||||
from krkn.scenario_plugins.network_chaos_ng.modules.pod_network_filter import (
|
||||
PodNetworkFilterModule,
|
||||
)
|
||||
|
||||
supported_modules = ["node_network_filter", "pod_network_filter"]
|
||||
supported_modules = [
|
||||
"node_network_filter",
|
||||
"pod_network_filter",
|
||||
"pod_network_chaos",
|
||||
"node_network_chaos",
|
||||
]
|
||||
|
||||
|
||||
class NetworkChaosFactory:
|
||||
@@ -26,14 +40,28 @@ class NetworkChaosFactory:
|
||||
raise Exception(f"{config['id']} is not a supported network chaos module")
|
||||
|
||||
if config["id"] == "node_network_filter":
|
||||
config = NetworkFilterConfig(**config)
|
||||
errors = config.validate()
|
||||
scenario_config = NetworkFilterConfig(**config)
|
||||
errors = scenario_config.validate()
|
||||
if len(errors) > 0:
|
||||
raise Exception(f"config validation errors: [{';'.join(errors)}]")
|
||||
return NodeNetworkFilterModule(config, kubecli)
|
||||
return NodeNetworkFilterModule(scenario_config, kubecli)
|
||||
if config["id"] == "pod_network_filter":
|
||||
config = NetworkFilterConfig(**config)
|
||||
errors = config.validate()
|
||||
scenario_config = NetworkFilterConfig(**config)
|
||||
errors = scenario_config.validate()
|
||||
if len(errors) > 0:
|
||||
raise Exception(f"config validation errors: [{';'.join(errors)}]")
|
||||
return PodNetworkFilterModule(config, kubecli)
|
||||
return PodNetworkFilterModule(scenario_config, kubecli)
|
||||
if config["id"] == "pod_network_chaos":
|
||||
scenario_config = NetworkChaosConfig(**config)
|
||||
errors = scenario_config.validate()
|
||||
if len(errors) > 0:
|
||||
raise Exception(f"config validation errors: [{';'.join(errors)}]")
|
||||
return PodNetworkChaosModule(scenario_config, kubecli)
|
||||
if config["id"] == "node_network_chaos":
|
||||
scenario_config = NetworkChaosConfig(**config)
|
||||
errors = scenario_config.validate()
|
||||
if len(errors) > 0:
|
||||
raise Exception(f"config validation errors: [{';'.join(errors)}]")
|
||||
return NodeNetworkChaosModule(scenario_config, kubecli)
|
||||
else:
|
||||
raise Exception(f"invalid network chaos id {config['id']}")
|
||||
|
||||
@@ -18,20 +18,20 @@ class abstract_node_scenarios:
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
pass
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
pass
|
||||
|
||||
# Node scenario to stop and then start the node
|
||||
def node_stop_start_scenario(self, instance_kill_count, node, timeout, duration):
|
||||
def node_stop_start_scenario(self, instance_kill_count, node, timeout, duration, poll_interval):
|
||||
logging.info("Starting node_stop_start_scenario injection")
|
||||
self.node_stop_scenario(instance_kill_count, node, timeout)
|
||||
self.node_stop_scenario(instance_kill_count, node, timeout, poll_interval)
|
||||
logging.info("Waiting for %s seconds before starting the node" % (duration))
|
||||
time.sleep(duration)
|
||||
self.node_start_scenario(instance_kill_count, node, timeout)
|
||||
self.node_start_scenario(instance_kill_count, node, timeout, poll_interval)
|
||||
self.affected_nodes_status.merge_affected_nodes()
|
||||
logging.info("node_stop_start_scenario has been successfully injected!")
|
||||
|
||||
@@ -56,7 +56,7 @@ class abstract_node_scenarios:
|
||||
logging.error("node_disk_detach_attach_scenario failed!")
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
pass
|
||||
|
||||
# Node scenario to reboot the node
|
||||
@@ -76,7 +76,7 @@ class abstract_node_scenarios:
|
||||
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
|
||||
|
||||
logging.info("The kubelet of the node %s has been stopped" % (node))
|
||||
logging.info("stop_kubelet_scenario has been successfuly injected!")
|
||||
logging.info("stop_kubelet_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to stop the kubelet of the node. Encountered following "
|
||||
@@ -108,7 +108,7 @@ class abstract_node_scenarios:
|
||||
)
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli,affected_node)
|
||||
logging.info("The kubelet of the node %s has been restarted" % (node))
|
||||
logging.info("restart_kubelet_scenario has been successfuly injected!")
|
||||
logging.info("restart_kubelet_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to restart the kubelet of the node. Encountered following "
|
||||
@@ -128,7 +128,7 @@ class abstract_node_scenarios:
|
||||
"oc debug node/" + node + " -- chroot /host "
|
||||
"dd if=/dev/urandom of=/proc/sysrq-trigger"
|
||||
)
|
||||
logging.info("node_crash_scenario has been successfuly injected!")
|
||||
logging.info("node_crash_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to crash the node. Encountered following exception: %s. "
|
||||
|
||||
@@ -234,7 +234,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
|
||||
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -260,7 +260,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -286,7 +286,7 @@ class alibaba_node_scenarios(abstract_node_scenarios):
|
||||
|
||||
# Might need to stop and then release the instance
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
|
||||
@@ -77,10 +77,21 @@ class AWS:
|
||||
# until a successful state is reached. An error is returned after 40 failed checks
|
||||
# Setting timeout for consistency with other cloud functions
|
||||
# Wait until the node instance is running
|
||||
def wait_until_running(self, instance_id, timeout=600, affected_node=None):
|
||||
def wait_until_running(self, instance_id, timeout=600, affected_node=None, poll_interval=15):
|
||||
try:
|
||||
start_time = time.time()
|
||||
self.boto_instance.wait_until_running(InstanceIds=[instance_id])
|
||||
if timeout > 0:
|
||||
max_attempts = max(1, int(timeout / poll_interval))
|
||||
else:
|
||||
max_attempts = 40
|
||||
|
||||
self.boto_instance.wait_until_running(
|
||||
InstanceIds=[instance_id],
|
||||
WaiterConfig={
|
||||
'Delay': poll_interval,
|
||||
'MaxAttempts': max_attempts
|
||||
}
|
||||
)
|
||||
end_time = time.time()
|
||||
if affected_node:
|
||||
affected_node.set_affected_node_status("running", end_time - start_time)
|
||||
@@ -93,10 +104,21 @@ class AWS:
|
||||
return False
|
||||
|
||||
# Wait until the node instance is stopped
|
||||
def wait_until_stopped(self, instance_id, timeout=600, affected_node= None):
|
||||
def wait_until_stopped(self, instance_id, timeout=600, affected_node= None, poll_interval=15):
|
||||
try:
|
||||
start_time = time.time()
|
||||
self.boto_instance.wait_until_stopped(InstanceIds=[instance_id])
|
||||
if timeout > 0:
|
||||
max_attempts = max(1, int(timeout / poll_interval))
|
||||
else:
|
||||
max_attempts = 40
|
||||
|
||||
self.boto_instance.wait_until_stopped(
|
||||
InstanceIds=[instance_id],
|
||||
WaiterConfig={
|
||||
'Delay': poll_interval,
|
||||
'MaxAttempts': max_attempts
|
||||
}
|
||||
)
|
||||
end_time = time.time()
|
||||
if affected_node:
|
||||
affected_node.set_affected_node_status("stopped", end_time - start_time)
|
||||
@@ -109,10 +131,21 @@ class AWS:
|
||||
return False
|
||||
|
||||
# Wait until the node instance is terminated
|
||||
def wait_until_terminated(self, instance_id, timeout=600, affected_node= None):
|
||||
def wait_until_terminated(self, instance_id, timeout=600, affected_node= None, poll_interval=15):
|
||||
try:
|
||||
start_time = time.time()
|
||||
self.boto_instance.wait_until_terminated(InstanceIds=[instance_id])
|
||||
if timeout > 0:
|
||||
max_attempts = max(1, int(timeout / poll_interval))
|
||||
else:
|
||||
max_attempts = 40
|
||||
|
||||
self.boto_instance.wait_until_terminated(
|
||||
InstanceIds=[instance_id],
|
||||
WaiterConfig={
|
||||
'Delay': poll_interval,
|
||||
'MaxAttempts': max_attempts
|
||||
}
|
||||
)
|
||||
end_time = time.time()
|
||||
if affected_node:
|
||||
affected_node.set_affected_node_status("terminated", end_time - start_time)
|
||||
@@ -267,7 +300,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -278,7 +311,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
"Starting the node %s with instance ID: %s " % (node, instance_id)
|
||||
)
|
||||
self.aws.start_instances(instance_id)
|
||||
self.aws.wait_until_running(instance_id, affected_node=affected_node)
|
||||
self.aws.wait_until_running(instance_id, timeout=timeout, affected_node=affected_node, poll_interval=poll_interval)
|
||||
if self.node_action_kube_check:
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
|
||||
logging.info(
|
||||
@@ -296,7 +329,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -307,7 +340,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
"Stopping the node %s with instance ID: %s " % (node, instance_id)
|
||||
)
|
||||
self.aws.stop_instances(instance_id)
|
||||
self.aws.wait_until_stopped(instance_id, affected_node=affected_node)
|
||||
self.aws.wait_until_stopped(instance_id, timeout=timeout, affected_node=affected_node, poll_interval=poll_interval)
|
||||
logging.info(
|
||||
"Node with instance ID: %s is in stopped state" % (instance_id)
|
||||
)
|
||||
@@ -324,7 +357,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -336,7 +369,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
% (node, instance_id)
|
||||
)
|
||||
self.aws.terminate_instances(instance_id)
|
||||
self.aws.wait_until_terminated(instance_id, affected_node=affected_node)
|
||||
self.aws.wait_until_terminated(instance_id, timeout=timeout, affected_node=affected_node, poll_interval=poll_interval)
|
||||
for _ in range(timeout):
|
||||
if node not in self.kubecli.list_nodes():
|
||||
break
|
||||
@@ -346,7 +379,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been terminated" % (instance_id)
|
||||
)
|
||||
logging.info("node_termination_scenario has been successfuly injected!")
|
||||
logging.info("node_termination_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to terminate node instance. Encountered following exception:"
|
||||
@@ -375,7 +408,7 @@ class aws_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been rebooted" % (instance_id)
|
||||
)
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -18,8 +18,6 @@ class Azure:
|
||||
logging.info("azure " + str(self))
|
||||
# Acquire a credential object using CLI-based authentication.
|
||||
credentials = DefaultAzureCredential()
|
||||
# az_account = runcommand.invoke("az account list -o yaml")
|
||||
# az_account_yaml = yaml.safe_load(az_account, Loader=yaml.FullLoader)
|
||||
logger = logging.getLogger("azure")
|
||||
logger.setLevel(logging.WARNING)
|
||||
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")
|
||||
@@ -218,7 +216,7 @@ class azure_node_scenarios(abstract_node_scenarios):
|
||||
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -246,7 +244,7 @@ class azure_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -273,7 +271,7 @@ class azure_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
|
||||
@@ -153,7 +153,7 @@ class bm_node_scenarios(abstract_node_scenarios):
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -182,7 +182,7 @@ class bm_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -210,7 +210,7 @@ class bm_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
logging.info("Node termination scenario is not supported on baremetal")
|
||||
|
||||
# Node scenario to reboot the node
|
||||
@@ -229,7 +229,7 @@ class bm_node_scenarios(abstract_node_scenarios):
|
||||
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
|
||||
logging.info("Node with bmc address: %s has been rebooted" % (bmc_addr))
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -11,7 +11,7 @@ def get_node_by_name(node_name_list, kubecli: KrknKubernetes):
|
||||
for node_name in node_name_list:
|
||||
if node_name not in killable_nodes:
|
||||
logging.info(
|
||||
f"Node with provided ${node_name} does not exist or the node might "
|
||||
f"Node with provided {node_name} does not exist or the node might "
|
||||
"be in NotReady state."
|
||||
)
|
||||
return
|
||||
|
||||
@@ -2,49 +2,176 @@ import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
|
||||
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
|
||||
abstract_node_scenarios,
|
||||
)
|
||||
import os
|
||||
import platform
|
||||
import logging
|
||||
import docker
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.models.k8s import AffectedNode, AffectedNodeStatus
|
||||
|
||||
class Docker:
|
||||
"""
|
||||
Container runtime client wrapper supporting both Docker and Podman.
|
||||
|
||||
This class automatically detects and connects to either Docker or Podman
|
||||
container runtimes using the Docker-compatible API. It tries multiple
|
||||
connection methods in order of preference:
|
||||
|
||||
1. Docker Unix socket (unix:///var/run/docker.sock)
|
||||
2. Platform-specific Podman sockets:
|
||||
- macOS: ~/.local/share/containers/podman/machine/podman.sock
|
||||
- Linux rootful: unix:///run/podman/podman.sock
|
||||
- Linux rootless: unix:///run/user/<uid>/podman/podman.sock
|
||||
3. Environment variables (DOCKER_HOST or CONTAINER_HOST)
|
||||
|
||||
The runtime type (docker/podman) is auto-detected and logged for debugging.
|
||||
Supports Kind clusters running on Podman.
|
||||
|
||||
Assisted By: Claude Code
|
||||
"""
|
||||
def __init__(self):
|
||||
self.client = docker.from_env()
|
||||
self.client = None
|
||||
self.runtime = 'unknown'
|
||||
|
||||
|
||||
# Try multiple connection methods in order of preference
|
||||
# Supports both Docker and Podman
|
||||
connection_methods = [
|
||||
('unix:///var/run/docker.sock', 'Docker Unix socket'),
|
||||
]
|
||||
|
||||
# Add platform-specific Podman sockets
|
||||
if platform.system() == 'Darwin': # macOS
|
||||
# On macOS, Podman uses podman-machine with socket typically at:
|
||||
# ~/.local/share/containers/podman/machine/podman.sock
|
||||
# This is often symlinked to /var/run/docker.sock
|
||||
podman_machine_sock = os.path.expanduser('~/.local/share/containers/podman/machine/podman.sock')
|
||||
if os.path.exists(podman_machine_sock):
|
||||
connection_methods.append((f'unix://{podman_machine_sock}', 'Podman machine socket (macOS)'))
|
||||
else: # Linux
|
||||
connection_methods.extend([
|
||||
('unix:///run/podman/podman.sock', 'Podman Unix socket (rootful)'),
|
||||
('unix:///run/user/{uid}/podman/podman.sock', 'Podman Unix socket (rootless)'),
|
||||
])
|
||||
|
||||
# Always try from_env as last resort
|
||||
connection_methods.append(('from_env', 'Environment variables (DOCKER_HOST/CONTAINER_HOST)'))
|
||||
|
||||
for method, description in connection_methods:
|
||||
try:
|
||||
# Handle rootless Podman socket path with {uid} placeholder
|
||||
if '{uid}' in method:
|
||||
uid = os.getuid()
|
||||
method = method.format(uid=uid)
|
||||
logging.info(f'Attempting to connect using {description}: {method}')
|
||||
|
||||
if method == 'from_env':
|
||||
logging.info(f'Attempting to connect using {description}')
|
||||
self.client = docker.from_env()
|
||||
else:
|
||||
logging.info(f'Attempting to connect using {description}: {method}')
|
||||
self.client = docker.DockerClient(base_url=method)
|
||||
|
||||
# Test the connection
|
||||
self.client.ping()
|
||||
|
||||
# Detect runtime type
|
||||
try:
|
||||
version_info = self.client.version()
|
||||
version_str = version_info.get('Version', '')
|
||||
if 'podman' in version_str.lower():
|
||||
self.runtime = 'podman'
|
||||
else:
|
||||
self.runtime = 'docker'
|
||||
logging.debug(f'Runtime version info: {version_str}')
|
||||
except Exception as version_err:
|
||||
logging.warning(f'Could not detect runtime version: {version_err}')
|
||||
self.runtime = 'unknown'
|
||||
|
||||
logging.info(f'Successfully connected to {self.runtime} using {description}')
|
||||
|
||||
# Log available containers for debugging
|
||||
try:
|
||||
containers = self.client.containers.list(all=True)
|
||||
logging.info(f'Found {len(containers)} total containers')
|
||||
for container in containers[:5]: # Log first 5
|
||||
logging.debug(f' Container: {container.name} ({container.status})')
|
||||
except Exception as list_err:
|
||||
logging.warning(f'Could not list containers: {list_err}')
|
||||
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
logging.warning(f'Failed to connect using {description}: {e}')
|
||||
continue
|
||||
|
||||
if self.client is None:
|
||||
error_msg = 'Failed to initialize container runtime client (Docker/Podman) with any connection method'
|
||||
logging.error(error_msg)
|
||||
logging.error('Attempted connection methods:')
|
||||
for method, desc in connection_methods:
|
||||
logging.error(f' - {desc}: {method}')
|
||||
raise RuntimeError(error_msg)
|
||||
|
||||
logging.info(f'Container runtime client initialized successfully: {self.runtime}')
|
||||
|
||||
def get_container_id(self, node_name):
|
||||
"""Get the container ID for a given node name."""
|
||||
container = self.client.containers.get(node_name)
|
||||
logging.info(f'Found {self.runtime} container for node {node_name}: {container.id}')
|
||||
return container.id
|
||||
|
||||
# Start the node instance
|
||||
def start_instances(self, node_name):
|
||||
"""Start a container instance (works with both Docker and Podman)."""
|
||||
logging.info(f'Starting {self.runtime} container for node: {node_name}')
|
||||
container = self.client.containers.get(node_name)
|
||||
container.start()
|
||||
logging.info(f'Container {container.id} started successfully')
|
||||
|
||||
# Stop the node instance
|
||||
def stop_instances(self, node_name):
|
||||
"""Stop a container instance (works with both Docker and Podman)."""
|
||||
logging.info(f'Stopping {self.runtime} container for node: {node_name}')
|
||||
container = self.client.containers.get(node_name)
|
||||
container.stop()
|
||||
logging.info(f'Container {container.id} stopped successfully')
|
||||
|
||||
# Reboot the node instance
|
||||
def reboot_instances(self, node_name):
|
||||
"""Restart a container instance (works with both Docker and Podman)."""
|
||||
logging.info(f'Restarting {self.runtime} container for node: {node_name}')
|
||||
container = self.client.containers.get(node_name)
|
||||
container.restart()
|
||||
logging.info(f'Container {container.id} restarted successfully')
|
||||
|
||||
# Terminate the node instance
|
||||
def terminate_instances(self, node_name):
|
||||
"""Stop and remove a container instance (works with both Docker and Podman)."""
|
||||
logging.info(f'Terminating {self.runtime} container for node: {node_name}')
|
||||
container = self.client.containers.get(node_name)
|
||||
container.stop()
|
||||
container.remove()
|
||||
logging.info(f'Container {container.id} terminated and removed successfully')
|
||||
|
||||
|
||||
class docker_node_scenarios(abstract_node_scenarios):
|
||||
"""
|
||||
Node chaos scenarios for containerized Kubernetes nodes.
|
||||
|
||||
Supports both Docker and Podman container runtimes. This class provides
|
||||
methods to inject chaos into Kubernetes nodes running as containers
|
||||
(e.g., Kind clusters, Podman-based clusters).
|
||||
"""
|
||||
def __init__(self, kubecli: KrknKubernetes, node_action_kube_check: bool, affected_nodes_status: AffectedNodeStatus):
|
||||
logging.info('Initializing docker_node_scenarios (supports Docker and Podman)')
|
||||
super().__init__(kubecli, node_action_kube_check, affected_nodes_status)
|
||||
self.docker = Docker()
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
logging.info(f'Node scenarios initialized successfully using {self.docker.runtime} runtime')
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -71,7 +198,7 @@ class docker_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -97,7 +224,7 @@ class docker_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
try:
|
||||
logging.info("Starting node_termination_scenario injection")
|
||||
@@ -110,7 +237,7 @@ class docker_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with container ID: %s has been terminated" % (container_id)
|
||||
)
|
||||
logging.info("node_termination_scenario has been successfuly injected!")
|
||||
logging.info("node_termination_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to terminate node instance. Encountered following exception:"
|
||||
@@ -137,7 +264,7 @@ class docker_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with container ID: %s has been rebooted" % (container_id)
|
||||
)
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -227,7 +227,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -257,7 +257,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -286,7 +286,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -309,7 +309,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been terminated" % instance_id
|
||||
)
|
||||
logging.info("node_termination_scenario has been successfuly injected!")
|
||||
logging.info("node_termination_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to terminate node instance. Encountered following exception:"
|
||||
@@ -341,7 +341,7 @@ class gcp_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been rebooted" % instance_id
|
||||
)
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
|
||||
@@ -18,21 +18,21 @@ class general_node_scenarios(abstract_node_scenarios):
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
logging.info(
|
||||
"Node start is not set up yet for this cloud type, "
|
||||
"no action is going to be taken"
|
||||
)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
logging.info(
|
||||
"Node stop is not set up yet for this cloud type,"
|
||||
" no action is going to be taken"
|
||||
)
|
||||
|
||||
# Node scenario to terminate the node
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_termination_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
logging.info(
|
||||
"Node termination is not set up yet for this cloud type, "
|
||||
"no action is going to be taken"
|
||||
|
||||
@@ -284,7 +284,7 @@ class ibm_node_scenarios(abstract_node_scenarios):
|
||||
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
instance_id = self.ibmcloud.get_instance_id( node)
|
||||
affected_node = AffectedNode(node, node_id=instance_id)
|
||||
@@ -317,7 +317,7 @@ class ibm_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
instance_id = self.ibmcloud.get_instance_id(node)
|
||||
for _ in range(instance_kill_count):
|
||||
@@ -327,14 +327,20 @@ class ibm_node_scenarios(abstract_node_scenarios):
|
||||
vm_stopped = self.ibmcloud.stop_instances(instance_id)
|
||||
if vm_stopped:
|
||||
self.ibmcloud.wait_until_stopped(instance_id, timeout, affected_node)
|
||||
logging.info(
|
||||
"Node with instance ID: %s is in stopped state" % node
|
||||
)
|
||||
logging.info(
|
||||
"node_stop_scenario has been successfully injected!"
|
||||
)
|
||||
logging.info(
|
||||
"Node with instance ID: %s is in stopped state" % node
|
||||
)
|
||||
logging.info(
|
||||
"node_stop_scenario has been successfully injected!"
|
||||
)
|
||||
else:
|
||||
logging.error(
|
||||
"Failed to stop node instance %s. Stop command failed." % instance_id
|
||||
)
|
||||
raise Exception("Stop command failed for instance %s" % instance_id)
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
except Exception as e:
|
||||
logging.error("Failed to stop node instance. Test Failed")
|
||||
logging.error("Failed to stop node instance. Test Failed: %s" % str(e))
|
||||
logging.error("node_stop_scenario injection failed!")
|
||||
|
||||
|
||||
@@ -345,28 +351,35 @@ class ibm_node_scenarios(abstract_node_scenarios):
|
||||
affected_node = AffectedNode(node, node_id=instance_id)
|
||||
logging.info("Starting node_reboot_scenario injection")
|
||||
logging.info("Rebooting the node %s " % (node))
|
||||
self.ibmcloud.reboot_instances(instance_id)
|
||||
self.ibmcloud.wait_until_rebooted(instance_id, timeout, affected_node)
|
||||
if self.node_action_kube_check:
|
||||
nodeaction.wait_for_unknown_status(
|
||||
node, timeout, affected_node
|
||||
vm_rebooted = self.ibmcloud.reboot_instances(instance_id)
|
||||
if vm_rebooted:
|
||||
self.ibmcloud.wait_until_rebooted(instance_id, timeout, affected_node)
|
||||
if self.node_action_kube_check:
|
||||
nodeaction.wait_for_unknown_status(
|
||||
node, timeout, self.kubecli, affected_node
|
||||
)
|
||||
nodeaction.wait_for_ready_status(
|
||||
node, timeout, self.kubecli, affected_node
|
||||
)
|
||||
logging.info(
|
||||
"Node with instance ID: %s has rebooted successfully" % node
|
||||
)
|
||||
nodeaction.wait_for_ready_status(
|
||||
node, timeout, affected_node
|
||||
logging.info(
|
||||
"node_reboot_scenario has been successfully injected!"
|
||||
)
|
||||
logging.info(
|
||||
"Node with instance ID: %s has rebooted successfully" % node
|
||||
)
|
||||
logging.info(
|
||||
"node_reboot_scenario has been successfully injected!"
|
||||
)
|
||||
else:
|
||||
logging.error(
|
||||
"Failed to reboot node instance %s. Reboot command failed." % instance_id
|
||||
)
|
||||
raise Exception("Reboot command failed for instance %s" % instance_id)
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
except Exception as e:
|
||||
logging.error("Failed to reboot node instance. Test Failed")
|
||||
logging.error("Failed to reboot node instance. Test Failed: %s" % str(e))
|
||||
logging.error("node_reboot_scenario injection failed!")
|
||||
|
||||
|
||||
def node_terminate_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_terminate_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
instance_id = self.ibmcloud.get_instance_id(node)
|
||||
for _ in range(instance_kill_count):
|
||||
@@ -383,7 +396,8 @@ class ibm_node_scenarios(abstract_node_scenarios):
|
||||
logging.info(
|
||||
"node_terminate_scenario has been successfully injected!"
|
||||
)
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
except Exception as e:
|
||||
logging.error("Failed to terminate node instance. Test Failed")
|
||||
logging.error("Failed to terminate node instance. Test Failed: %s" % str(e))
|
||||
logging.error("node_terminate_scenario injection failed!")
|
||||
|
||||
|
||||
@@ -298,7 +298,7 @@ class ibmcloud_power_node_scenarios(abstract_node_scenarios):
|
||||
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
instance_id = self.ibmcloud_power.get_instance_id( node)
|
||||
affected_node = AffectedNode(node, node_id=instance_id)
|
||||
@@ -331,7 +331,7 @@ class ibmcloud_power_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
instance_id = self.ibmcloud_power.get_instance_id(node)
|
||||
for _ in range(instance_kill_count):
|
||||
@@ -380,7 +380,7 @@ class ibmcloud_power_node_scenarios(abstract_node_scenarios):
|
||||
logging.error("node_reboot_scenario injection failed!")
|
||||
|
||||
|
||||
def node_terminate_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_terminate_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
instance_id = self.ibmcloud_power.get_instance_id(node)
|
||||
for _ in range(instance_kill_count):
|
||||
|
||||
@@ -196,13 +196,11 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
exclude_nodes = common_node_functions.get_node(
|
||||
exclude_label, 0, kubecli
|
||||
)
|
||||
|
||||
for node in nodes:
|
||||
if node in exclude_nodes:
|
||||
logging.info(
|
||||
f"excluding node {node} with exclude label {exclude_nodes}"
|
||||
)
|
||||
nodes.remove(node)
|
||||
if exclude_nodes:
|
||||
logging.info(
|
||||
f"excluding nodes {exclude_nodes} with exclude label {exclude_label}"
|
||||
)
|
||||
nodes = [node for node in nodes if node not in exclude_nodes]
|
||||
|
||||
# GCP api doesn't support multiprocessing calls, will only actually run 1
|
||||
if parallel_nodes:
|
||||
@@ -236,7 +234,7 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
# Get the scenario specifics for running action nodes
|
||||
run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)
|
||||
duration = get_yaml_item_value(node_scenario, "duration", 120)
|
||||
|
||||
poll_interval = get_yaml_item_value(node_scenario, "poll_interval", 15)
|
||||
timeout = get_yaml_item_value(node_scenario, "timeout", 120)
|
||||
service = get_yaml_item_value(node_scenario, "service", "")
|
||||
soft_reboot = get_yaml_item_value(node_scenario, "soft_reboot", False)
|
||||
@@ -254,19 +252,19 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
else:
|
||||
if action == "node_start_scenario":
|
||||
node_scenario_object.node_start_scenario(
|
||||
run_kill_count, single_node, timeout
|
||||
run_kill_count, single_node, timeout, poll_interval
|
||||
)
|
||||
elif action == "node_stop_scenario":
|
||||
node_scenario_object.node_stop_scenario(
|
||||
run_kill_count, single_node, timeout
|
||||
run_kill_count, single_node, timeout, poll_interval
|
||||
)
|
||||
elif action == "node_stop_start_scenario":
|
||||
node_scenario_object.node_stop_start_scenario(
|
||||
run_kill_count, single_node, timeout, duration
|
||||
run_kill_count, single_node, timeout, duration, poll_interval
|
||||
)
|
||||
elif action == "node_termination_scenario":
|
||||
node_scenario_object.node_termination_scenario(
|
||||
run_kill_count, single_node, timeout
|
||||
run_kill_count, single_node, timeout, poll_interval
|
||||
)
|
||||
elif action == "node_reboot_scenario":
|
||||
node_scenario_object.node_reboot_scenario(
|
||||
|
||||
@@ -122,7 +122,7 @@ class openstack_node_scenarios(abstract_node_scenarios):
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
# Node scenario to start the node
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -147,7 +147,7 @@ class openstack_node_scenarios(abstract_node_scenarios):
|
||||
self.affected_nodes_status.affected_nodes.append(affected_node)
|
||||
|
||||
# Node scenario to stop the node
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
try:
|
||||
@@ -184,7 +184,7 @@ class openstack_node_scenarios(abstract_node_scenarios):
|
||||
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli, affected_node)
|
||||
nodeaction.wait_for_ready_status(node, timeout, self.kubecli, affected_node)
|
||||
logging.info("Node with instance name: %s has been rebooted" % (node))
|
||||
logging.info("node_reboot_scenario has been successfuly injected!")
|
||||
logging.info("node_reboot_scenario has been successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to reboot node instance. Encountered following exception:"
|
||||
@@ -249,7 +249,7 @@ class openstack_node_scenarios(abstract_node_scenarios):
|
||||
node_ip.strip(), service, ssh_private_key, timeout
|
||||
)
|
||||
logging.info("Service status checked on %s" % (node_ip))
|
||||
logging.info("Check service status is successfuly injected!")
|
||||
logging.info("Check service status is successfully injected!")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to check service status. Encountered following exception:"
|
||||
|
||||
@@ -389,7 +389,7 @@ class vmware_node_scenarios(abstract_node_scenarios):
|
||||
self.vsphere = vSphere()
|
||||
self.node_action_kube_check = node_action_kube_check
|
||||
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_start_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
@@ -409,7 +409,7 @@ class vmware_node_scenarios(abstract_node_scenarios):
|
||||
f"node_start_scenario injection failed! " f"Error was: {str(e)}"
|
||||
)
|
||||
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_stop_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
@@ -456,7 +456,7 @@ class vmware_node_scenarios(abstract_node_scenarios):
|
||||
)
|
||||
|
||||
|
||||
def node_terminate_scenario(self, instance_kill_count, node, timeout):
|
||||
def node_terminate_scenario(self, instance_kill_count, node, timeout, poll_interval):
|
||||
try:
|
||||
for _ in range(instance_kill_count):
|
||||
affected_node = AffectedNode(node)
|
||||
|
||||
@@ -2,7 +2,7 @@ import logging
|
||||
import random
|
||||
import time
|
||||
from asyncio import Future
|
||||
|
||||
import traceback
|
||||
import yaml
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.k8s.pod_monitor import select_and_monitor_by_namespace_pattern_and_label, \
|
||||
@@ -11,6 +11,7 @@ from krkn_lib.k8s.pod_monitor import select_and_monitor_by_namespace_pattern_and
|
||||
from krkn.scenario_plugins.pod_disruption.models.models import InputParams
|
||||
from krkn_lib.models.telemetry import ScenarioTelemetry
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn_lib.models.pod_monitor.models import PodsSnapshot
|
||||
from datetime import datetime
|
||||
from dataclasses import dataclass
|
||||
|
||||
@@ -40,10 +41,27 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
kill_scenario_config,
|
||||
lib_telemetry
|
||||
)
|
||||
self.killing_pods(
|
||||
ret = self.killing_pods(
|
||||
kill_scenario_config, lib_telemetry.get_lib_kubernetes()
|
||||
)
|
||||
# returning 2 if configuration issue and exiting immediately
|
||||
if ret > 1:
|
||||
# Cancel the monitoring future since killing_pods already failed
|
||||
logging.info("Cancelling pod monitoring future")
|
||||
future_snapshot.cancel()
|
||||
# Wait for the future to finish (monitoring will stop when stop_event is set)
|
||||
while not future_snapshot.done():
|
||||
logging.info("waiting for future to finish")
|
||||
time.sleep(1)
|
||||
logging.info("future snapshot cancelled and finished")
|
||||
# Get the snapshot result (even if cancelled, it will have partial data)
|
||||
snapshot = future_snapshot.result()
|
||||
result = snapshot.get_pods_status()
|
||||
scenario_telemetry.affected_pods = result
|
||||
|
||||
logging.error("PodDisruptionScenarioPlugin failed during setup" + str(result))
|
||||
return 1
|
||||
|
||||
snapshot = future_snapshot.result()
|
||||
result = snapshot.get_pods_status()
|
||||
scenario_telemetry.affected_pods = result
|
||||
@@ -51,7 +69,12 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
logging.info("PodDisruptionScenarioPlugin failed with unrecovered pods")
|
||||
return 1
|
||||
|
||||
if ret > 0:
|
||||
logging.info("PodDisruptionScenarioPlugin failed")
|
||||
return 1
|
||||
|
||||
except (RuntimeError, Exception) as e:
|
||||
logging.error("Stack trace:\n%s", traceback.format_exc())
|
||||
logging.error("PodDisruptionScenariosPlugin exiting due to Exception %s" % e)
|
||||
return 1
|
||||
else:
|
||||
@@ -128,7 +151,7 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
field_selector=combined_field_selector
|
||||
)
|
||||
|
||||
def get_pods(self, name_pattern, label_selector, namespace, kubecli: KrknKubernetes, field_selector: str = None, node_label_selector: str = None, node_names: list = None, quiet: bool = False):
|
||||
def get_pods(self, name_pattern, label_selector, namespace, kubecli: KrknKubernetes, field_selector: str = None, node_label_selector: str = None, node_names: list = None):
|
||||
if label_selector and name_pattern:
|
||||
logging.error('Only, one of name pattern or label pattern can be specified')
|
||||
return []
|
||||
@@ -139,8 +162,7 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
|
||||
# If specific node names are provided, make multiple calls with field selector
|
||||
if node_names:
|
||||
if not quiet:
|
||||
logging.info(f"Targeting pods on {len(node_names)} specific nodes")
|
||||
logging.debug(f"Targeting pods on {len(node_names)} specific nodes")
|
||||
all_pods = []
|
||||
for node_name in node_names:
|
||||
pods = self._select_pods_with_field_selector(
|
||||
@@ -150,8 +172,7 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
if pods:
|
||||
all_pods.extend(pods)
|
||||
|
||||
if not quiet:
|
||||
logging.info(f"Found {len(all_pods)} target pods across {len(node_names)} nodes")
|
||||
logging.debug(f"Found {len(all_pods)} target pods across {len(node_names)} nodes")
|
||||
return all_pods
|
||||
|
||||
# Node label selector approach - use field selectors
|
||||
@@ -159,11 +180,10 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
# Get nodes matching the label selector first
|
||||
nodes_with_label = kubecli.list_nodes(label_selector=node_label_selector)
|
||||
if not nodes_with_label:
|
||||
logging.info(f"No nodes found with label selector: {node_label_selector}")
|
||||
logging.debug(f"No nodes found with label selector: {node_label_selector}")
|
||||
return []
|
||||
|
||||
if not quiet:
|
||||
logging.info(f"Targeting pods on {len(nodes_with_label)} nodes with label: {node_label_selector}")
|
||||
logging.debug(f"Targeting pods on {len(nodes_with_label)} nodes with label: {node_label_selector}")
|
||||
# Use field selector for each node
|
||||
all_pods = []
|
||||
for node_name in nodes_with_label:
|
||||
@@ -174,8 +194,7 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
if pods:
|
||||
all_pods.extend(pods)
|
||||
|
||||
if not quiet:
|
||||
logging.info(f"Found {len(all_pods)} target pods across {len(nodes_with_label)} nodes")
|
||||
logging.debug(f"Found {len(all_pods)} target pods across {len(nodes_with_label)} nodes")
|
||||
return all_pods
|
||||
|
||||
# Standard pod selection (no node targeting)
|
||||
@@ -185,37 +204,40 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
|
||||
def killing_pods(self, config: InputParams, kubecli: KrknKubernetes):
|
||||
# region Select target pods
|
||||
try:
|
||||
namespace = config.namespace_pattern
|
||||
if not namespace:
|
||||
logging.error('Namespace pattern must be specified')
|
||||
|
||||
pods = self.get_pods(config.name_pattern,config.label_selector,config.namespace_pattern, kubecli, field_selector="status.phase=Running", node_label_selector=config.node_label_selector, node_names=config.node_names)
|
||||
exclude_pods = set()
|
||||
if config.exclude_label:
|
||||
_exclude_pods = self.get_pods("",config.exclude_label,config.namespace_pattern, kubecli, field_selector="status.phase=Running", node_label_selector=config.node_label_selector, node_names=config.node_names)
|
||||
for pod in _exclude_pods:
|
||||
exclude_pods.add(pod[0])
|
||||
|
||||
|
||||
pods_count = len(pods)
|
||||
if len(pods) < config.kill:
|
||||
logging.error("Not enough pods match the criteria, expected {} but found only {} pods".format(
|
||||
config.kill, len(pods)))
|
||||
return 1
|
||||
|
||||
namespace = config.namespace_pattern
|
||||
if not namespace:
|
||||
logging.error('Namespace pattern must be specified')
|
||||
random.shuffle(pods)
|
||||
for i in range(config.kill):
|
||||
pod = pods[i]
|
||||
logging.info(pod)
|
||||
if pod[0] in exclude_pods:
|
||||
logging.info(f"Excluding {pod[0]} from chaos")
|
||||
else:
|
||||
logging.info(f'Deleting pod {pod[0]}')
|
||||
kubecli.delete_pod(pod[0], pod[1])
|
||||
|
||||
return_val = self.wait_for_pods(config.label_selector,config.name_pattern,config.namespace_pattern, pods_count, config.duration, config.timeout, kubecli, config.node_label_selector, config.node_names)
|
||||
except Exception as e:
|
||||
raise(e)
|
||||
|
||||
pods = self.get_pods(config.name_pattern,config.label_selector,config.namespace_pattern, kubecli, field_selector="status.phase=Running", node_label_selector=config.node_label_selector, node_names=config.node_names)
|
||||
exclude_pods = set()
|
||||
if config.exclude_label:
|
||||
_exclude_pods = self.get_pods("",config.exclude_label,config.namespace_pattern, kubecli, field_selector="status.phase=Running", node_label_selector=config.node_label_selector, node_names=config.node_names)
|
||||
for pod in _exclude_pods:
|
||||
exclude_pods.add(pod[0])
|
||||
|
||||
|
||||
pods_count = len(pods)
|
||||
if len(pods) < config.kill:
|
||||
logging.error("Not enough pods match the criteria, expected {} but found only {} pods".format(
|
||||
config.kill, len(pods)))
|
||||
return 1
|
||||
|
||||
random.shuffle(pods)
|
||||
for i in range(config.kill):
|
||||
pod = pods[i]
|
||||
logging.info(pod)
|
||||
if pod[0] in exclude_pods:
|
||||
logging.info(f"Excluding {pod[0]} from chaos")
|
||||
else:
|
||||
logging.info(f'Deleting pod {pod[0]}')
|
||||
kubecli.delete_pod(pod[0], pod[1])
|
||||
|
||||
self.wait_for_pods(config.label_selector,config.name_pattern,config.namespace_pattern, pods_count, config.duration, config.timeout, kubecli, config.node_label_selector, config.node_names)
|
||||
return 0
|
||||
return return_val
|
||||
|
||||
def wait_for_pods(
|
||||
self, label_selector, pod_name, namespace, pod_count, duration, wait_timeout, kubecli: KrknKubernetes, node_label_selector, node_names
|
||||
@@ -224,10 +246,10 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
start_time = datetime.now()
|
||||
|
||||
while not timeout:
|
||||
pods = self.get_pods(name_pattern=pod_name, label_selector=label_selector,namespace=namespace, field_selector="status.phase=Running", kubecli=kubecli, node_label_selector=node_label_selector, node_names=node_names, quiet=True)
|
||||
pods = self.get_pods(name_pattern=pod_name, label_selector=label_selector,namespace=namespace, field_selector="status.phase=Running", kubecli=kubecli, node_label_selector=node_label_selector, node_names=node_names)
|
||||
if pod_count == len(pods):
|
||||
return
|
||||
|
||||
return 0
|
||||
|
||||
time.sleep(duration)
|
||||
|
||||
now_time = datetime.now()
|
||||
@@ -236,4 +258,5 @@ class PodDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
if time_diff.seconds > wait_timeout:
|
||||
logging.error("timeout while waiting for pods to come up")
|
||||
return 1
|
||||
|
||||
return 0
|
||||
|
||||
@@ -1,3 +1,5 @@
|
||||
import base64
|
||||
import json
|
||||
import logging
|
||||
import random
|
||||
import re
|
||||
@@ -11,9 +13,12 @@ from krkn_lib.utils import get_yaml_item_value, log_exception
|
||||
|
||||
from krkn import cerberus, utils
|
||||
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
|
||||
from krkn.rollback.config import RollbackContent
|
||||
from krkn.rollback.handler import set_rollback_context_decorator
|
||||
|
||||
|
||||
class PvcScenarioPlugin(AbstractScenarioPlugin):
|
||||
@set_rollback_context_decorator
|
||||
def run(
|
||||
self,
|
||||
run_uuid: str,
|
||||
@@ -229,6 +234,24 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
|
||||
logging.info("\n" + str(response))
|
||||
if str(file_name).lower() in str(response).lower():
|
||||
logging.info("%s file successfully created" % (str(full_path)))
|
||||
|
||||
# Set rollback callable to ensure temp file cleanup on failure or interruption
|
||||
rollback_data = {
|
||||
"pod_name": pod_name,
|
||||
"container_name": container_name,
|
||||
"mount_path": mount_path,
|
||||
"file_name": file_name,
|
||||
"full_path": full_path,
|
||||
}
|
||||
json_str = json.dumps(rollback_data)
|
||||
encoded_data = base64.b64encode(json_str.encode('utf-8')).decode('utf-8')
|
||||
self.rollback_handler.set_rollback_callable(
|
||||
self.rollback_temp_file,
|
||||
RollbackContent(
|
||||
namespace=namespace,
|
||||
resource_identifier=encoded_data,
|
||||
),
|
||||
)
|
||||
else:
|
||||
logging.error(
|
||||
"PvcScenarioPlugin Failed to create tmp file with %s size"
|
||||
@@ -313,5 +336,57 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
|
||||
res = int(value[:-2]) * (base**exp)
|
||||
return res
|
||||
|
||||
@staticmethod
|
||||
def rollback_temp_file(
|
||||
rollback_content: RollbackContent,
|
||||
lib_telemetry: KrknTelemetryOpenshift,
|
||||
):
|
||||
"""Rollback function to remove temporary file created during the PVC scenario.
|
||||
|
||||
:param rollback_content: Rollback content containing namespace and encoded rollback data in resource_identifier.
|
||||
:param lib_telemetry: Instance of KrknTelemetryOpenshift for Kubernetes operations.
|
||||
"""
|
||||
try:
|
||||
namespace = rollback_content.namespace
|
||||
import base64 # noqa
|
||||
import json # noqa
|
||||
decoded_data = base64.b64decode(rollback_content.resource_identifier.encode('utf-8')).decode('utf-8')
|
||||
rollback_data = json.loads(decoded_data)
|
||||
pod_name = rollback_data["pod_name"]
|
||||
container_name = rollback_data["container_name"]
|
||||
full_path = rollback_data["full_path"]
|
||||
file_name = rollback_data["file_name"]
|
||||
mount_path = rollback_data["mount_path"]
|
||||
|
||||
logging.info(
|
||||
f"Rolling back PVC scenario: removing temp file {full_path} from pod {pod_name} in namespace {namespace}"
|
||||
)
|
||||
|
||||
# Remove the temp file
|
||||
command = "rm -f %s" % (str(full_path))
|
||||
logging.info("Remove temp file from the PVC command:\n %s" % command)
|
||||
response = lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
|
||||
[command], pod_name, namespace, container_name
|
||||
)
|
||||
logging.info("\n" + str(response))
|
||||
# Verify removal
|
||||
command = "ls -lh %s" % (str(mount_path))
|
||||
logging.info("Check temp file is removed command:\n %s" % command)
|
||||
response = lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
|
||||
[command], pod_name, namespace, container_name
|
||||
)
|
||||
logging.info("\n" + str(response))
|
||||
|
||||
if not (str(file_name).lower() in str(response).lower()):
|
||||
logging.info("Temp file successfully removed during rollback")
|
||||
else:
|
||||
logging.warning(
|
||||
f"Temp file {file_name} may still exist after rollback attempt"
|
||||
)
|
||||
|
||||
logging.info("PVC scenario rollback completed successfully.")
|
||||
except Exception as e:
|
||||
logging.error(f"Failed to rollback PVC scenario temp file: {e}")
|
||||
|
||||
def get_scenario_types(self) -> list[str]:
|
||||
return ["pvc_scenarios"]
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
import importlib
|
||||
import inspect
|
||||
import pkgutil
|
||||
from typing import Type, Tuple, Optional
|
||||
from typing import Type, Tuple, Optional, Any
|
||||
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
|
||||
|
||||
|
||||
@@ -11,7 +11,7 @@ class ScenarioPluginNotFound(Exception):
|
||||
|
||||
class ScenarioPluginFactory:
|
||||
|
||||
loaded_plugins: dict[str, any] = {}
|
||||
loaded_plugins: dict[str, Any] = {}
|
||||
failed_plugins: list[Tuple[str, str, str]] = []
|
||||
package_name = None
|
||||
|
||||
|
||||
@@ -209,7 +209,7 @@ class ServiceDisruptionScenarioPlugin(AbstractScenarioPlugin):
|
||||
try:
|
||||
statefulsets = kubecli.get_all_statefulset(namespace)
|
||||
for statefulset in statefulsets:
|
||||
logging.info("Deleting statefulsets" + statefulsets)
|
||||
logging.info("Deleting statefulset" + statefulset)
|
||||
kubecli.delete_statefulset(statefulset, namespace)
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
|
||||
@@ -1,13 +1,17 @@
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
|
||||
import base64
|
||||
import yaml
|
||||
from krkn_lib.models.telemetry import ScenarioTelemetry
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
|
||||
from krkn_lib.utils import get_yaml_item_value
|
||||
from krkn.rollback.config import RollbackContent
|
||||
from krkn.rollback.handler import set_rollback_context_decorator
|
||||
|
||||
class ServiceHijackingScenarioPlugin(AbstractScenarioPlugin):
|
||||
@set_rollback_context_decorator
|
||||
def run(
|
||||
self,
|
||||
run_uuid: str,
|
||||
@@ -78,6 +82,24 @@ class ServiceHijackingScenarioPlugin(AbstractScenarioPlugin):
|
||||
|
||||
logging.info(f"service: {service_name} successfully patched!")
|
||||
logging.info(f"original service manifest:\n\n{yaml.dump(original_service)}")
|
||||
|
||||
# Set rollback callable to ensure service restoration and pod cleanup on failure or interruption
|
||||
rollback_data = {
|
||||
"service_name": service_name,
|
||||
"service_namespace": service_namespace,
|
||||
"original_selectors": original_service["spec"]["selector"],
|
||||
"webservice_pod_name": webservice.pod_name,
|
||||
}
|
||||
json_str = json.dumps(rollback_data)
|
||||
encoded_data = base64.b64encode(json_str.encode("utf-8")).decode("utf-8")
|
||||
self.rollback_handler.set_rollback_callable(
|
||||
self.rollback_service_hijacking,
|
||||
RollbackContent(
|
||||
namespace=service_namespace,
|
||||
resource_identifier=encoded_data,
|
||||
),
|
||||
)
|
||||
|
||||
logging.info(f"waiting {chaos_duration} before restoring the service")
|
||||
time.sleep(chaos_duration)
|
||||
selectors = [
|
||||
@@ -106,5 +128,63 @@ class ServiceHijackingScenarioPlugin(AbstractScenarioPlugin):
|
||||
)
|
||||
return 1
|
||||
|
||||
@staticmethod
|
||||
def rollback_service_hijacking(
|
||||
rollback_content: RollbackContent,
|
||||
lib_telemetry: KrknTelemetryOpenshift,
|
||||
):
|
||||
"""Rollback function to restore original service selectors and cleanup hijacker pod.
|
||||
|
||||
:param rollback_content: Rollback content containing namespace and encoded rollback data in resource_identifier.
|
||||
:param lib_telemetry: Instance of KrknTelemetryOpenshift for Kubernetes operations.
|
||||
"""
|
||||
try:
|
||||
namespace = rollback_content.namespace
|
||||
import json # noqa
|
||||
import base64 # noqa
|
||||
# Decode rollback data from resource_identifier
|
||||
decoded_data = base64.b64decode(rollback_content.resource_identifier.encode("utf-8")).decode("utf-8")
|
||||
rollback_data = json.loads(decoded_data)
|
||||
service_name = rollback_data["service_name"]
|
||||
service_namespace = rollback_data["service_namespace"]
|
||||
original_selectors = rollback_data["original_selectors"]
|
||||
webservice_pod_name = rollback_data["webservice_pod_name"]
|
||||
|
||||
logging.info(
|
||||
f"Rolling back service hijacking: restoring service {service_name} in namespace {service_namespace}"
|
||||
)
|
||||
|
||||
# Restore original service selectors
|
||||
selectors = [
|
||||
"=".join([key, original_selectors[key]])
|
||||
for key in original_selectors.keys()
|
||||
]
|
||||
logging.info(f"Restoring original service selectors: {selectors}")
|
||||
|
||||
restored_service = lib_telemetry.get_lib_kubernetes().replace_service_selector(
|
||||
selectors, service_name, service_namespace
|
||||
)
|
||||
|
||||
if restored_service is None:
|
||||
logging.warning(
|
||||
f"Failed to restore service {service_name} in namespace {service_namespace}"
|
||||
)
|
||||
else:
|
||||
logging.info(f"Successfully restored service {service_name}")
|
||||
|
||||
# Delete the hijacker pod
|
||||
logging.info(f"Deleting hijacker pod: {webservice_pod_name}")
|
||||
try:
|
||||
lib_telemetry.get_lib_kubernetes().delete_pod(
|
||||
webservice_pod_name, service_namespace
|
||||
)
|
||||
logging.info(f"Successfully deleted hijacker pod: {webservice_pod_name}")
|
||||
except Exception as e:
|
||||
logging.warning(f"Failed to delete hijacker pod {webservice_pod_name}: {e}")
|
||||
|
||||
logging.info("Service hijacking rollback completed successfully.")
|
||||
except Exception as e:
|
||||
logging.error(f"Failed to rollback service hijacking: {e}")
|
||||
|
||||
def get_scenario_types(self) -> list[str]:
|
||||
return ["service_hijacking_scenarios"]
|
||||
|
||||
@@ -1,3 +1,5 @@
|
||||
import base64
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
@@ -7,9 +9,12 @@ from krkn_lib import utils as krkn_lib_utils
|
||||
from krkn_lib.models.telemetry import ScenarioTelemetry
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
|
||||
from krkn.rollback.config import RollbackContent
|
||||
from krkn.rollback.handler import set_rollback_context_decorator
|
||||
|
||||
|
||||
class SynFloodScenarioPlugin(AbstractScenarioPlugin):
|
||||
@set_rollback_context_decorator
|
||||
def run(
|
||||
self,
|
||||
run_uuid: str,
|
||||
@@ -50,6 +55,16 @@ class SynFloodScenarioPlugin(AbstractScenarioPlugin):
|
||||
config["attacker-nodes"],
|
||||
)
|
||||
pod_names.append(pod_name)
|
||||
|
||||
# Set rollback callable to ensure pod cleanup on failure or interruption
|
||||
rollback_data = base64.b64encode(json.dumps(pod_names).encode('utf-8')).decode('utf-8')
|
||||
self.rollback_handler.set_rollback_callable(
|
||||
self.rollback_syn_flood_pods,
|
||||
RollbackContent(
|
||||
namespace=config["namespace"],
|
||||
resource_identifier=rollback_data,
|
||||
),
|
||||
)
|
||||
|
||||
logging.info("waiting all the attackers to finish:")
|
||||
did_finish = False
|
||||
@@ -137,3 +152,23 @@ class SynFloodScenarioPlugin(AbstractScenarioPlugin):
|
||||
|
||||
def get_scenario_types(self) -> list[str]:
|
||||
return ["syn_flood_scenarios"]
|
||||
|
||||
@staticmethod
|
||||
def rollback_syn_flood_pods(rollback_content: RollbackContent, lib_telemetry: KrknTelemetryOpenshift):
|
||||
"""
|
||||
Rollback function to delete syn flood pods.
|
||||
|
||||
:param rollback_content: Rollback content containing namespace and resource_identifier.
|
||||
:param lib_telemetry: Instance of KrknTelemetryOpenshift for Kubernetes operations
|
||||
"""
|
||||
try:
|
||||
namespace = rollback_content.namespace
|
||||
import base64 # noqa
|
||||
import json # noqa
|
||||
pod_names = json.loads(base64.b64decode(rollback_content.resource_identifier.encode('utf-8')).decode('utf-8'))
|
||||
logging.info(f"Rolling back syn flood pods: {pod_names} in namespace: {namespace}")
|
||||
for pod_name in pod_names:
|
||||
lib_telemetry.get_lib_kubernetes().delete_pod(pod_name, namespace)
|
||||
logging.info("Rollback of syn flood pods completed successfully.")
|
||||
except Exception as e:
|
||||
logging.error(f"Failed to rollback syn flood pods: {e}")
|
||||
@@ -43,7 +43,7 @@ class TimeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
cerberus.publish_kraken_status(
|
||||
krkn_config, not_reset, start_time, end_time
|
||||
)
|
||||
except (RuntimeError, Exception):
|
||||
except (RuntimeError, Exception) as e:
|
||||
logging.error(
|
||||
f"TimeActionsScenarioPlugin scenario {scenario} failed with exception: {e}"
|
||||
)
|
||||
@@ -144,6 +144,10 @@ class TimeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
node_names = scenario["object_name"]
|
||||
elif "label_selector" in scenario.keys() and scenario["label_selector"]:
|
||||
node_names = kubecli.list_nodes(scenario["label_selector"])
|
||||
# going to filter out nodes with the exclude_label if it is provided
|
||||
if "exclude_label" in scenario.keys() and scenario["exclude_label"]:
|
||||
excluded_nodes = kubecli.list_nodes(scenario["exclude_label"])
|
||||
node_names = [node for node in node_names if node not in excluded_nodes]
|
||||
for node in node_names:
|
||||
self.skew_node(node, scenario["action"], kubecli)
|
||||
logging.info("Reset date/time on node " + str(node))
|
||||
@@ -189,6 +193,10 @@ class TimeActionsScenarioPlugin(AbstractScenarioPlugin):
|
||||
counter += 1
|
||||
elif "label_selector" in scenario.keys() and scenario["label_selector"]:
|
||||
pod_names = kubecli.get_all_pods(scenario["label_selector"])
|
||||
# and here filter out the pods with exclude_label if it is provided
|
||||
if "exclude_label" in scenario.keys() and scenario["exclude_label"]:
|
||||
excluded_pods = kubecli.get_all_pods(scenario["exclude_label"])
|
||||
pod_names = [pod for pod in pod_names if pod not in excluded_pods]
|
||||
|
||||
if len(pod_names) == 0:
|
||||
logging.info(
|
||||
|
||||
@@ -140,7 +140,7 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
network_association_ids[0], acl_id
|
||||
)
|
||||
|
||||
# capture the orginal_acl_id, created_acl_id and
|
||||
# capture the original_acl_id, created_acl_id and
|
||||
# new association_id to use during the recovery
|
||||
ids[new_association_id] = original_acl_id
|
||||
|
||||
@@ -156,7 +156,7 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
|
||||
new_association_id, original_acl_id
|
||||
)
|
||||
logging.info(
|
||||
"Wating for 60 seconds to make sure " "the changes are in place"
|
||||
"Waiting for 60 seconds to make sure " "the changes are in place"
|
||||
)
|
||||
time.sleep(60)
|
||||
|
||||
|
||||
@@ -1,10 +1,17 @@
|
||||
import json
|
||||
import tempfile
|
||||
import unittest
|
||||
from pathlib import Path
|
||||
from unittest.mock import Mock, patch
|
||||
|
||||
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
|
||||
from krkn.scenario_plugins.scenario_plugin_factory import ScenarioPluginFactory
|
||||
from krkn.scenario_plugins.native.plugins import PluginStep, Plugins, PLUGINS
|
||||
from krkn.tests.test_classes.correct_scenario_plugin import (
|
||||
CorrectScenarioPlugin,
|
||||
)
|
||||
import yaml
|
||||
|
||||
|
||||
|
||||
class TestPluginFactory(unittest.TestCase):
|
||||
@@ -108,3 +115,437 @@ class TestPluginFactory(unittest.TestCase):
|
||||
self.assertEqual(
|
||||
message, "scenario plugin folder cannot contain `scenario` or `plugin` word"
|
||||
)
|
||||
|
||||
|
||||
class TestPluginStep(unittest.TestCase):
|
||||
"""Test cases for PluginStep class"""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures"""
|
||||
# Create a mock schema
|
||||
self.mock_schema = Mock()
|
||||
self.mock_schema.id = "test_step"
|
||||
|
||||
# Create mock output
|
||||
mock_output = Mock()
|
||||
mock_output.serialize = Mock(return_value={"status": "success", "message": "test"})
|
||||
self.mock_schema.outputs = {
|
||||
"success": mock_output,
|
||||
"error": mock_output
|
||||
}
|
||||
|
||||
self.plugin_step = PluginStep(
|
||||
schema=self.mock_schema,
|
||||
error_output_ids=["error"]
|
||||
)
|
||||
|
||||
def test_render_output(self):
|
||||
"""Test render_output method"""
|
||||
output_id = "success"
|
||||
output_data = {"status": "success", "message": "test output"}
|
||||
|
||||
result = self.plugin_step.render_output(output_id, output_data)
|
||||
|
||||
# Verify it returns a JSON string
|
||||
self.assertIsInstance(result, str)
|
||||
|
||||
# Verify it can be parsed as JSON
|
||||
parsed = json.loads(result)
|
||||
self.assertEqual(parsed["output_id"], output_id)
|
||||
self.assertIn("output_data", parsed)
|
||||
|
||||
|
||||
class TestPlugins(unittest.TestCase):
|
||||
"""Test cases for Plugins class"""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures"""
|
||||
# Create mock steps with proper id attribute
|
||||
self.mock_step1 = Mock()
|
||||
self.mock_step1.id = "step1"
|
||||
|
||||
self.mock_step2 = Mock()
|
||||
self.mock_step2.id = "step2"
|
||||
|
||||
self.plugin_step1 = PluginStep(schema=self.mock_step1, error_output_ids=["error"])
|
||||
self.plugin_step2 = PluginStep(schema=self.mock_step2, error_output_ids=["error"])
|
||||
|
||||
def test_init_with_valid_steps(self):
|
||||
"""Test Plugins initialization with valid steps"""
|
||||
plugins = Plugins([self.plugin_step1, self.plugin_step2])
|
||||
|
||||
self.assertEqual(len(plugins.steps_by_id), 2)
|
||||
self.assertIn("step1", plugins.steps_by_id)
|
||||
self.assertIn("step2", plugins.steps_by_id)
|
||||
|
||||
def test_init_with_duplicate_step_ids(self):
|
||||
"""Test Plugins initialization with duplicate step IDs raises exception"""
|
||||
# Create two steps with the same ID
|
||||
duplicate_step = PluginStep(schema=self.mock_step1, error_output_ids=["error"])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
Plugins([self.plugin_step1, duplicate_step])
|
||||
|
||||
self.assertIn("Duplicate step ID", str(context.exception))
|
||||
|
||||
def test_unserialize_scenario(self):
|
||||
"""Test unserialize_scenario method"""
|
||||
# Create a temporary YAML file
|
||||
test_data = [
|
||||
{"id": "test_step", "config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([self.plugin_step1])
|
||||
result = plugins.unserialize_scenario(temp_file)
|
||||
|
||||
self.assertIsInstance(result, list)
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
def test_run_with_invalid_scenario_not_list(self):
|
||||
"""Test run method with scenario that is not a list"""
|
||||
# Create a temporary YAML file with dict instead of list
|
||||
test_data = {"id": "test_step", "config": {"param": "value"}}
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([self.plugin_step1])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
self.assertIn("expected list", str(context.exception))
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
def test_run_with_invalid_entry_not_dict(self):
|
||||
"""Test run method with entry that is not a dict"""
|
||||
# Create a temporary YAML file with list of strings instead of dicts
|
||||
test_data = ["invalid", "entries"]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([self.plugin_step1])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
self.assertIn("expected a list of dict's", str(context.exception))
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
def test_run_with_missing_id_field(self):
|
||||
"""Test run method with missing 'id' field"""
|
||||
# Create a temporary YAML file with missing id
|
||||
test_data = [
|
||||
{"config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([self.plugin_step1])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
self.assertIn("missing 'id' field", str(context.exception))
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
def test_run_with_missing_config_field(self):
|
||||
"""Test run method with missing 'config' field"""
|
||||
# Create a temporary YAML file with missing config
|
||||
test_data = [
|
||||
{"id": "step1"}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([self.plugin_step1])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
self.assertIn("missing 'config' field", str(context.exception))
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
def test_run_with_invalid_step_id(self):
|
||||
"""Test run method with invalid step ID"""
|
||||
# Create a proper mock schema with string ID
|
||||
mock_schema = Mock()
|
||||
mock_schema.id = "valid_step"
|
||||
plugin_step = PluginStep(schema=mock_schema, error_output_ids=["error"])
|
||||
|
||||
# Create a temporary YAML file with unknown step ID
|
||||
test_data = [
|
||||
{"id": "unknown_step", "config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([plugin_step])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
self.assertIn("Invalid step", str(context.exception))
|
||||
self.assertIn("expected one of", str(context.exception))
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
@patch('krkn.scenario_plugins.native.plugins.logging')
|
||||
def test_run_with_valid_scenario(self, mock_logging):
|
||||
"""Test run method with valid scenario"""
|
||||
# Create mock schema with all necessary attributes
|
||||
mock_schema = Mock()
|
||||
mock_schema.id = "test_step"
|
||||
|
||||
# Mock input schema
|
||||
mock_input = Mock()
|
||||
mock_input.properties = {}
|
||||
mock_input.unserialize = Mock(return_value=Mock(spec=[]))
|
||||
mock_schema.input = mock_input
|
||||
|
||||
# Mock output
|
||||
mock_output = Mock()
|
||||
mock_output.serialize = Mock(return_value={"status": "success"})
|
||||
mock_schema.outputs = {"success": mock_output}
|
||||
|
||||
# Mock schema call
|
||||
mock_schema.return_value = ("success", {"status": "success"})
|
||||
|
||||
plugin_step = PluginStep(schema=mock_schema, error_output_ids=["error"])
|
||||
|
||||
# Create a temporary YAML file
|
||||
test_data = [
|
||||
{"id": "test_step", "config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([plugin_step])
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
# Verify schema was called
|
||||
mock_schema.assert_called_once()
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
@patch('krkn.scenario_plugins.native.plugins.logging')
|
||||
def test_run_with_error_output(self, mock_logging):
|
||||
"""Test run method when step returns error output"""
|
||||
# Create mock schema with error output
|
||||
mock_schema = Mock()
|
||||
mock_schema.id = "test_step"
|
||||
|
||||
# Mock input schema
|
||||
mock_input = Mock()
|
||||
mock_input.properties = {}
|
||||
mock_input.unserialize = Mock(return_value=Mock(spec=[]))
|
||||
mock_schema.input = mock_input
|
||||
|
||||
# Mock output
|
||||
mock_output = Mock()
|
||||
mock_output.serialize = Mock(return_value={"error": "test error"})
|
||||
mock_schema.outputs = {"error": mock_output}
|
||||
|
||||
# Mock schema call to return error
|
||||
mock_schema.return_value = ("error", {"error": "test error"})
|
||||
|
||||
plugin_step = PluginStep(schema=mock_schema, error_output_ids=["error"])
|
||||
|
||||
# Create a temporary YAML file
|
||||
test_data = [
|
||||
{"id": "test_step", "config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([plugin_step])
|
||||
|
||||
with self.assertRaises(Exception) as context:
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
self.assertIn("failed", str(context.exception))
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
@patch('krkn.scenario_plugins.native.plugins.logging')
|
||||
def test_run_with_kubeconfig_path_injection(self, mock_logging):
|
||||
"""Test run method injects kubeconfig_path when property exists"""
|
||||
# Create mock schema with kubeconfig_path in input properties
|
||||
mock_schema = Mock()
|
||||
mock_schema.id = "test_step"
|
||||
|
||||
# Mock input schema with kubeconfig_path property
|
||||
mock_input_instance = Mock()
|
||||
mock_input = Mock()
|
||||
mock_input.properties = {"kubeconfig_path": Mock()}
|
||||
mock_input.unserialize = Mock(return_value=mock_input_instance)
|
||||
mock_schema.input = mock_input
|
||||
|
||||
# Mock output
|
||||
mock_output = Mock()
|
||||
mock_output.serialize = Mock(return_value={"status": "success"})
|
||||
mock_schema.outputs = {"success": mock_output}
|
||||
|
||||
# Mock schema call
|
||||
mock_schema.return_value = ("success", {"status": "success"})
|
||||
|
||||
plugin_step = PluginStep(schema=mock_schema, error_output_ids=["error"])
|
||||
|
||||
# Create a temporary YAML file
|
||||
test_data = [
|
||||
{"id": "test_step", "config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([plugin_step])
|
||||
plugins.run(temp_file, "/custom/kubeconfig", "/path/to/kraken_config", "test-uuid")
|
||||
|
||||
# Verify kubeconfig_path was set
|
||||
self.assertEqual(mock_input_instance.kubeconfig_path, "/custom/kubeconfig")
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
@patch('krkn.scenario_plugins.native.plugins.logging')
|
||||
def test_run_with_kraken_config_injection(self, mock_logging):
|
||||
"""Test run method injects kraken_config when property exists"""
|
||||
# Create mock schema with kraken_config in input properties
|
||||
mock_schema = Mock()
|
||||
mock_schema.id = "test_step"
|
||||
|
||||
# Mock input schema with kraken_config property
|
||||
mock_input_instance = Mock()
|
||||
mock_input = Mock()
|
||||
mock_input.properties = {"kraken_config": Mock()}
|
||||
mock_input.unserialize = Mock(return_value=mock_input_instance)
|
||||
mock_schema.input = mock_input
|
||||
|
||||
# Mock output
|
||||
mock_output = Mock()
|
||||
mock_output.serialize = Mock(return_value={"status": "success"})
|
||||
mock_schema.outputs = {"success": mock_output}
|
||||
|
||||
# Mock schema call
|
||||
mock_schema.return_value = ("success", {"status": "success"})
|
||||
|
||||
plugin_step = PluginStep(schema=mock_schema, error_output_ids=["error"])
|
||||
|
||||
# Create a temporary YAML file
|
||||
test_data = [
|
||||
{"id": "test_step", "config": {"param": "value"}}
|
||||
]
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
|
||||
yaml.dump(test_data, f)
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
plugins = Plugins([plugin_step])
|
||||
plugins.run(temp_file, "/path/to/kubeconfig", "/custom/kraken.yaml", "test-uuid")
|
||||
|
||||
# Verify kraken_config was set
|
||||
self.assertEqual(mock_input_instance.kraken_config, "/custom/kraken.yaml")
|
||||
finally:
|
||||
Path(temp_file).unlink()
|
||||
|
||||
def test_json_schema(self):
|
||||
"""Test json_schema method"""
|
||||
# Create mock schema with jsonschema support
|
||||
mock_schema = Mock()
|
||||
mock_schema.id = "test_step"
|
||||
|
||||
plugin_step = PluginStep(schema=mock_schema, error_output_ids=["error"])
|
||||
|
||||
with patch('krkn.scenario_plugins.native.plugins.jsonschema') as mock_jsonschema:
|
||||
# Mock the step_input function
|
||||
mock_jsonschema.step_input.return_value = {
|
||||
"$id": "http://example.com",
|
||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||
"title": "Test Schema",
|
||||
"description": "Test description",
|
||||
"type": "object",
|
||||
"properties": {"param": {"type": "string"}}
|
||||
}
|
||||
|
||||
plugins = Plugins([plugin_step])
|
||||
result = plugins.json_schema()
|
||||
|
||||
# Verify it returns a JSON string
|
||||
self.assertIsInstance(result, str)
|
||||
|
||||
# Parse and verify structure
|
||||
parsed = json.loads(result)
|
||||
self.assertEqual(parsed["$id"], "https://github.com/redhat-chaos/krkn/")
|
||||
self.assertEqual(parsed["type"], "array")
|
||||
self.assertEqual(parsed["minContains"], 1)
|
||||
self.assertIn("items", parsed)
|
||||
self.assertIn("oneOf", parsed["items"])
|
||||
|
||||
# Verify step is included
|
||||
self.assertEqual(len(parsed["items"]["oneOf"]), 1)
|
||||
step_schema = parsed["items"]["oneOf"][0]
|
||||
self.assertEqual(step_schema["properties"]["id"]["const"], "test_step")
|
||||
|
||||
|
||||
class TestPLUGINSConstant(unittest.TestCase):
|
||||
"""Test cases for the PLUGINS constant"""
|
||||
|
||||
def test_plugins_initialized(self):
|
||||
"""Test that PLUGINS constant is properly initialized"""
|
||||
self.assertIsInstance(PLUGINS, Plugins)
|
||||
|
||||
# Verify all expected steps are registered
|
||||
expected_steps = [
|
||||
"run_python",
|
||||
"network_chaos",
|
||||
"pod_network_outage",
|
||||
"pod_egress_shaping",
|
||||
"pod_ingress_shaping"
|
||||
]
|
||||
|
||||
for step_id in expected_steps:
|
||||
self.assertIn(step_id, PLUGINS.steps_by_id)
|
||||
|
||||
# Ensure the registered id matches the decorator and no legacy alias is present
|
||||
self.assertEqual(
|
||||
PLUGINS.steps_by_id["pod_network_outage"].schema.id,
|
||||
"pod_network_outage",
|
||||
)
|
||||
self.assertNotIn("pod_outage", PLUGINS.steps_by_id)
|
||||
|
||||
def test_plugins_step_count(self):
|
||||
"""Test that PLUGINS has the expected number of steps"""
|
||||
self.assertEqual(len(PLUGINS.steps_by_id), 5)
|
||||
|
||||
71
krkn/utils/ErrorCollectionHandler.py
Normal file
71
krkn/utils/ErrorCollectionHandler.py
Normal file
@@ -0,0 +1,71 @@
|
||||
import logging
|
||||
import threading
|
||||
from datetime import datetime, timezone
|
||||
from krkn.utils.ErrorLog import ErrorLog
|
||||
|
||||
|
||||
class ErrorCollectionHandler(logging.Handler):
|
||||
"""
|
||||
Custom logging handler that captures ERROR and CRITICAL level logs
|
||||
in structured format for telemetry collection.
|
||||
|
||||
Stores logs in memory as ErrorLog objects for later retrieval.
|
||||
Thread-safe for concurrent logging operations.
|
||||
"""
|
||||
|
||||
def __init__(self, level=logging.ERROR):
|
||||
"""
|
||||
Initialize the error collection handler.
|
||||
|
||||
Args:
|
||||
level: Minimum log level to capture (default: ERROR)
|
||||
"""
|
||||
super().__init__(level)
|
||||
self.error_logs: list[ErrorLog] = []
|
||||
self._lock = threading.Lock()
|
||||
|
||||
def emit(self, record: logging.LogRecord):
|
||||
"""
|
||||
Capture ERROR and CRITICAL logs and store as ErrorLog objects.
|
||||
|
||||
Args:
|
||||
record: LogRecord from Python logging framework
|
||||
"""
|
||||
try:
|
||||
# Only capture ERROR (40) and CRITICAL (50) levels
|
||||
if record.levelno < logging.ERROR:
|
||||
return
|
||||
|
||||
# Format timestamp as ISO 8601 UTC
|
||||
timestamp = datetime.fromtimestamp(
|
||||
record.created, tz=timezone.utc
|
||||
).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
|
||||
|
||||
# Create ErrorLog object
|
||||
error_log = ErrorLog(
|
||||
timestamp=timestamp,
|
||||
message=record.getMessage()
|
||||
)
|
||||
|
||||
# Thread-safe append
|
||||
with self._lock:
|
||||
self.error_logs.append(error_log)
|
||||
|
||||
except Exception:
|
||||
# Handler should never raise exceptions (logging best practice)
|
||||
self.handleError(record)
|
||||
|
||||
def get_error_logs(self) -> list[dict]:
|
||||
"""
|
||||
Retrieve all collected error logs as list of dictionaries.
|
||||
|
||||
Returns:
|
||||
List of error log dictionaries with timestamp and message
|
||||
"""
|
||||
with self._lock:
|
||||
return [log.to_dict() for log in self.error_logs]
|
||||
|
||||
def clear(self):
|
||||
"""Clear all collected error logs (useful for testing)"""
|
||||
with self._lock:
|
||||
self.error_logs.clear()
|
||||
18
krkn/utils/ErrorLog.py
Normal file
18
krkn/utils/ErrorLog.py
Normal file
@@ -0,0 +1,18 @@
|
||||
from dataclasses import dataclass, asdict
|
||||
|
||||
|
||||
@dataclass
|
||||
class ErrorLog:
|
||||
"""
|
||||
Represents a single error log entry for telemetry collection.
|
||||
|
||||
Attributes:
|
||||
timestamp: ISO 8601 formatted timestamp (UTC)
|
||||
message: Full error message text
|
||||
"""
|
||||
timestamp: str
|
||||
message: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization"""
|
||||
return asdict(self)
|
||||
@@ -77,7 +77,7 @@ class HealthChecker:
|
||||
success_response = {
|
||||
"url": url,
|
||||
"status": True,
|
||||
"status_code": response["status_code"],
|
||||
"status_code": health_check_tracker[url]["status_code"],
|
||||
"start_timestamp": health_check_tracker[url]["start_timestamp"].isoformat(),
|
||||
"end_timestamp": health_check_end_time_stamp.isoformat(),
|
||||
"duration": duration
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
|
||||
import time
|
||||
import logging
|
||||
import math
|
||||
import queue
|
||||
from datetime import datetime
|
||||
from krkn_lib.models.telemetry.models import VirtCheck
|
||||
@@ -19,38 +20,57 @@ class VirtChecker:
|
||||
self.namespace = get_yaml_item_value(kubevirt_check_config, "namespace", "")
|
||||
self.vm_list = []
|
||||
self.threads = []
|
||||
self.iteration_lock = threading.Lock() # Lock to protect current_iterations
|
||||
self.threads_limit = threads_limit
|
||||
if self.namespace == "":
|
||||
logging.info("kube virt checks config is not defined, skipping them")
|
||||
return
|
||||
# setting to 0 in case no variables are set, so no threads later get made
|
||||
self.batch_size = 0
|
||||
self.ret_value = 0
|
||||
vmi_name_match = get_yaml_item_value(kubevirt_check_config, "name", ".*")
|
||||
self.krkn_lib = krkn_lib
|
||||
self.disconnected = get_yaml_item_value(kubevirt_check_config, "disconnected", False)
|
||||
self.only_failures = get_yaml_item_value(kubevirt_check_config, "only_failures", False)
|
||||
self.interval = get_yaml_item_value(kubevirt_check_config, "interval", 2)
|
||||
self.ssh_node = get_yaml_item_value(kubevirt_check_config, "ssh_node", "")
|
||||
self.node_names = get_yaml_item_value(kubevirt_check_config, "node_names", "")
|
||||
self.exit_on_failure = get_yaml_item_value(kubevirt_check_config, "exit_on_failure", False)
|
||||
if self.namespace == "":
|
||||
logging.info("kube virt checks config is not defined, skipping them")
|
||||
return
|
||||
try:
|
||||
self.kube_vm_plugin = KubevirtVmOutageScenarioPlugin()
|
||||
self.kube_vm_plugin.init_clients(k8s_client=krkn_lib)
|
||||
vmis = self.kube_vm_plugin.get_vmis(vmi_name_match,self.namespace)
|
||||
|
||||
self.kube_vm_plugin.get_vmis(vmi_name_match,self.namespace)
|
||||
except Exception as e:
|
||||
logging.error('Virt Check init exception: ' + str(e))
|
||||
return
|
||||
|
||||
for vmi in vmis:
|
||||
return
|
||||
# See if multiple node names exist
|
||||
node_name_list = [node_name for node_name in self.node_names.split(',') if node_name]
|
||||
for vmi in self.kube_vm_plugin.vmis_list:
|
||||
node_name = vmi.get("status",{}).get("nodeName")
|
||||
vmi_name = vmi.get("metadata",{}).get("name")
|
||||
ip_address = vmi.get("status",{}).get("interfaces",[])[0].get("ipAddress")
|
||||
self.vm_list.append(VirtCheck({'vm_name':vmi_name, 'ip_address': ip_address, 'namespace':self.namespace, 'node_name':node_name, "new_ip_address":""}))
|
||||
interfaces = vmi.get("status",{}).get("interfaces",[])
|
||||
if not interfaces:
|
||||
logging.warning(f"VMI {vmi_name} has no network interfaces, skipping")
|
||||
continue
|
||||
ip_address = interfaces[0].get("ipAddress")
|
||||
namespace = vmi.get("metadata",{}).get("namespace")
|
||||
# If node_name_list exists, only add if node name is in list
|
||||
|
||||
if len(node_name_list) > 0 and node_name in node_name_list:
|
||||
self.vm_list.append(VirtCheck({'vm_name':vmi_name, 'ip_address': ip_address, 'namespace':namespace, 'node_name':node_name, "new_ip_address":""}))
|
||||
elif len(node_name_list) == 0:
|
||||
# If node_name_list is blank, add all vms
|
||||
self.vm_list.append(VirtCheck({'vm_name':vmi_name, 'ip_address': ip_address, 'namespace':namespace, 'node_name':node_name, "new_ip_address":""}))
|
||||
self.batch_size = math.ceil(len(self.vm_list)/self.threads_limit)
|
||||
|
||||
def check_disconnected_access(self, ip_address: str, worker_name:str = '', vmi_name: str = ''):
|
||||
|
||||
virtctl_vm_cmd = f"ssh core@{worker_name} 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address}'"
|
||||
virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address}'"
|
||||
|
||||
all_out = invoke_no_exit(virtctl_vm_cmd)
|
||||
logging.debug(f"Checking disconnected access for {ip_address} on {worker_name} output: {all_out}")
|
||||
virtctl_vm_cmd = f"ssh core@{worker_name} 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
logging.debug(f"Checking disconnected access for {ip_address} on {worker_name} with command: {virtctl_vm_cmd}")
|
||||
virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
output = invoke_no_exit(virtctl_vm_cmd)
|
||||
if 'True' in output:
|
||||
logging.debug(f"Disconnected access for {ip_address} on {worker_name} is successful: {output}")
|
||||
@@ -58,20 +78,19 @@ class VirtChecker:
|
||||
else:
|
||||
logging.debug(f"Disconnected access for {ip_address} on {worker_name} is failed: {output}")
|
||||
vmi = self.kube_vm_plugin.get_vmi(vmi_name,self.namespace)
|
||||
new_ip_address = vmi.get("status",{}).get("interfaces",[])[0].get("ipAddress")
|
||||
interfaces = vmi.get("status",{}).get("interfaces",[])
|
||||
new_ip_address = interfaces[0].get("ipAddress") if interfaces else None
|
||||
new_node_name = vmi.get("status",{}).get("nodeName")
|
||||
# if vm gets deleted, it'll start up with a new ip address
|
||||
if new_ip_address != ip_address:
|
||||
virtctl_vm_cmd = f"ssh core@{worker_name} 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
logging.debug(f"Checking disconnected access for {new_ip_address} on {worker_name} with command: {virtctl_vm_cmd}")
|
||||
virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
new_output = invoke_no_exit(virtctl_vm_cmd)
|
||||
logging.debug(f"Disconnected access for {ip_address} on {worker_name}: {new_output}")
|
||||
if 'True' in new_output:
|
||||
return True, new_ip_address, None
|
||||
# if node gets stopped, vmis will start up with a new node (and with new ip)
|
||||
if new_node_name != worker_name:
|
||||
virtctl_vm_cmd = f"ssh core@{new_node_name} 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
logging.debug(f"Checking disconnected access for {new_ip_address} on {new_node_name} with command: {virtctl_vm_cmd}")
|
||||
virtctl_vm_cmd = f"ssh core@{new_node_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
new_output = invoke_no_exit(virtctl_vm_cmd)
|
||||
logging.debug(f"Disconnected access for {ip_address} on {new_node_name}: {new_output}")
|
||||
if 'True' in new_output:
|
||||
@@ -79,8 +98,7 @@ class VirtChecker:
|
||||
# try to connect with a common "up" node as last resort
|
||||
if self.ssh_node:
|
||||
# using new_ip_address here since if it hasn't changed it'll match ip_address
|
||||
virtctl_vm_cmd = f"ssh core@{self.ssh_node} 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
logging.debug(f"Checking disconnected access for {new_ip_address} on {self.ssh_node} with command: {virtctl_vm_cmd}")
|
||||
virtctl_vm_cmd = f"ssh core@{self.ssh_node} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
|
||||
new_output = invoke_no_exit(virtctl_vm_cmd)
|
||||
logging.debug(f"Disconnected access for {new_ip_address} on {self.ssh_node}: {new_output}")
|
||||
if 'True' in new_output:
|
||||
@@ -89,7 +107,7 @@ class VirtChecker:
|
||||
|
||||
def get_vm_access(self, vm_name: str = '', namespace: str = ''):
|
||||
"""
|
||||
This method returns True when the VM is access and an error message when it is not, using virtctl protocol
|
||||
This method returns True when the VM is accessible and an error message when it is not, using virtctl protocol
|
||||
:param vm_name:
|
||||
:param namespace:
|
||||
:return: virtctl_status 'True' if successful, or an error message if it fails.
|
||||
@@ -108,22 +126,36 @@ class VirtChecker:
|
||||
for thread in self.threads:
|
||||
thread.join()
|
||||
|
||||
def batch_list(self, queue: queue.Queue, batch_size=20):
|
||||
# Provided prints to easily visualize how the threads are processed.
|
||||
for i in range (0, len(self.vm_list),batch_size):
|
||||
sub_list = self.vm_list[i: i+batch_size]
|
||||
index = i
|
||||
t = threading.Thread(target=self.run_virt_check,name=str(index), args=(sub_list,queue))
|
||||
self.threads.append(t)
|
||||
t.start()
|
||||
def batch_list(self, queue: queue.SimpleQueue = None):
|
||||
if self.batch_size > 0:
|
||||
# Provided prints to easily visualize how the threads are processed.
|
||||
for i in range (0, len(self.vm_list),self.batch_size):
|
||||
if i+self.batch_size > len(self.vm_list):
|
||||
sub_list = self.vm_list[i:]
|
||||
else:
|
||||
sub_list = self.vm_list[i: i+self.batch_size]
|
||||
index = i
|
||||
t = threading.Thread(target=self.run_virt_check,name=str(index), args=(sub_list,queue))
|
||||
self.threads.append(t)
|
||||
t.start()
|
||||
|
||||
|
||||
def run_virt_check(self, vm_list_batch, virt_check_telemetry_queue: queue.Queue):
|
||||
def increment_iterations(self):
|
||||
"""Thread-safe method to increment current_iterations"""
|
||||
with self.iteration_lock:
|
||||
self.current_iterations += 1
|
||||
|
||||
def run_virt_check(self, vm_list_batch, virt_check_telemetry_queue: queue.SimpleQueue):
|
||||
|
||||
virt_check_telemetry = []
|
||||
virt_check_tracker = {}
|
||||
while self.current_iterations < self.iterations:
|
||||
while True:
|
||||
# Thread-safe read of current_iterations
|
||||
with self.iteration_lock:
|
||||
current = self.current_iterations
|
||||
if current >= self.iterations:
|
||||
break
|
||||
for vm in vm_list_batch:
|
||||
start_time= datetime.now()
|
||||
try:
|
||||
if not self.disconnected:
|
||||
vm_status = self.get_vm_access(vm.vm_name, vm.namespace)
|
||||
@@ -139,8 +171,9 @@ class VirtChecker:
|
||||
if new_node_name and vm.node_name != new_node_name:
|
||||
vm.node_name = new_node_name
|
||||
except Exception:
|
||||
logging.info('Exception in get vm status')
|
||||
vm_status = False
|
||||
|
||||
|
||||
if vm.vm_name not in virt_check_tracker:
|
||||
start_timestamp = datetime.now()
|
||||
virt_check_tracker[vm.vm_name] = {
|
||||
@@ -153,6 +186,7 @@ class VirtChecker:
|
||||
"new_ip_address": vm.new_ip_address
|
||||
}
|
||||
else:
|
||||
|
||||
if vm_status != virt_check_tracker[vm.vm_name]["status"]:
|
||||
end_timestamp = datetime.now()
|
||||
start_timestamp = virt_check_tracker[vm.vm_name]["start_timestamp"]
|
||||
@@ -181,4 +215,66 @@ class VirtChecker:
|
||||
virt_check_telemetry.append(VirtCheck(virt_check_tracker[vm]))
|
||||
else:
|
||||
virt_check_telemetry.append(VirtCheck(virt_check_tracker[vm]))
|
||||
virt_check_telemetry_queue.put(virt_check_telemetry)
|
||||
try:
|
||||
virt_check_telemetry_queue.put(virt_check_telemetry)
|
||||
except Exception as e:
|
||||
logging.error('Put queue error ' + str(e))
|
||||
def run_post_virt_check(self, vm_list_batch, virt_check_telemetry, post_virt_check_queue: queue.SimpleQueue):
|
||||
|
||||
virt_check_telemetry = []
|
||||
virt_check_tracker = {}
|
||||
start_timestamp = datetime.now()
|
||||
for vm in vm_list_batch:
|
||||
|
||||
try:
|
||||
if not self.disconnected:
|
||||
vm_status = self.get_vm_access(vm.vm_name, vm.namespace)
|
||||
else:
|
||||
vm_status, new_ip_address, new_node_name = self.check_disconnected_access(vm.ip_address, vm.node_name, vm.vm_name)
|
||||
if new_ip_address and vm.ip_address != new_ip_address:
|
||||
vm.new_ip_address = new_ip_address
|
||||
if new_node_name and vm.node_name != new_node_name:
|
||||
vm.node_name = new_node_name
|
||||
except Exception:
|
||||
vm_status = False
|
||||
|
||||
if not vm_status:
|
||||
|
||||
virt_check_tracker= {
|
||||
"vm_name": vm.vm_name,
|
||||
"ip_address": vm.ip_address,
|
||||
"namespace": vm.namespace,
|
||||
"node_name": vm.node_name,
|
||||
"status": vm_status,
|
||||
"start_timestamp": start_timestamp.isoformat(),
|
||||
"new_ip_address": vm.new_ip_address,
|
||||
"duration": 0,
|
||||
"end_timestamp": start_timestamp.isoformat()
|
||||
}
|
||||
|
||||
virt_check_telemetry.append(VirtCheck(virt_check_tracker))
|
||||
post_virt_check_queue.put(virt_check_telemetry)
|
||||
|
||||
|
||||
def gather_post_virt_checks(self, kubevirt_check_telem):
|
||||
|
||||
post_kubevirt_check_queue = queue.SimpleQueue()
|
||||
post_threads = []
|
||||
|
||||
if self.batch_size > 0:
|
||||
for i in range (0, len(self.vm_list),self.batch_size):
|
||||
sub_list = self.vm_list[i: i+self.batch_size]
|
||||
index = i
|
||||
t = threading.Thread(target=self.run_post_virt_check,name=str(index), args=(sub_list,kubevirt_check_telem, post_kubevirt_check_queue))
|
||||
post_threads.append(t)
|
||||
t.start()
|
||||
|
||||
kubevirt_check_telem = []
|
||||
for thread in post_threads:
|
||||
thread.join()
|
||||
if not post_kubevirt_check_queue.empty():
|
||||
kubevirt_check_telem.extend(post_kubevirt_check_queue.get_nowait())
|
||||
|
||||
if self.exit_on_failure and len(kubevirt_check_telem) > 0:
|
||||
self.ret_value = 2
|
||||
return kubevirt_check_telem
|
||||
|
||||
@@ -1,2 +1,4 @@
|
||||
from .TeeLogHandler import TeeLogHandler
|
||||
from .ErrorLog import ErrorLog
|
||||
from .ErrorCollectionHandler import ErrorCollectionHandler
|
||||
from .functions import *
|
||||
|
||||
@@ -1,24 +1,23 @@
|
||||
aliyun-python-sdk-core==2.13.36
|
||||
aliyun-python-sdk-ecs==4.24.25
|
||||
arcaflow-plugin-sdk==0.14.0
|
||||
boto3==1.28.61
|
||||
boto3>=1.34.0 # Updated to support urllib3 2.x
|
||||
azure-identity==1.16.1
|
||||
azure-keyvault==4.2.0
|
||||
azure-mgmt-compute==30.5.0
|
||||
azure-mgmt-network==27.0.0
|
||||
itsdangerous==2.0.1
|
||||
coverage==7.6.12
|
||||
datetime==5.4
|
||||
docker==7.0.0
|
||||
docker>=6.0,<7.0 # docker 7.0+ has breaking changes; works with requests<2.32
|
||||
gitpython==3.1.41
|
||||
google-auth==2.37.0
|
||||
google-cloud-compute==1.22.0
|
||||
ibm_cloud_sdk_core==3.18.0
|
||||
ibm_vpc==0.20.0
|
||||
ibm_cloud_sdk_core>=3.20.0 # Requires urllib3>=2.1.0 (compatible with updated boto3)
|
||||
ibm_vpc==0.26.3 # Requires ibm_cloud_sdk_core
|
||||
jinja2==3.1.6
|
||||
krkn-lib==5.1.11
|
||||
lxml==5.1.0
|
||||
kubernetes==34.1.0
|
||||
krkn-lib==6.0.3
|
||||
numpy==1.26.4
|
||||
pandas==2.2.0
|
||||
openshift-client==1.0.21
|
||||
@@ -28,13 +27,15 @@ pyfiglet==1.0.2
|
||||
pytest==8.0.0
|
||||
python-ipmi==0.5.4
|
||||
python-openstackclient==6.5.0
|
||||
requests==2.32.4
|
||||
requests<2.32 # requests 2.32+ breaks Unix socket support (http+docker scheme)
|
||||
requests-unixsocket>=0.4.0 # Required for Docker Unix socket support
|
||||
urllib3>=2.1.0,<2.4.0 # Compatible with all dependencies
|
||||
service_identity==24.1.0
|
||||
PyYAML==6.0.1
|
||||
setuptools==78.1.1
|
||||
werkzeug==3.0.6
|
||||
wheel==0.42.0
|
||||
zope.interface==5.4.0
|
||||
wheel>=0.44.0
|
||||
zope.interface==6.1
|
||||
colorlog==6.10.1
|
||||
|
||||
git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.0.0
|
||||
cryptography>=42.0.4 # not directly required, pinned by Snyk to avoid a vulnerability
|
||||
|
||||
@@ -6,6 +6,7 @@ import sys
|
||||
import yaml
|
||||
import logging
|
||||
import optparse
|
||||
from colorlog import ColoredFormatter
|
||||
import pyfiglet
|
||||
import uuid
|
||||
import time
|
||||
@@ -27,7 +28,7 @@ from krkn_lib.models.telemetry import ChaosRunTelemetry
|
||||
from krkn_lib.utils import SafeLogger
|
||||
from krkn_lib.utils.functions import get_yaml_item_value, get_junit_test_case
|
||||
|
||||
from krkn.utils import TeeLogHandler
|
||||
from krkn.utils import TeeLogHandler, ErrorCollectionHandler
|
||||
from krkn.utils.HealthChecker import HealthChecker
|
||||
from krkn.utils.VirtChecker import VirtChecker
|
||||
from krkn.scenario_plugins.scenario_plugin_factory import (
|
||||
@@ -133,7 +134,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
telemetry_api_url = config["telemetry"].get("api_url")
|
||||
health_check_config = get_yaml_item_value(config, "health_checks",{})
|
||||
kubevirt_check_config = get_yaml_item_value(config, "kubevirt_checks", {})
|
||||
|
||||
|
||||
# Initialize clients
|
||||
if not os.path.isfile(kubeconfig_path) and not os.path.isfile(
|
||||
"/var/run/secrets/kubernetes.io/serviceaccount/token"
|
||||
@@ -141,7 +142,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
logging.error(
|
||||
"Cannot read the kubeconfig file at %s, please check" % kubeconfig_path
|
||||
)
|
||||
return 1
|
||||
return -1
|
||||
logging.info("Initializing client to talk to the Kubernetes cluster")
|
||||
|
||||
# Generate uuid for the run
|
||||
@@ -184,10 +185,10 @@ def main(options, command: Optional[str]) -> int:
|
||||
# Set up kraken url to track signal
|
||||
if not 0 <= int(port) <= 65535:
|
||||
logging.error("%s isn't a valid port number, please check" % (port))
|
||||
return 1
|
||||
return -1
|
||||
if not signal_address:
|
||||
logging.error("Please set the signal address in the config")
|
||||
return 1
|
||||
return -1
|
||||
address = (signal_address, port)
|
||||
|
||||
# If publish_running_status is False this should keep us going
|
||||
@@ -220,7 +221,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
"invalid distribution selected, running openshift scenarios against kubernetes cluster."
|
||||
"Please set 'kubernetes' in config.yaml krkn.platform and try again"
|
||||
)
|
||||
return 1
|
||||
return -1
|
||||
if cv != "":
|
||||
logging.info(cv)
|
||||
else:
|
||||
@@ -326,7 +327,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
args=(health_check_config, health_check_telemetry_queue))
|
||||
health_check_worker.start()
|
||||
|
||||
kubevirt_check_telemetry_queue = queue.Queue()
|
||||
kubevirt_check_telemetry_queue = queue.SimpleQueue()
|
||||
kubevirt_checker = VirtChecker(kubevirt_check_config, iterations=iterations, krkn_lib=kubecli)
|
||||
kubevirt_checker.batch_list(kubevirt_check_telemetry_queue)
|
||||
|
||||
@@ -361,7 +362,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
logging.error(
|
||||
f"impossible to find scenario {scenario_type}, plugin not found. Exiting"
|
||||
)
|
||||
sys.exit(1)
|
||||
sys.exit(-1)
|
||||
|
||||
failed_post_scenarios, scenario_telemetries = (
|
||||
scenario_plugin.run_scenarios(
|
||||
@@ -393,8 +394,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
|
||||
iteration += 1
|
||||
health_checker.current_iterations += 1
|
||||
kubevirt_checker.current_iterations += 1
|
||||
|
||||
kubevirt_checker.increment_iterations()
|
||||
# telemetry
|
||||
# in order to print decoded telemetry data even if telemetry collection
|
||||
# is disabled, it's necessary to serialize the ChaosRunTelemetry object
|
||||
@@ -408,15 +408,12 @@ def main(options, command: Optional[str]) -> int:
|
||||
|
||||
kubevirt_checker.thread_join()
|
||||
kubevirt_check_telem = []
|
||||
i =0
|
||||
while i <= kubevirt_checker.threads_limit:
|
||||
if not kubevirt_check_telemetry_queue.empty():
|
||||
kubevirt_check_telem.extend(kubevirt_check_telemetry_queue.get_nowait())
|
||||
else:
|
||||
break
|
||||
i+= 1
|
||||
while not kubevirt_check_telemetry_queue.empty():
|
||||
kubevirt_check_telem.extend(kubevirt_check_telemetry_queue.get_nowait())
|
||||
chaos_telemetry.virt_checks = kubevirt_check_telem
|
||||
|
||||
|
||||
post_kubevirt_check = kubevirt_checker.gather_post_virt_checks(kubevirt_check_telem)
|
||||
chaos_telemetry.post_virt_checks = post_kubevirt_check
|
||||
# if platform is openshift will be collected
|
||||
# Cloud platform and network plugins metadata
|
||||
# through OCP specific APIs
|
||||
@@ -429,16 +426,22 @@ def main(options, command: Optional[str]) -> int:
|
||||
logging.info("collecting Kubernetes cluster metadata....")
|
||||
telemetry_k8s.collect_cluster_metadata(chaos_telemetry)
|
||||
|
||||
# Collect error logs from handler
|
||||
error_logs = error_collection_handler.get_error_logs()
|
||||
if error_logs:
|
||||
logging.info(f"Collected {len(error_logs)} error logs for telemetry")
|
||||
chaos_telemetry.error_logs = error_logs
|
||||
else:
|
||||
logging.info("No error logs collected during chaos run")
|
||||
chaos_telemetry.error_logs = []
|
||||
|
||||
telemetry_json = chaos_telemetry.to_json()
|
||||
decoded_chaos_run_telemetry = ChaosRunTelemetry(json.loads(telemetry_json))
|
||||
chaos_output.telemetry = decoded_chaos_run_telemetry
|
||||
logging.info(f"Chaos data:\n{chaos_output.to_json()}")
|
||||
if enable_elastic:
|
||||
elastic_telemetry = ElasticChaosRunTelemetry(
|
||||
chaos_run_telemetry=decoded_chaos_run_telemetry
|
||||
)
|
||||
result = elastic_search.push_telemetry(
|
||||
elastic_telemetry, elastic_telemetry_index
|
||||
decoded_chaos_run_telemetry, elastic_telemetry_index
|
||||
)
|
||||
if result == -1:
|
||||
safe_logger.error(
|
||||
@@ -526,7 +529,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
|
||||
else:
|
||||
logging.error("Alert profile is not defined")
|
||||
return 1
|
||||
return -1
|
||||
# sys.exit(1)
|
||||
if enable_metrics:
|
||||
logging.info(f'Capturing metrics using file {metrics_profile}')
|
||||
@@ -541,21 +544,28 @@ def main(options, command: Optional[str]) -> int:
|
||||
telemetry_json
|
||||
)
|
||||
|
||||
# want to exit with 1 first to show failure of scenario
|
||||
# even if alerts failing
|
||||
if failed_post_scenarios:
|
||||
logging.error(
|
||||
"Post scenarios are still failing at the end of all iterations"
|
||||
)
|
||||
# sys.exit(1)
|
||||
return 1
|
||||
|
||||
if post_critical_alerts > 0:
|
||||
logging.error("Critical alerts are firing, please check; exiting")
|
||||
# sys.exit(2)
|
||||
return 2
|
||||
|
||||
if failed_post_scenarios:
|
||||
logging.error(
|
||||
"Post scenarios are still failing at the end of all iterations"
|
||||
)
|
||||
# sys.exit(2)
|
||||
return 2
|
||||
if health_checker.ret_value != 0:
|
||||
logging.error("Health check failed for the applications, Please check; exiting")
|
||||
return health_checker.ret_value
|
||||
|
||||
if kubevirt_checker.ret_value != 0:
|
||||
logging.error("Kubevirt check still had failed VMIs at end of run, Please check; exiting")
|
||||
return kubevirt_checker.ret_value
|
||||
|
||||
logging.info(
|
||||
"Successfully finished running Kraken. UUID for the run: "
|
||||
"%s. Report generated at %s. Exiting" % (run_uuid, report_file)
|
||||
@@ -563,7 +573,7 @@ def main(options, command: Optional[str]) -> int:
|
||||
else:
|
||||
logging.error("Cannot find a config at %s, please check" % (cfg))
|
||||
# sys.exit(1)
|
||||
return 2
|
||||
return -1
|
||||
|
||||
return 0
|
||||
|
||||
@@ -643,15 +653,30 @@ if __name__ == "__main__":
|
||||
# If no command or regular execution, continue with existing logic
|
||||
report_file = options.output
|
||||
tee_handler = TeeLogHandler()
|
||||
|
||||
fmt = "%(asctime)s [%(levelname)s] %(message)s"
|
||||
plain = logging.Formatter(fmt)
|
||||
colored = ColoredFormatter(
|
||||
"%(asctime)s [%(log_color)s%(levelname)s%(reset)s] %(message)s",
|
||||
log_colors={'DEBUG': 'white', 'INFO': 'white', 'WARNING': 'yellow', 'ERROR': 'red', 'CRITICAL': 'bold_red'},
|
||||
reset=True, style='%'
|
||||
)
|
||||
file_handler = logging.FileHandler(report_file, mode="w")
|
||||
file_handler.setFormatter(plain)
|
||||
stream_handler = logging.StreamHandler()
|
||||
stream_handler.setFormatter(colored)
|
||||
tee_handler.setFormatter(plain)
|
||||
error_collection_handler = ErrorCollectionHandler(level=logging.ERROR)
|
||||
|
||||
handlers = [
|
||||
logging.FileHandler(report_file, mode="w"),
|
||||
logging.StreamHandler(),
|
||||
file_handler,
|
||||
stream_handler,
|
||||
tee_handler,
|
||||
error_collection_handler,
|
||||
]
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.DEBUG if options.debug else logging.INFO,
|
||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
||||
handlers=handlers,
|
||||
)
|
||||
option_error = False
|
||||
@@ -732,4 +757,4 @@ if __name__ == "__main__":
|
||||
with open(junit_testcase_file_path, "w") as stream:
|
||||
stream.write(junit_testcase_xml)
|
||||
|
||||
sys.exit(retval)
|
||||
sys.exit(retval)
|
||||
|
||||
@@ -1,16 +1,18 @@
|
||||
node_scenarios:
|
||||
- actions: # node chaos scenarios to be injected
|
||||
- node_stop_start_scenario
|
||||
node_name: kind-worker # node on which scenario has to be injected; can set multiple names separated by comma
|
||||
# label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
|
||||
# node_name: kind-control-plane # node on which scenario has to be injected; can set multiple names separated by comma
|
||||
label_selector: kubernetes.io/hostname=kind-worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
|
||||
instance_count: 1 # Number of nodes to perform action/select that match the label selector
|
||||
runs: 1 # number of times to inject each scenario under actions (will perform on same node each time)
|
||||
timeout: 120 # duration to wait for completion of node scenario injection
|
||||
cloud_type: docker # cloud type on which Kubernetes/OpenShift runs
|
||||
duration: 10
|
||||
- actions:
|
||||
- node_reboot_scenario
|
||||
node_name: kind-worker
|
||||
# label_selector: node-role.kubernetes.io/infra
|
||||
node_name: kind-control-plane
|
||||
# label_selector: kubernetes.io/hostname=kind-worker
|
||||
instance_count: 1
|
||||
timeout: 120
|
||||
cloud_type: docker
|
||||
kube_check: false
|
||||
|
||||
@@ -3,3 +3,4 @@
|
||||
namespace_pattern: "kube-system"
|
||||
label_selector: "component=etcd"
|
||||
krkn_pod_recovery_time: 120
|
||||
kill: 1
|
||||
6
scenarios/kind/pod_path_provisioner.yml
Executable file
6
scenarios/kind/pod_path_provisioner.yml
Executable file
@@ -0,0 +1,6 @@
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: "local-path-storage"
|
||||
label_selector: "app=local-path-provisioner"
|
||||
krkn_pod_recovery_time: 20
|
||||
kill: 1
|
||||
7
scenarios/kind/pvc_scenario.yaml
Normal file
7
scenarios/kind/pvc_scenario.yaml
Normal file
@@ -0,0 +1,7 @@
|
||||
pvc_scenario:
|
||||
pvc_name: kraken-test-pvc # Name of the target PVC
|
||||
pod_name: kraken-test-pod # Name of the pod where the PVC is mounted, it will be ignored if the pvc_name is defined
|
||||
namespace: kraken # Namespace where the PVC is
|
||||
fill_percentage: 98 # Target percentage to fill up the cluster, value must be higher than current percentage, valid values are between 0 and 99
|
||||
duration: 10 # Duration in seconds for the fault
|
||||
block_size: 102400 # used only by dd if fallocate not present in the container
|
||||
@@ -6,3 +6,4 @@ scenarios:
|
||||
action: 1
|
||||
count: 1
|
||||
retry_wait: 60
|
||||
exclude_label: ""
|
||||
|
||||
18
scenarios/kube/node-network-chaos.yml
Normal file
18
scenarios/kube/node-network-chaos.yml
Normal file
@@ -0,0 +1,18 @@
|
||||
- id: node_network_chaos
|
||||
image: "quay.io/krkn-chaos/krkn-network-chaos:latest"
|
||||
wait_duration: 1
|
||||
test_duration: 60
|
||||
label_selector: ""
|
||||
service_account: ""
|
||||
taints: []
|
||||
namespace: 'default'
|
||||
instance_count: 1
|
||||
target: "<node_name>"
|
||||
execution: parallel
|
||||
interfaces: []
|
||||
ingress: true
|
||||
egress: true
|
||||
latency: 0s # supported units are us (microseconds), ms, s
|
||||
loss: 10 # percentage
|
||||
bandwidth: 1gbit #supported units are bit kbit mbit gbit tbit
|
||||
force: false
|
||||
@@ -4,7 +4,7 @@
|
||||
test_duration: 10
|
||||
label_selector: "<node_selector>"
|
||||
service_account: ""
|
||||
taints: [] # example ["node-role.kubernetes.io/master:NoSchedule"]
|
||||
taints: []
|
||||
namespace: 'default'
|
||||
instance_count: 1
|
||||
execution: parallel
|
||||
|
||||
17
scenarios/kube/pod-network-chaos.yml
Normal file
17
scenarios/kube/pod-network-chaos.yml
Normal file
@@ -0,0 +1,17 @@
|
||||
- id: pod_network_chaos
|
||||
image: "quay.io/krkn-chaos/krkn-network-chaos:latest"
|
||||
wait_duration: 1
|
||||
test_duration: 60
|
||||
label_selector: ""
|
||||
service_account: ""
|
||||
taints: []
|
||||
namespace: 'default'
|
||||
instance_count: 1
|
||||
target: "<pod_name>"
|
||||
execution: parallel
|
||||
interfaces: []
|
||||
ingress: true
|
||||
egress: true
|
||||
latency: 0s # supported units are us (microseconds), ms, s
|
||||
loss: 10 # percentage
|
||||
bandwidth: 1gbit #supported units are bit kbit mbit gbit tbit
|
||||
@@ -4,7 +4,7 @@
|
||||
test_duration: 60
|
||||
label_selector: "<pod_selector>"
|
||||
service_account: ""
|
||||
taints: [] # example ["node-role.kubernetes.io/master:NoSchedule"]
|
||||
taints: []
|
||||
namespace: 'default'
|
||||
instance_count: 1
|
||||
execution: parallel
|
||||
|
||||
@@ -3,3 +3,4 @@ application_outage: # Scenario to create an out
|
||||
namespace: <namespace-with-application> # Namespace to target - all application routes will go inaccessible if pod selector is empty
|
||||
pod_selector: {app: foo} # Pods to target
|
||||
block: [Ingress, Egress] # It can be Ingress or Egress or Ingress, Egress
|
||||
exclude_label: "" # Optional label selector to exclude pods. Supports dict, string, or list format
|
||||
|
||||
@@ -10,6 +10,7 @@ node_scenarios:
|
||||
cloud_type: aws # cloud type on which Kubernetes/OpenShift runs
|
||||
parallel: true # Run action on label or node name in parallel or sequential, defaults to sequential
|
||||
kube_check: true # Run the kubernetes api calls to see if the node gets to a certain state during the node scenario
|
||||
poll_interval: 15 # Time interval(in seconds) to periodically check the node's status
|
||||
- actions:
|
||||
- node_reboot_scenario
|
||||
node_name:
|
||||
|
||||
@@ -6,3 +6,4 @@ scenarios:
|
||||
action: 1
|
||||
count: 1
|
||||
expected_recovery_time: 120
|
||||
exclude_label: ""
|
||||
@@ -1,215 +0,0 @@
|
||||
import unittest
|
||||
import time
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import yaml
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.models.telemetry import ScenarioTelemetry
|
||||
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
|
||||
|
||||
from krkn.scenario_plugins.kubevirt_vm_outage.kubevirt_vm_outage_scenario_plugin import KubevirtVmOutageScenarioPlugin
|
||||
|
||||
|
||||
class TestKubevirtVmOutageScenarioPlugin(unittest.TestCase):
|
||||
|
||||
def setUp(self):
|
||||
"""
|
||||
Set up test fixtures for KubevirtVmOutageScenarioPlugin
|
||||
"""
|
||||
self.plugin = KubevirtVmOutageScenarioPlugin()
|
||||
|
||||
# Create mock k8s client
|
||||
self.k8s_client = MagicMock()
|
||||
self.custom_object_client = MagicMock()
|
||||
self.k8s_client.custom_object_client = self.custom_object_client
|
||||
self.plugin.k8s_client = self.k8s_client
|
||||
|
||||
# Mock methods needed for KubeVirt operations
|
||||
self.k8s_client.list_custom_resource_definition = MagicMock()
|
||||
|
||||
# Mock custom resource definition list with KubeVirt CRDs
|
||||
crd_list = MagicMock()
|
||||
crd_item = MagicMock()
|
||||
crd_item.spec = MagicMock()
|
||||
crd_item.spec.group = "kubevirt.io"
|
||||
crd_list.items = [crd_item]
|
||||
self.k8s_client.list_custom_resource_definition.return_value = crd_list
|
||||
|
||||
# Mock VMI data
|
||||
self.mock_vmi = {
|
||||
"metadata": {
|
||||
"name": "test-vm",
|
||||
"namespace": "default"
|
||||
},
|
||||
"status": {
|
||||
"phase": "Running"
|
||||
}
|
||||
}
|
||||
|
||||
# Create test config
|
||||
self.config = {
|
||||
"scenarios": [
|
||||
{
|
||||
"name": "kubevirt outage test",
|
||||
"scenario": "kubevirt_vm_outage",
|
||||
"parameters": {
|
||||
"vm_name": "test-vm",
|
||||
"namespace": "default",
|
||||
"duration": 0
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# Create a temporary config file
|
||||
import tempfile, os
|
||||
temp_dir = tempfile.gettempdir()
|
||||
self.scenario_file = os.path.join(temp_dir, "test_kubevirt_scenario.yaml")
|
||||
with open(self.scenario_file, "w") as f:
|
||||
yaml.dump(self.config, f)
|
||||
|
||||
# Mock dependencies
|
||||
self.telemetry = MagicMock(spec=KrknTelemetryOpenshift)
|
||||
self.scenario_telemetry = MagicMock(spec=ScenarioTelemetry)
|
||||
self.telemetry.get_lib_kubernetes.return_value = self.k8s_client
|
||||
|
||||
def test_successful_injection_and_recovery(self):
|
||||
"""
|
||||
Test successful deletion and recovery of a VMI
|
||||
"""
|
||||
# Mock get_vmi to return our mock VMI
|
||||
with patch.object(self.plugin, 'get_vmi', return_value=self.mock_vmi):
|
||||
# Mock inject and recover to simulate success
|
||||
with patch.object(self.plugin, 'inject', return_value=0) as mock_inject:
|
||||
with patch.object(self.plugin, 'recover', return_value=0) as mock_recover:
|
||||
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
|
||||
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
|
||||
|
||||
self.assertEqual(result, 0)
|
||||
mock_inject.assert_called_once_with("test-vm", "default", False)
|
||||
mock_recover.assert_called_once_with("test-vm", "default", False)
|
||||
|
||||
def test_injection_failure(self):
|
||||
"""
|
||||
Test failure during VMI deletion
|
||||
"""
|
||||
# Mock get_vmi to return our mock VMI
|
||||
with patch.object(self.plugin, 'get_vmi', return_value=self.mock_vmi):
|
||||
# Mock inject to simulate failure
|
||||
with patch.object(self.plugin, 'inject', return_value=1) as mock_inject:
|
||||
with patch.object(self.plugin, 'recover', return_value=0) as mock_recover:
|
||||
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
|
||||
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
|
||||
|
||||
self.assertEqual(result, 1)
|
||||
mock_inject.assert_called_once_with("test-vm", "default", False)
|
||||
mock_recover.assert_not_called()
|
||||
|
||||
def test_disable_auto_restart(self):
|
||||
"""
|
||||
Test VM auto-restart can be disabled
|
||||
"""
|
||||
# Configure test with disable_auto_restart=True
|
||||
self.config["scenarios"][0]["parameters"]["disable_auto_restart"] = True
|
||||
|
||||
# Mock VM object for patching
|
||||
mock_vm = {
|
||||
"metadata": {"name": "test-vm", "namespace": "default"},
|
||||
"spec": {}
|
||||
}
|
||||
|
||||
# Mock get_vmi to return our mock VMI
|
||||
with patch.object(self.plugin, 'get_vmi', return_value=self.mock_vmi):
|
||||
# Mock VM patch operation
|
||||
with patch.object(self.plugin, 'patch_vm_spec') as mock_patch_vm:
|
||||
mock_patch_vm.return_value = True
|
||||
# Mock inject and recover
|
||||
with patch.object(self.plugin, 'inject', return_value=0) as mock_inject:
|
||||
with patch.object(self.plugin, 'recover', return_value=0) as mock_recover:
|
||||
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
|
||||
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
|
||||
|
||||
self.assertEqual(result, 0)
|
||||
# Should call patch_vm_spec to disable auto-restart
|
||||
mock_patch_vm.assert_any_call("test-vm", "default", False)
|
||||
# Should call patch_vm_spec to re-enable auto-restart during recovery
|
||||
mock_patch_vm.assert_any_call("test-vm", "default", True)
|
||||
mock_inject.assert_called_once_with("test-vm", "default", True)
|
||||
mock_recover.assert_called_once_with("test-vm", "default", True)
|
||||
|
||||
def test_recovery_when_vmi_does_not_exist(self):
|
||||
"""
|
||||
Test recovery logic when VMI does not exist after deletion
|
||||
"""
|
||||
# Store the original VMI in the plugin for recovery
|
||||
self.plugin.original_vmi = self.mock_vmi.copy()
|
||||
|
||||
# Create a cleaned vmi_dict as the plugin would
|
||||
vmi_dict = self.mock_vmi.copy()
|
||||
|
||||
# Set up running VMI data for after recovery
|
||||
running_vmi = {
|
||||
"metadata": {"name": "test-vm", "namespace": "default"},
|
||||
"status": {"phase": "Running"}
|
||||
}
|
||||
|
||||
# Set up time.time to immediately exceed the timeout for auto-recovery
|
||||
with patch('time.time', side_effect=[0, 301, 301, 301, 301, 310, 320]):
|
||||
# Mock get_vmi to always return None (not auto-recovered)
|
||||
with patch.object(self.plugin, 'get_vmi', side_effect=[None, None, running_vmi]):
|
||||
# Mock the custom object API to return success
|
||||
self.custom_object_client.create_namespaced_custom_object = MagicMock(return_value=running_vmi)
|
||||
|
||||
# Run recovery with mocked time.sleep
|
||||
with patch('time.sleep'):
|
||||
result = self.plugin.recover("test-vm", "default", False)
|
||||
|
||||
self.assertEqual(result, 0)
|
||||
# Verify create was called with the right arguments for our API version and kind
|
||||
self.custom_object_client.create_namespaced_custom_object.assert_called_once_with(
|
||||
group="kubevirt.io",
|
||||
version="v1",
|
||||
namespace="default",
|
||||
plural="virtualmachineinstances",
|
||||
body=vmi_dict
|
||||
)
|
||||
|
||||
def test_validation_failure(self):
|
||||
"""
|
||||
Test validation failure when KubeVirt is not installed
|
||||
"""
|
||||
# Mock empty CRD list (no KubeVirt CRDs)
|
||||
empty_crd_list = MagicMock()
|
||||
empty_crd_list.items = []
|
||||
self.k8s_client.list_custom_resource_definition.return_value = empty_crd_list
|
||||
|
||||
with patch("builtins.open", unittest.mock.mock_open(read_data=yaml.dump(self.config))):
|
||||
result = self.plugin.run("test-uuid", self.scenario_file, {}, self.telemetry, self.scenario_telemetry)
|
||||
|
||||
self.assertEqual(result, 1)
|
||||
|
||||
def test_delete_vmi_timeout(self):
|
||||
"""
|
||||
Test timeout during VMI deletion
|
||||
"""
|
||||
# Mock successful delete operation
|
||||
self.custom_object_client.delete_namespaced_custom_object = MagicMock(return_value={})
|
||||
|
||||
# Mock that get_vmi always returns VMI (never gets deleted)
|
||||
with patch.object(self.plugin, 'get_vmi', return_value=self.mock_vmi):
|
||||
# Simulate timeout by making time.time return values that exceed the timeout
|
||||
with patch('time.sleep'), patch('time.time', side_effect=[0, 10, 20, 130, 130, 130, 130, 140]):
|
||||
result = self.plugin.inject("test-vm", "default", False)
|
||||
|
||||
self.assertEqual(result, 1)
|
||||
self.custom_object_client.delete_namespaced_custom_object.assert_called_once_with(
|
||||
group="kubevirt.io",
|
||||
version="v1",
|
||||
namespace="default",
|
||||
plural="virtualmachineinstances",
|
||||
name="test-vm"
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
37
tests/run_python_plugin.py
Normal file
37
tests/run_python_plugin.py
Normal file
@@ -0,0 +1,37 @@
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
from krkn.scenario_plugins.native.run_python_plugin import (
|
||||
RunPythonFileInput,
|
||||
run_python_file,
|
||||
)
|
||||
|
||||
|
||||
class RunPythonPluginTest(unittest.TestCase):
|
||||
def test_success_execution(self):
|
||||
tmp_file = tempfile.NamedTemporaryFile()
|
||||
tmp_file.write(bytes("print('Hello world!')", "utf-8"))
|
||||
tmp_file.flush()
|
||||
output_id, output_data = run_python_file(
|
||||
params=RunPythonFileInput(tmp_file.name),
|
||||
run_id="test-python-plugin-success",
|
||||
)
|
||||
self.assertEqual("success", output_id)
|
||||
self.assertEqual("Hello world!\n", output_data.stdout)
|
||||
|
||||
def test_error_execution(self):
|
||||
tmp_file = tempfile.NamedTemporaryFile()
|
||||
tmp_file.write(
|
||||
bytes("import sys\nprint('Hello world!')\nsys.exit(42)\n", "utf-8")
|
||||
)
|
||||
tmp_file.flush()
|
||||
output_id, output_data = run_python_file(
|
||||
params=RunPythonFileInput(tmp_file.name), run_id="test-python-plugin-error"
|
||||
)
|
||||
self.assertEqual("error", output_id)
|
||||
self.assertEqual(42, output_data.exit_code)
|
||||
self.assertEqual("Hello world!\n", output_data.stdout)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
415
tests/test_abstract_node_scenarios.py
Normal file
415
tests/test_abstract_node_scenarios.py
Normal file
@@ -0,0 +1,415 @@
|
||||
"""
|
||||
Test suite for AbstractNode Scenarios
|
||||
|
||||
Usage:
|
||||
python -m coverage run -a -m unittest tests/test_abstract_node_scenarios.py
|
||||
|
||||
Assisted By: Claude Code
|
||||
"""
|
||||
|
||||
import unittest
|
||||
from unittest.mock import Mock, patch
|
||||
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import abstract_node_scenarios
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.models.k8s import AffectedNode, AffectedNodeStatus
|
||||
|
||||
|
||||
class TestAbstractNodeScenarios(unittest.TestCase):
|
||||
"""Test suite for abstract_node_scenarios class"""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures before each test method"""
|
||||
self.mock_kubecli = Mock(spec=KrknKubernetes)
|
||||
self.mock_affected_nodes_status = Mock(spec=AffectedNodeStatus)
|
||||
self.mock_affected_nodes_status.affected_nodes = []
|
||||
self.node_action_kube_check = True
|
||||
|
||||
self.scenarios = abstract_node_scenarios(
|
||||
kubecli=self.mock_kubecli,
|
||||
node_action_kube_check=self.node_action_kube_check,
|
||||
affected_nodes_status=self.mock_affected_nodes_status
|
||||
)
|
||||
|
||||
def test_init(self):
|
||||
"""Test initialization of abstract_node_scenarios"""
|
||||
self.assertEqual(self.scenarios.kubecli, self.mock_kubecli)
|
||||
self.assertEqual(self.scenarios.affected_nodes_status, self.mock_affected_nodes_status)
|
||||
self.assertTrue(self.scenarios.node_action_kube_check)
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
def test_node_stop_start_scenario(self, mock_logging, mock_sleep):
|
||||
"""Test node_stop_start_scenario calls stop and start in sequence"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
duration = 60
|
||||
poll_interval = 10
|
||||
|
||||
self.scenarios.node_stop_scenario = Mock()
|
||||
self.scenarios.node_start_scenario = Mock()
|
||||
|
||||
# Act
|
||||
self.scenarios.node_stop_start_scenario(
|
||||
instance_kill_count, node, timeout, duration, poll_interval
|
||||
)
|
||||
|
||||
# Assert
|
||||
self.scenarios.node_stop_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout, poll_interval
|
||||
)
|
||||
mock_sleep.assert_called_once_with(duration)
|
||||
self.scenarios.node_start_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout, poll_interval
|
||||
)
|
||||
self.mock_affected_nodes_status.merge_affected_nodes.assert_called_once()
|
||||
|
||||
@patch('logging.info')
|
||||
def test_helper_node_stop_start_scenario(self, mock_logging):
|
||||
"""Test helper_node_stop_start_scenario calls helper stop and start"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "helper-node"
|
||||
timeout = 300
|
||||
|
||||
self.scenarios.helper_node_stop_scenario = Mock()
|
||||
self.scenarios.helper_node_start_scenario = Mock()
|
||||
|
||||
# Act
|
||||
self.scenarios.helper_node_stop_start_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert
|
||||
self.scenarios.helper_node_stop_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout
|
||||
)
|
||||
self.scenarios.helper_node_start_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout
|
||||
)
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
def test_node_disk_detach_attach_scenario_success(self, mock_logging, mock_sleep):
|
||||
"""Test disk detach/attach scenario with valid disk attachment"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
duration = 60
|
||||
disk_details = {"disk_id": "disk-123", "device": "/dev/sdb"}
|
||||
|
||||
self.scenarios.get_disk_attachment_info = Mock(return_value=disk_details)
|
||||
self.scenarios.disk_detach_scenario = Mock()
|
||||
self.scenarios.disk_attach_scenario = Mock()
|
||||
|
||||
# Act
|
||||
self.scenarios.node_disk_detach_attach_scenario(
|
||||
instance_kill_count, node, timeout, duration
|
||||
)
|
||||
|
||||
# Assert
|
||||
self.scenarios.get_disk_attachment_info.assert_called_once_with(
|
||||
instance_kill_count, node
|
||||
)
|
||||
self.scenarios.disk_detach_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout
|
||||
)
|
||||
mock_sleep.assert_called_once_with(duration)
|
||||
self.scenarios.disk_attach_scenario.assert_called_once_with(
|
||||
instance_kill_count, disk_details, timeout
|
||||
)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('logging.info')
|
||||
def test_node_disk_detach_attach_scenario_no_disk(self, mock_info, mock_error):
|
||||
"""Test disk detach/attach scenario when only root disk exists"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
duration = 60
|
||||
|
||||
self.scenarios.get_disk_attachment_info = Mock(return_value=None)
|
||||
self.scenarios.disk_detach_scenario = Mock()
|
||||
self.scenarios.disk_attach_scenario = Mock()
|
||||
|
||||
# Act
|
||||
self.scenarios.node_disk_detach_attach_scenario(
|
||||
instance_kill_count, node, timeout, duration
|
||||
)
|
||||
|
||||
# Assert
|
||||
self.scenarios.disk_detach_scenario.assert_not_called()
|
||||
self.scenarios.disk_attach_scenario.assert_not_called()
|
||||
mock_error.assert_any_call("Node %s has only root disk attached" % node)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.nodeaction.wait_for_unknown_status')
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
@patch('logging.info')
|
||||
def test_stop_kubelet_scenario_success(self, mock_logging, mock_run, mock_wait):
|
||||
"""Test successful kubelet stop scenario"""
|
||||
# Arrange
|
||||
instance_kill_count = 2
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
mock_affected_node = Mock(spec=AffectedNode)
|
||||
mock_wait.return_value = None
|
||||
|
||||
# Act
|
||||
with patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.AffectedNode') as mock_affected_node_class:
|
||||
mock_affected_node_class.return_value = mock_affected_node
|
||||
self.scenarios.stop_kubelet_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert
|
||||
self.assertEqual(mock_run.call_count, 2)
|
||||
expected_command = "oc debug node/" + node + " -- chroot /host systemctl stop kubelet"
|
||||
mock_run.assert_called_with(expected_command)
|
||||
self.assertEqual(mock_wait.call_count, 2)
|
||||
self.assertEqual(len(self.mock_affected_nodes_status.affected_nodes), 2)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.nodeaction.wait_for_unknown_status')
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
@patch('logging.error')
|
||||
@patch('logging.info')
|
||||
def test_stop_kubelet_scenario_failure(self, mock_info, mock_error, mock_run, mock_wait):
|
||||
"""Test kubelet stop scenario when command fails"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
error_msg = "Command failed"
|
||||
mock_run.side_effect = Exception(error_msg)
|
||||
|
||||
# Act & Assert
|
||||
with self.assertRaises(Exception):
|
||||
with patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.AffectedNode'):
|
||||
self.scenarios.stop_kubelet_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
mock_error.assert_any_call(
|
||||
"Failed to stop the kubelet of the node. Encountered following "
|
||||
"exception: %s. Test Failed" % error_msg
|
||||
)
|
||||
|
||||
@patch('logging.info')
|
||||
def test_stop_start_kubelet_scenario(self, mock_logging):
|
||||
"""Test stop/start kubelet scenario"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
|
||||
self.scenarios.stop_kubelet_scenario = Mock()
|
||||
self.scenarios.node_reboot_scenario = Mock()
|
||||
|
||||
# Act
|
||||
self.scenarios.stop_start_kubelet_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert
|
||||
self.scenarios.stop_kubelet_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout
|
||||
)
|
||||
self.scenarios.node_reboot_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout
|
||||
)
|
||||
self.mock_affected_nodes_status.merge_affected_nodes.assert_called_once()
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.nodeaction.wait_for_ready_status')
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
@patch('logging.info')
|
||||
def test_restart_kubelet_scenario_success(self, mock_logging, mock_run, mock_wait):
|
||||
"""Test successful kubelet restart scenario"""
|
||||
# Arrange
|
||||
instance_kill_count = 2
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
mock_affected_node = Mock(spec=AffectedNode)
|
||||
mock_wait.return_value = None
|
||||
|
||||
# Act
|
||||
with patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.AffectedNode') as mock_affected_node_class:
|
||||
mock_affected_node_class.return_value = mock_affected_node
|
||||
self.scenarios.restart_kubelet_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert
|
||||
self.assertEqual(mock_run.call_count, 2)
|
||||
expected_command = "oc debug node/" + node + " -- chroot /host systemctl restart kubelet &"
|
||||
mock_run.assert_called_with(expected_command)
|
||||
self.assertEqual(mock_wait.call_count, 2)
|
||||
self.assertEqual(len(self.mock_affected_nodes_status.affected_nodes), 2)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.nodeaction.wait_for_ready_status')
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
@patch('logging.error')
|
||||
@patch('logging.info')
|
||||
def test_restart_kubelet_scenario_failure(self, mock_info, mock_error, mock_run, mock_wait):
|
||||
"""Test kubelet restart scenario when command fails"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
error_msg = "Restart failed"
|
||||
mock_run.side_effect = Exception(error_msg)
|
||||
|
||||
# Act & Assert
|
||||
with self.assertRaises(Exception):
|
||||
with patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.AffectedNode'):
|
||||
self.scenarios.restart_kubelet_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
mock_error.assert_any_call(
|
||||
"Failed to restart the kubelet of the node. Encountered following "
|
||||
"exception: %s. Test Failed" % error_msg
|
||||
)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
@patch('logging.info')
|
||||
def test_node_crash_scenario_success(self, mock_logging, mock_run):
|
||||
"""Test successful node crash scenario"""
|
||||
# Arrange
|
||||
instance_kill_count = 2
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
|
||||
# Act
|
||||
result = self.scenarios.node_crash_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert
|
||||
self.assertEqual(mock_run.call_count, 2)
|
||||
expected_command = (
|
||||
"oc debug node/" + node + " -- chroot /host "
|
||||
"dd if=/dev/urandom of=/proc/sysrq-trigger"
|
||||
)
|
||||
mock_run.assert_called_with(expected_command)
|
||||
self.assertIsNone(result)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
@patch('logging.error')
|
||||
@patch('logging.info')
|
||||
def test_node_crash_scenario_failure(self, mock_info, mock_error, mock_run):
|
||||
"""Test node crash scenario when command fails"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
error_msg = "Crash command failed"
|
||||
mock_run.side_effect = Exception(error_msg)
|
||||
|
||||
# Act
|
||||
result = self.scenarios.node_crash_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert
|
||||
self.assertEqual(result, 1)
|
||||
mock_error.assert_any_call(
|
||||
"Failed to crash the node. Encountered following exception: %s. "
|
||||
"Test Failed" % error_msg
|
||||
)
|
||||
|
||||
def test_node_start_scenario_not_implemented(self):
|
||||
"""Test that node_start_scenario returns None (not implemented)"""
|
||||
result = self.scenarios.node_start_scenario(1, "test-node", 300, 10)
|
||||
self.assertIsNone(result)
|
||||
|
||||
def test_node_stop_scenario_not_implemented(self):
|
||||
"""Test that node_stop_scenario returns None (not implemented)"""
|
||||
result = self.scenarios.node_stop_scenario(1, "test-node", 300, 10)
|
||||
self.assertIsNone(result)
|
||||
|
||||
def test_node_termination_scenario_not_implemented(self):
|
||||
"""Test that node_termination_scenario returns None (not implemented)"""
|
||||
result = self.scenarios.node_termination_scenario(1, "test-node", 300, 10)
|
||||
self.assertIsNone(result)
|
||||
|
||||
def test_node_reboot_scenario_not_implemented(self):
|
||||
"""Test that node_reboot_scenario returns None (not implemented)"""
|
||||
result = self.scenarios.node_reboot_scenario(1, "test-node", 300)
|
||||
self.assertIsNone(result)
|
||||
|
||||
def test_node_service_status_not_implemented(self):
|
||||
"""Test that node_service_status returns None (not implemented)"""
|
||||
result = self.scenarios.node_service_status("test-node", "service", "key", 300)
|
||||
self.assertIsNone(result)
|
||||
|
||||
def test_node_block_scenario_not_implemented(self):
|
||||
"""Test that node_block_scenario returns None (not implemented)"""
|
||||
result = self.scenarios.node_block_scenario(1, "test-node", 300, 60)
|
||||
self.assertIsNone(result)
|
||||
|
||||
|
||||
class TestAbstractNodeScenariosIntegration(unittest.TestCase):
|
||||
"""Integration tests for abstract_node_scenarios workflows"""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures before each test method"""
|
||||
self.mock_kubecli = Mock(spec=KrknKubernetes)
|
||||
self.mock_affected_nodes_status = Mock(spec=AffectedNodeStatus)
|
||||
self.mock_affected_nodes_status.affected_nodes = []
|
||||
|
||||
self.scenarios = abstract_node_scenarios(
|
||||
kubecli=self.mock_kubecli,
|
||||
node_action_kube_check=True,
|
||||
affected_nodes_status=self.mock_affected_nodes_status
|
||||
)
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.nodeaction.wait_for_unknown_status')
|
||||
@patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.runcommand.run')
|
||||
def test_complete_stop_start_kubelet_workflow(self, mock_run, mock_wait, mock_sleep):
|
||||
"""Test complete workflow of stop/start kubelet scenario"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
|
||||
self.scenarios.node_reboot_scenario = Mock()
|
||||
|
||||
# Act
|
||||
with patch('krkn.scenario_plugins.node_actions.abstract_node_scenarios.AffectedNode'):
|
||||
self.scenarios.stop_start_kubelet_scenario(instance_kill_count, node, timeout)
|
||||
|
||||
# Assert - verify stop kubelet was called
|
||||
expected_stop_command = "oc debug node/" + node + " -- chroot /host systemctl stop kubelet"
|
||||
mock_run.assert_any_call(expected_stop_command)
|
||||
|
||||
# Verify reboot was called
|
||||
self.scenarios.node_reboot_scenario.assert_called_once_with(
|
||||
instance_kill_count, node, timeout
|
||||
)
|
||||
|
||||
# Verify merge was called
|
||||
self.mock_affected_nodes_status.merge_affected_nodes.assert_called_once()
|
||||
|
||||
@patch('time.sleep')
|
||||
def test_node_stop_start_scenario_workflow(self, mock_sleep):
|
||||
"""Test complete workflow of node stop/start scenario"""
|
||||
# Arrange
|
||||
instance_kill_count = 1
|
||||
node = "test-node"
|
||||
timeout = 300
|
||||
duration = 60
|
||||
poll_interval = 10
|
||||
|
||||
self.scenarios.node_stop_scenario = Mock()
|
||||
self.scenarios.node_start_scenario = Mock()
|
||||
|
||||
# Act
|
||||
self.scenarios.node_stop_start_scenario(
|
||||
instance_kill_count, node, timeout, duration, poll_interval
|
||||
)
|
||||
|
||||
# Assert - verify order of operations
|
||||
call_order = []
|
||||
|
||||
# Verify stop was called first
|
||||
self.scenarios.node_stop_scenario.assert_called_once()
|
||||
|
||||
# Verify sleep was called
|
||||
mock_sleep.assert_called_once_with(duration)
|
||||
|
||||
# Verify start was called after sleep
|
||||
self.scenarios.node_start_scenario.assert_called_once()
|
||||
|
||||
# Verify merge was called
|
||||
self.mock_affected_nodes_status.merge_affected_nodes.assert_called_once()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
680
tests/test_alibaba_node_scenarios.py
Normal file
680
tests/test_alibaba_node_scenarios.py
Normal file
@@ -0,0 +1,680 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
"""
|
||||
Test suite for alibaba_node_scenarios class
|
||||
|
||||
Usage:
|
||||
python -m coverage run -a -m unittest tests/test_alibaba_node_scenarios.py -v
|
||||
|
||||
Assisted By: Claude Code
|
||||
"""
|
||||
|
||||
import unittest
|
||||
from unittest.mock import MagicMock, Mock, patch, PropertyMock, call
|
||||
import logging
|
||||
import json
|
||||
|
||||
from krkn_lib.k8s import KrknKubernetes
|
||||
from krkn_lib.models.k8s import AffectedNode, AffectedNodeStatus
|
||||
|
||||
from krkn.scenario_plugins.node_actions.alibaba_node_scenarios import Alibaba, alibaba_node_scenarios
|
||||
|
||||
|
||||
class TestAlibaba(unittest.TestCase):
|
||||
"""Test suite for Alibaba class"""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures"""
|
||||
# Mock environment variables
|
||||
self.env_patcher = patch.dict('os.environ', {
|
||||
'ALIBABA_ID': 'test-access-key',
|
||||
'ALIBABA_SECRET': 'test-secret-key',
|
||||
'ALIBABA_REGION_ID': 'cn-hangzhou'
|
||||
})
|
||||
self.env_patcher.start()
|
||||
|
||||
def tearDown(self):
|
||||
"""Clean up after tests"""
|
||||
self.env_patcher.stop()
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_alibaba_init_success(self, mock_acs_client, mock_logging):
|
||||
"""Test Alibaba class initialization"""
|
||||
mock_client = Mock()
|
||||
mock_acs_client.return_value = mock_client
|
||||
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_acs_client.assert_called_once_with('test-access-key', 'test-secret-key', 'cn-hangzhou')
|
||||
self.assertEqual(alibaba.compute_client, mock_client)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_alibaba_init_failure(self, mock_acs_client, mock_logging):
|
||||
"""Test Alibaba initialization handles errors"""
|
||||
mock_acs_client.side_effect = Exception("Credential error")
|
||||
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("Initializing alibaba", str(mock_logging.call_args))
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_send_request_success(self, mock_acs_client):
|
||||
"""Test _send_request successfully sends request"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_request = Mock()
|
||||
mock_response = {'Instances': {'Instance': []}}
|
||||
alibaba.compute_client.do_action.return_value = json.dumps(mock_response).encode('utf-8')
|
||||
|
||||
result = alibaba._send_request(mock_request)
|
||||
|
||||
mock_request.set_accept_format.assert_called_once_with('json')
|
||||
alibaba.compute_client.do_action.assert_called_once_with(mock_request)
|
||||
self.assertEqual(result, mock_response)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_send_request_failure(self, mock_acs_client, mock_logging):
|
||||
"""Test _send_request handles errors"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_request = Mock()
|
||||
alibaba.compute_client.do_action.side_effect = Exception("API error")
|
||||
|
||||
# The actual code has a bug in the format string (%S instead of %s)
|
||||
# So we expect this to raise a ValueError
|
||||
with self.assertRaises(ValueError):
|
||||
alibaba._send_request(mock_request)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_list_instances_success(self, mock_acs_client):
|
||||
"""Test list_instances returns instance list"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_instances = [
|
||||
{'InstanceId': 'i-123', 'InstanceName': 'node1'},
|
||||
{'InstanceId': 'i-456', 'InstanceName': 'node2'}
|
||||
]
|
||||
mock_response = {'Instances': {'Instance': mock_instances}}
|
||||
alibaba.compute_client.do_action.return_value = json.dumps(mock_response).encode('utf-8')
|
||||
|
||||
result = alibaba.list_instances()
|
||||
|
||||
self.assertEqual(result, mock_instances)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_list_instances_no_instances_key(self, mock_acs_client, mock_logging):
|
||||
"""Test list_instances handles missing Instances key"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_response = {'SomeOtherKey': 'value'}
|
||||
alibaba.compute_client.do_action.return_value = json.dumps(mock_response).encode('utf-8')
|
||||
|
||||
with self.assertRaises(RuntimeError):
|
||||
alibaba.list_instances()
|
||||
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_list_instances_none_response(self, mock_acs_client):
|
||||
"""Test list_instances handles None response"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(return_value=None)
|
||||
|
||||
result = alibaba.list_instances()
|
||||
|
||||
self.assertEqual(result, [])
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_list_instances_exception(self, mock_acs_client, mock_logging):
|
||||
"""Test list_instances handles exceptions"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(side_effect=Exception("Network error"))
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
alibaba.list_instances()
|
||||
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_get_instance_id_found(self, mock_acs_client):
|
||||
"""Test get_instance_id when instance is found"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_instances = [
|
||||
{'InstanceId': 'i-123', 'InstanceName': 'test-node'},
|
||||
{'InstanceId': 'i-456', 'InstanceName': 'other-node'}
|
||||
]
|
||||
alibaba.list_instances = Mock(return_value=mock_instances)
|
||||
|
||||
result = alibaba.get_instance_id('test-node')
|
||||
|
||||
self.assertEqual(result, 'i-123')
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_get_instance_id_not_found(self, mock_acs_client, mock_logging):
|
||||
"""Test get_instance_id when instance is not found"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.list_instances = Mock(return_value=[])
|
||||
|
||||
with self.assertRaises(RuntimeError):
|
||||
alibaba.get_instance_id('nonexistent-node')
|
||||
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("Couldn't find vm", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_start_instances_success(self, mock_acs_client, mock_logging):
|
||||
"""Test start_instances successfully starts instance"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(return_value={'RequestId': 'req-123'})
|
||||
|
||||
alibaba.start_instances('i-123')
|
||||
|
||||
alibaba._send_request.assert_called_once()
|
||||
mock_logging.assert_called()
|
||||
call_str = str(mock_logging.call_args_list)
|
||||
self.assertTrue('started' in call_str or 'submit successfully' in call_str)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_start_instances_failure(self, mock_acs_client, mock_logging):
|
||||
"""Test start_instances handles failure"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(side_effect=Exception("Start failed"))
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
alibaba.start_instances('i-123')
|
||||
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("Failed to start", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_stop_instances_success(self, mock_acs_client, mock_logging):
|
||||
"""Test stop_instances successfully stops instance"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(return_value={'RequestId': 'req-123'})
|
||||
|
||||
alibaba.stop_instances('i-123', force_stop=True)
|
||||
|
||||
alibaba._send_request.assert_called_once()
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_stop_instances_failure(self, mock_acs_client, mock_logging):
|
||||
"""Test stop_instances handles failure"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(side_effect=Exception("Stop failed"))
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
alibaba.stop_instances('i-123')
|
||||
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("Failed to stop", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_release_instance_success(self, mock_acs_client, mock_logging):
|
||||
"""Test release_instance successfully releases instance"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(return_value={'RequestId': 'req-123'})
|
||||
|
||||
alibaba.release_instance('i-123', force_release=True)
|
||||
|
||||
alibaba._send_request.assert_called_once()
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("released", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_release_instance_failure(self, mock_acs_client, mock_logging):
|
||||
"""Test release_instance handles failure"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(side_effect=Exception("Release failed"))
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
alibaba.release_instance('i-123')
|
||||
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("Failed to terminate", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_reboot_instances_success(self, mock_acs_client, mock_logging):
|
||||
"""Test reboot_instances successfully reboots instance"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(return_value={'RequestId': 'req-123'})
|
||||
|
||||
alibaba.reboot_instances('i-123', force_reboot=True)
|
||||
|
||||
alibaba._send_request.assert_called_once()
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("rebooted", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_reboot_instances_failure(self, mock_acs_client, mock_logging):
|
||||
"""Test reboot_instances handles failure"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(side_effect=Exception("Reboot failed"))
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
alibaba.reboot_instances('i-123')
|
||||
|
||||
mock_logging.assert_called()
|
||||
self.assertIn("Failed to reboot", str(mock_logging.call_args))
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_get_vm_status_success(self, mock_acs_client, mock_logging):
|
||||
"""Test get_vm_status returns instance status"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_response = {
|
||||
'Instances': {
|
||||
'Instance': [{'Status': 'Running'}]
|
||||
}
|
||||
}
|
||||
alibaba._send_request = Mock(return_value=mock_response)
|
||||
|
||||
result = alibaba.get_vm_status('i-123')
|
||||
|
||||
self.assertEqual(result, 'Running')
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_get_vm_status_no_instances(self, mock_acs_client, mock_logging):
|
||||
"""Test get_vm_status when no instances found"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
mock_response = {
|
||||
'Instances': {
|
||||
'Instance': []
|
||||
}
|
||||
}
|
||||
alibaba._send_request = Mock(return_value=mock_response)
|
||||
|
||||
result = alibaba.get_vm_status('i-123')
|
||||
|
||||
self.assertIsNone(result)
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_get_vm_status_none_response(self, mock_acs_client, mock_logging):
|
||||
"""Test get_vm_status with None response"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(return_value=None)
|
||||
|
||||
result = alibaba.get_vm_status('i-123')
|
||||
|
||||
self.assertEqual(result, 'Unknown')
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_get_vm_status_exception(self, mock_acs_client, mock_logging):
|
||||
"""Test get_vm_status handles exceptions"""
|
||||
alibaba = Alibaba()
|
||||
alibaba._send_request = Mock(side_effect=Exception("API error"))
|
||||
|
||||
result = alibaba.get_vm_status('i-123')
|
||||
|
||||
self.assertIsNone(result)
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_running_success(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_running waits for instance to be running"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(side_effect=['Starting', 'Running'])
|
||||
mock_affected_node = Mock(spec=AffectedNode)
|
||||
|
||||
result = alibaba.wait_until_running('i-123', 300, mock_affected_node)
|
||||
|
||||
self.assertTrue(result)
|
||||
mock_affected_node.set_affected_node_status.assert_called_once()
|
||||
args = mock_affected_node.set_affected_node_status.call_args[0]
|
||||
self.assertEqual(args[0], 'running')
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_running_timeout(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_running returns False on timeout"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(return_value='Starting')
|
||||
|
||||
result = alibaba.wait_until_running('i-123', 10, None)
|
||||
|
||||
self.assertFalse(result)
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_stopped_success(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_stopped waits for instance to be stopped"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(side_effect=['Stopping', 'Stopped'])
|
||||
mock_affected_node = Mock(spec=AffectedNode)
|
||||
|
||||
result = alibaba.wait_until_stopped('i-123', 300, mock_affected_node)
|
||||
|
||||
self.assertTrue(result)
|
||||
mock_affected_node.set_affected_node_status.assert_called_once()
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_stopped_timeout(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_stopped returns False on timeout"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(return_value='Stopping')
|
||||
|
||||
result = alibaba.wait_until_stopped('i-123', 10, None)
|
||||
|
||||
self.assertFalse(result)
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_released_success(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_released waits for instance to be released"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(side_effect=['Deleting', 'Released'])
|
||||
mock_affected_node = Mock(spec=AffectedNode)
|
||||
|
||||
result = alibaba.wait_until_released('i-123', 300, mock_affected_node)
|
||||
|
||||
self.assertTrue(result)
|
||||
mock_affected_node.set_affected_node_status.assert_called_once()
|
||||
args = mock_affected_node.set_affected_node_status.call_args[0]
|
||||
self.assertEqual(args[0], 'terminated')
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_released_timeout(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_released returns False on timeout"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(return_value='Deleting')
|
||||
|
||||
result = alibaba.wait_until_released('i-123', 10, None)
|
||||
|
||||
self.assertFalse(result)
|
||||
|
||||
@patch('time.sleep')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.AcsClient')
|
||||
def test_wait_until_released_none_status(self, mock_acs_client, mock_logging, mock_sleep):
|
||||
"""Test wait_until_released when status becomes None"""
|
||||
alibaba = Alibaba()
|
||||
|
||||
alibaba.get_vm_status = Mock(side_effect=['Deleting', None])
|
||||
mock_affected_node = Mock(spec=AffectedNode)
|
||||
|
||||
result = alibaba.wait_until_released('i-123', 300, mock_affected_node)
|
||||
|
||||
self.assertTrue(result)
|
||||
|
||||
|
||||
class TestAlibabaNodeScenarios(unittest.TestCase):
|
||||
"""Test suite for alibaba_node_scenarios class"""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures"""
|
||||
self.env_patcher = patch.dict('os.environ', {
|
||||
'ALIBABA_ID': 'test-access-key',
|
||||
'ALIBABA_SECRET': 'test-secret-key',
|
||||
'ALIBABA_REGION_ID': 'cn-hangzhou'
|
||||
})
|
||||
self.env_patcher.start()
|
||||
|
||||
self.mock_kubecli = Mock(spec=KrknKubernetes)
|
||||
self.affected_nodes_status = AffectedNodeStatus()
|
||||
|
||||
def tearDown(self):
|
||||
"""Clean up after tests"""
|
||||
self.env_patcher.stop()
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_init(self, mock_alibaba_class, mock_logging):
|
||||
"""Test alibaba_node_scenarios initialization"""
|
||||
mock_alibaba_instance = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba_instance
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, True, self.affected_nodes_status)
|
||||
|
||||
self.assertEqual(scenarios.kubecli, self.mock_kubecli)
|
||||
self.assertTrue(scenarios.node_action_kube_check)
|
||||
self.assertEqual(scenarios.alibaba, mock_alibaba_instance)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_start_scenario_success(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_start_scenario successfully starts node"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.wait_until_running.return_value = True
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, True, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_start_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_alibaba.get_instance_id.assert_called_once_with('test-node')
|
||||
mock_alibaba.start_instances.assert_called_once_with('i-123')
|
||||
mock_alibaba.wait_until_running.assert_called_once()
|
||||
mock_nodeaction.wait_for_ready_status.assert_called_once()
|
||||
self.assertEqual(len(self.affected_nodes_status.affected_nodes), 1)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_start_scenario_no_kube_check(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_start_scenario without Kubernetes check"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.wait_until_running.return_value = True
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_start_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_alibaba.start_instances.assert_called_once()
|
||||
mock_nodeaction.wait_for_ready_status.assert_not_called()
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_start_scenario_failure(self, mock_alibaba_class, mock_logging):
|
||||
"""Test node_start_scenario handles failure"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.start_instances.side_effect = Exception('Start failed')
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
scenarios.node_start_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_start_scenario_multiple_runs(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_start_scenario with multiple runs"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.wait_until_running.return_value = True
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, True, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_start_scenario(3, 'test-node', 300, 15)
|
||||
|
||||
self.assertEqual(mock_alibaba.start_instances.call_count, 3)
|
||||
self.assertEqual(len(self.affected_nodes_status.affected_nodes), 3)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_stop_scenario_success(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_stop_scenario successfully stops node"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.wait_until_stopped.return_value = True
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, True, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_stop_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_alibaba.get_instance_id.assert_called_once_with('test-node')
|
||||
mock_alibaba.stop_instances.assert_called_once_with('i-123')
|
||||
mock_alibaba.wait_until_stopped.assert_called_once()
|
||||
mock_nodeaction.wait_for_unknown_status.assert_called_once()
|
||||
self.assertEqual(len(self.affected_nodes_status.affected_nodes), 1)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_stop_scenario_failure(self, mock_alibaba_class, mock_logging):
|
||||
"""Test node_stop_scenario handles failure"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.stop_instances.side_effect = Exception('Stop failed')
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
scenarios.node_stop_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_termination_scenario_success(self, mock_alibaba_class, mock_logging):
|
||||
"""Test node_termination_scenario successfully terminates node"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.wait_until_stopped.return_value = True
|
||||
mock_alibaba.wait_until_released.return_value = True
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_termination_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_alibaba.stop_instances.assert_called_once_with('i-123')
|
||||
mock_alibaba.wait_until_stopped.assert_called_once()
|
||||
mock_alibaba.release_instance.assert_called_once_with('i-123')
|
||||
mock_alibaba.wait_until_released.assert_called_once()
|
||||
self.assertEqual(len(self.affected_nodes_status.affected_nodes), 1)
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_termination_scenario_failure(self, mock_alibaba_class, mock_logging):
|
||||
"""Test node_termination_scenario handles failure"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.stop_instances.side_effect = Exception('Stop failed')
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
scenarios.node_termination_scenario(1, 'test-node', 300, 15)
|
||||
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_reboot_scenario_success(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_reboot_scenario successfully reboots node"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, True, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_reboot_scenario(1, 'test-node', 300, soft_reboot=False)
|
||||
|
||||
mock_alibaba.reboot_instances.assert_called_once_with('i-123')
|
||||
mock_nodeaction.wait_for_unknown_status.assert_called_once()
|
||||
mock_nodeaction.wait_for_ready_status.assert_called_once()
|
||||
self.assertEqual(len(self.affected_nodes_status.affected_nodes), 1)
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_reboot_scenario_no_kube_check(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_reboot_scenario without Kubernetes check"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_reboot_scenario(1, 'test-node', 300)
|
||||
|
||||
mock_alibaba.reboot_instances.assert_called_once()
|
||||
mock_nodeaction.wait_for_unknown_status.assert_not_called()
|
||||
mock_nodeaction.wait_for_ready_status.assert_not_called()
|
||||
|
||||
@patch('logging.error')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_reboot_scenario_failure(self, mock_alibaba_class, mock_logging):
|
||||
"""Test node_reboot_scenario handles failure"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
mock_alibaba.reboot_instances.side_effect = Exception('Reboot failed')
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, False, self.affected_nodes_status)
|
||||
|
||||
with self.assertRaises(Exception):
|
||||
scenarios.node_reboot_scenario(1, 'test-node', 300)
|
||||
|
||||
mock_logging.assert_called()
|
||||
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.nodeaction')
|
||||
@patch('logging.info')
|
||||
@patch('krkn.scenario_plugins.node_actions.alibaba_node_scenarios.Alibaba')
|
||||
def test_node_reboot_scenario_multiple_runs(self, mock_alibaba_class, mock_logging, mock_nodeaction):
|
||||
"""Test node_reboot_scenario with multiple runs"""
|
||||
mock_alibaba = Mock()
|
||||
mock_alibaba_class.return_value = mock_alibaba
|
||||
mock_alibaba.get_instance_id.return_value = 'i-123'
|
||||
|
||||
scenarios = alibaba_node_scenarios(self.mock_kubecli, True, self.affected_nodes_status)
|
||||
|
||||
scenarios.node_reboot_scenario(2, 'test-node', 300)
|
||||
|
||||
self.assertEqual(mock_alibaba.reboot_instances.call_count, 2)
|
||||
self.assertEqual(len(self.affected_nodes_status.affected_nodes), 2)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user