Compare commits

..

14 Commits

Author SHA1 Message Date
Tullio Sebastiani
0aac6119b0 hotfix: krkn-lib update (#709)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-07 08:22:31 -04:00
Tullio Sebastiani
7e5bdfd5cf disabled elastic (#708)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-04 12:42:34 -04:00
Tullio Sebastiani
3c207ab2ea hotfix: krkn-lib update (#706)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-04 11:11:20 -04:00
Tullio Sebastiani
d91172d9b2 Core Refactoring, Krkn Scenario Plugin API (#694)
* relocated shared libraries from `kraken` to `krkn` folder

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* AbstractScenarioPlugin and ScenarioPluginFactory

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* application_outage porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* arcaflow_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* managedcluster_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* network_chaos porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* node_actions porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* plugin_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pvc_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* service_disruption porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* service_hijacking porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* cluster_shut_down_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* syn_flood porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* time_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* zone_outages porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* ScenarioPluginFactory tests

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* unit tests update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pod_scenarios and post actions deprecated

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

scenarios post_actions

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* funtests and config update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* run_krkn.py update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* utils porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* API Documentation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* container_scenarios porting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* funtest fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* document gif update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* Documentation + tests update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed example plugin

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* global renaming

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

test fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* config.yaml typos

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typos

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* removed `plugin_scenarios` from NativeScenarioPlugin class

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* pod_network_scenarios type added

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* documentation update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

typo

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-03 20:48:04 +02:00
Tullio Sebastiani
a13fb43d94 krkn-lib updated v3.1.2
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-03 09:44:20 -04:00
Tullio Sebastiani
37ee7177bc krkn-lib update to support VirtualMachine count (#704)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-03 10:38:44 +02:00
Tullio Sebastiani
32142cc159 CVEs fix (#698)
* golang cves fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* arcaflow update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-09-20 08:33:41 -04:00
Paige Patton
34bfc0d3d9 Adding aws bare metal (#695)
* adding aws bare metal

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

* no found reservations

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

---------

Co-authored-by: Auto User <auto@users.noreply.github.com>
2024-09-18 13:55:58 -04:00
Tullio Sebastiani
736c90e937 Namespaced cluster events and logs integration (#690)
* namespaced events integration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* namespaced logs  implementation

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

namespaced logs plugin scenario

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

namespaced logs integration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* logs collection fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib 3.1.0 update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-09-12 11:54:57 +02:00
Naga Ravi Chaitanya Elluri
5e7938ba4a Update default configuration pointer for the node scenarios (#693)
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-09-09 22:10:25 -04:00
Paige Patton
b525f83261 restart kubelet (#688)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>
2024-09-09 21:57:53 -04:00
Paige Patton
26460a0dce Adding elastic set to none (#691)
* adding elastic set to none

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>

* too many ls

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

---------

Signed-off-by: Auto User <auto@users.noreply.github.com>
Co-authored-by: Auto User <auto@users.noreply.github.com>
2024-09-05 16:05:19 -04:00
dependabot[bot]
7968c2a776 Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-09-03 23:03:39 -04:00
Tullio Sebastiani
6186555c15 Elastic search krkn-lib integration (#658)
* Elastic search krkn-lib integration

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

removed default urls

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* Fix alerts bug on prometheus

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* fixed prometheus object initialization bug

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* updated requirements to krkn-lib 2.1.8

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* disabled alerts and metrics by default

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* reverted requirement to elastic branch on krkn-lib

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* numpy downgrade

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* maximum retries added to hijacking funtest

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* added elastic settings to funtest config

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* krkn-lib 3.0.0 update

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-08-28 10:46:42 -04:00
158 changed files with 5630 additions and 4515 deletions


@@ -169,7 +169,7 @@ jobs:
path: krkn-lib-docs
ssh-key: ${{ secrets.KRKN_LIB_DOCS_PRIV_KEY }}
- name: Download json coverage
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4.1.7
with:
name: coverage.json
- name: Set up Python


@@ -50,3 +50,15 @@ telemetry:
oc_cli_path: /usr/bin/oc # optional, if not specified will be search in $PATH
events_backup: True # enables/disables cluster events collection
telemetry_group: "funtests"
elastic:
enable_elastic: False
collect_metrics: False
collect_alerts: False
verify_certs: False
elastic_url: "https://192.168.39.196" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
elastic_port: 32766
username: "elastic"
password: "test"
metrics_index: "krkn-metrics"
alerts_index: "krkn-alerts"
telemetry_index: "krkn-telemetry"


@@ -10,7 +10,7 @@ function functional_test_app_outage {
yq -i '.application_outage.duration=10' scenarios/openshift/app_outage.yaml
yq -i '.application_outage.pod_selector={"scenario":"outage"}' scenarios/openshift/app_outage.yaml
yq -i '.application_outage.namespace="default"' scenarios/openshift/app_outage.yaml
export scenario_type="application_outages"
export scenario_type="application_outages_scenarios"
export scenario_file="scenarios/openshift/app_outage.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/app_outage.yaml


@@ -7,9 +7,9 @@ trap finish EXIT
function functional_test_arca_cpu_hog {
yq -i '.input_list[0].node_selector={"kubernetes.io/hostname":"kind-worker2"}' scenarios/arcaflow/cpu-hog/input.yaml
export scenario_type="arcaflow_scenarios"
export scenario_file="scenarios/arcaflow/cpu-hog/input.yaml"
yq -i '.input_list[0].node_selector={"kubernetes.io/hostname":"kind-worker2"}' scenarios/kube/cpu-hog/input.yaml
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/cpu-hog/input.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/arca_cpu_hog.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/arca_cpu_hog.yaml


@@ -7,9 +7,9 @@ trap finish EXIT
function functional_test_arca_io_hog {
yq -i '.input_list[0].node_selector={"kubernetes.io/hostname":"kind-worker2"}' scenarios/arcaflow/io-hog/input.yaml
export scenario_type="arcaflow_scenarios"
export scenario_file="scenarios/arcaflow/io-hog/input.yaml"
yq -i '.input_list[0].node_selector={"kubernetes.io/hostname":"kind-worker2"}' scenarios/kube/io-hog/input.yaml
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/io-hog/input.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/arca_io_hog.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/arca_io_hog.yaml


@@ -7,9 +7,9 @@ trap finish EXIT
function functional_test_arca_memory_hog {
yq -i '.input_list[0].node_selector={"kubernetes.io/hostname":"kind-worker2"}' scenarios/arcaflow/memory-hog/input.yaml
export scenario_type="arcaflow_scenarios"
export scenario_file="scenarios/arcaflow/memory-hog/input.yaml"
yq -i '.input_list[0].node_selector={"kubernetes.io/hostname":"kind-worker2"}' scenarios/kube/memory-hog/input.yaml
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/memory-hog/input.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/arca_memory_hog.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/arca_memory_hog.yaml


@@ -12,7 +12,7 @@ function functional_test_container_crash {
yq -i '.scenarios[0].label_selector="scenario=container"' scenarios/openshift/container_etcd.yml
yq -i '.scenarios[0].container_name="fedtools"' scenarios/openshift/container_etcd.yml
export scenario_type="container_scenarios"
export scenario_file="- scenarios/openshift/container_etcd.yml"
export scenario_file="scenarios/openshift/container_etcd.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/container_config.yaml


@@ -6,8 +6,8 @@ trap error ERR
trap finish EXIT
function funtional_test_namespace_deletion {
export scenario_type="namespace_scenarios"
export scenario_file="- scenarios/openshift/ingress_namespace.yaml"
export scenario_type="service_disruption_scenarios"
export scenario_file="scenarios/openshift/ingress_namespace.yaml"
export post_config=""
yq '.scenarios[0].namespace="^namespace-scenario$"' -i scenarios/openshift/ingress_namespace.yaml
yq '.scenarios[0].wait_time=30' -i scenarios/openshift/ingress_namespace.yaml


@@ -15,7 +15,7 @@ function functional_test_network_chaos {
yq -i 'del(.network_chaos.egress.latency)' scenarios/openshift/network_chaos.yaml
yq -i 'del(.network_chaos.egress.loss)' scenarios/openshift/network_chaos.yaml
export scenario_type="network_chaos"
export scenario_type="network_chaos_scenarios"
export scenario_file="scenarios/openshift/network_chaos.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/network_chaos.yaml


@@ -35,14 +35,21 @@ TEXT_MIME="text/plain; charset=utf-8"
function functional_test_service_hijacking {
export scenario_type="service_hijacking"
export scenario_type="service_hijacking_scenarios"
export scenario_file="scenarios/kube/service_hijacking.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/service_hijacking.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/service_hijacking.yaml > /dev/null 2>&1 &
PID=$!
#Waiting the hijacking to have effect
while [ `curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php` == 404 ]; do echo "waiting scenario to kick in."; sleep 1; done;
COUNTER=0
while [ `curl -X GET -s -o /dev/null -I -w "%{http_code}" $SERVICE_URL/list/index.php` == 404 ]
do
echo "waiting scenario to kick in."
sleep 1
COUNTER=$((COUNTER+1))
[ $COUNTER -eq "100" ] && echo "maximum number of retry reached, test failed" && exit 1
done
#Checking Step 1 GET on /list/index.php
OUT_GET="`curl -X GET -s $SERVICE_URL/list/index.php`"


@@ -18,8 +18,8 @@ function functional_test_telemetry {
yq -i '.performance_monitoring.prometheus_url="http://localhost:9090"' CI/config/common_test_config.yaml
yq -i '.telemetry.run_tag=env(RUN_TAG)' CI/config/common_test_config.yaml
export scenario_type="arcaflow_scenarios"
export scenario_file="scenarios/arcaflow/cpu-hog/input.yaml"
export scenario_type="hog_scenarios"
export scenario_file="scenarios/kube/cpu-hog/input.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/telemetry.yaml
retval=$(python3 -m coverage run -a run_kraken.py -c CI/config/telemetry.yaml)


@@ -119,6 +119,12 @@ If adding a new scenario or tweaking the main config, be sure to add in updates
Please read [this file](CI/README.md#adding-a-test-case) for more information on updates.
### Scenario Plugin Development
If you're gearing up to develop new scenarios, take a moment to review our
[Scenario Plugin API Documentation](docs/scenario_plugin_api.md).
It's the perfect starting point to tap into your chaotic creativity!
### Community
Key Members(slack_usernames/full name): paigerube14/Paige Rubendall, mffiedler/Mike Fiedler, tsebasti/Tullio Sebastiani, yogi/Yogananth Subramanian, sahil/Sahil Shah, pradeep/Pradeep Surisetty and ravielluri/Naga Ravi Chaitanya Elluri.
* [**#krkn on Kubernetes Slack**](https://kubernetes.slack.com/messages/C05SFMHRWK1)


@@ -1,6 +1,6 @@
kraken:
distribution: kubernetes # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
@@ -8,43 +8,46 @@ kraken:
port: 8081 # Signal port
chaos_scenarios:
# List of policies/chaos scenarios to load
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
- scenarios/arcaflow/memory-hog/input.yaml
- scenarios/arcaflow/io-hog/input.yaml
- application_outages:
- hog_scenarios:
- scenarios/kube/cpu-hog/input.yaml
- scenarios/kube/memory-hog/input.yaml
- scenarios/kube/io-hog/input.yaml
- scenarios/kube/io-hog/input.yaml
- application_outages_scenarios:
- scenarios/openshift/app_outage.yaml
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/container_etcd.yml
- plugin_scenarios:
- scenarios/openshift/container_etcd.yml
- pod_network_scenarios:
- scenarios/openshift/network_chaos_ingress.yml
- scenarios/openshift/pod_network_outage.yml
- pod_disruption_scenarios:
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/vmware_node_scenarios.yml
- scenarios/openshift/network_chaos_ingress.yml
- scenarios/openshift/prom_kill.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- vmware_node_scenarios:
- scenarios/openshift/vmware_node_scenarios.yml
- ibmcloud_node_scenarios:
- scenarios/openshift/ibmcloud_node_scenarios.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/aws_node_scenarios.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- cluster_shut_down_scenarios:
- - scenarios/openshift/cluster_shut_down_scenario.yml
- scenarios/openshift/post_action_shut_down.py
- scenarios/openshift/cluster_shut_down_scenario.yml
- service_disruption_scenarios:
- - scenarios/openshift/regex_namespace.yaml
- - scenarios/openshift/ingress_namespace.yaml
- scenarios/openshift/post_action_namespace.py
- zone_outages:
- scenarios/openshift/regex_namespace.yaml
- scenarios/openshift/ingress_namespace.yaml
- zone_outages_scenarios:
- scenarios/openshift/zone_outage.yaml
- pvc_scenarios:
- scenarios/openshift/pvc_scenario.yaml
- network_chaos:
- network_chaos_scenarios:
- scenarios/openshift/network_chaos.yaml
- service_hijacking:
- service_hijacking_scenarios:
- scenarios/kube/service_hijacking.yaml
- syn_flood:
- syn_flood_scenarios:
- scenarios/kube/syn_flood.yaml
cerberus:
@@ -55,12 +58,27 @@ cerberus:
performance_monitoring:
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_url: '' # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
enable_metrics: False
alert_profile: config/alerts.yaml # Path or URL to alert profile with the prometheus queries
metrics_profile: config/metrics.yaml
check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos
elastic:
enable_elastic: False
collect_metrics: False
collect_alerts: False
verify_certs: False
elastic_url: "" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
elastic_port: 32766
username: "elastic"
password: "test"
metrics_index: "krkn-metrics"
alerts_index: "krkn-alerts"
telemetry_index: "krkn-telemetry"
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios
@@ -94,9 +112,7 @@ telemetry:
- "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+" # 2023-09-15T11:20:36.123425532Z log
oc_cli_path: /usr/bin/oc # optional, if not specified will be search in $PATH
events_backup: True # enables/disables cluster events collection
elastic:
elastic_url: "" # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
elastic_index: "" # Elastic search index pattern to post results to


@@ -6,7 +6,7 @@ kraken:
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
chaos_scenarios: # List of policies/chaos scenarios to load
chaos_scenarios: # List of policies/chaos scenarios to load
- plugin_scenarios:
- scenarios/kind/scheduler.yml
- node_scenarios:


@@ -7,7 +7,7 @@ kraken:
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
chaos_scenarios: # List of policies/chaos scenarios to load
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/kube/container_dns.yml
- scenarios/kube/container_dns.yml
- plugin_scenarios:
- scenarios/kube/scheduler.yml


@@ -12,15 +12,14 @@ kraken:
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/prom_kill.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- scenarios/openshift/node_scenarios_example.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- cluster_shut_down_scenarios:
- - scenarios/openshift/cluster_shut_down_scenario.yml
- scenarios/openshift/post_action_shut_down.py
- scenarios/openshift/cluster_shut_down_scenario.yml
- service_disruption_scenarios:
- scenarios/openshift/regex_namespace.yaml
- scenarios/openshift/ingress_namespace.yaml


@@ -1,13 +1,14 @@
# oc build
FROM golang:1.22.4 AS oc-build
FROM golang:1.22.5 AS oc-build
RUN apt-get update && apt-get install -y --no-install-recommends libkrb5-dev
WORKDIR /tmp
RUN git clone --branch release-4.18 https://github.com/openshift/oc.git
WORKDIR /tmp/oc
RUN go mod edit -go 1.22.3 &&\
RUN go mod edit -go 1.22.5 &&\
go get github.com/moby/buildkit@v0.12.5 &&\
go get github.com/containerd/containerd@v1.7.11&&\
go get github.com/docker/docker@v25.0.5&&\
go get github.com/docker/docker@v25.0.6&&\
go get github.com/opencontainers/runc@v1.1.14&&\
go mod tidy && go mod vendor
RUN make GO_REQUIRED_MIN_VERSION:= oc


@@ -9,8 +9,9 @@ The following node chaos scenarios are supported:
5. **node_reboot_scenario**: Scenario to reboot the node instance.
6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
7. **stop_start_kubelet_scenario**: Scenario to stop and start the kubelet of the node instance.
8. **node_crash_scenario**: Scenario to crash the node instance.
9. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
8. **restart_kubelet_scenario**: Scenario to restart the kubelet of the node instance.
9. **node_crash_scenario**: Scenario to crash the node instance.
10. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
**NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.

docs/scenario_plugin_api.md (new file, 136 lines)

@@ -0,0 +1,136 @@
# Scenario Plugin API:
This API enables seamless integration of Scenario Plugins for Krkn. Plugins are automatically
detected and loaded by the plugin loader, provided they extend the `AbstractScenarioPlugin`
abstract class, implement the required methods, and adhere to the specified [naming conventions](#naming-conventions).
## Plugin folder:
The plugin loader automatically loads plugins found in the `krkn/scenario_plugins` directory,
relative to the Krkn root folder. Each plugin must reside in its own directory and can consist
of one or more Python files. The entry point for each plugin is a Python class that extends the
[AbstractScenarioPlugin](../krkn/scenario_plugins/abstract_scenario_plugin.py) abstract class and implements its required methods.
## `AbstractScenarioPlugin` abstract class:
This [abstract class](../krkn/scenario_plugins/abstract_scenario_plugin.py) defines the contract between the plugin and Krkn.
It consists of two methods:
- `run(...)`
- `get_scenario_types()`
Most IDEs can automatically suggest and implement the abstract methods defined in `AbstractScenarioPlugin`:
![pycharm](scenario_plugin_pycharm.gif)
_(IntelliJ PyCharm)_
### `run(...)`
```python
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
```
This method represents the entry point of the plugin and the first method
that will be executed.
#### Parameters:
- `run_uuid`:
  - the uuid of the chaos run, generated by Krkn for every single run.
- `scenario`:
  - the config file of the scenario currently being executed.
- `krkn_config`:
  - the full dictionary representation of the `config.yaml`.
- `lib_telemetry`:
  - a composite object of all the [krkn-lib](https://krkn-chaos.github.io/krkn-lib-docs/modules.html) objects and methods needed by a Krkn plugin to run.
- `scenario_telemetry`:
  - the `ScenarioTelemetry` object of the scenario currently being executed.
### Return value:
Returns 0 if the scenario succeeds and 1 if it fails.
> [!WARNING]
> All exceptions must be handled __inside__ the `run` method and must not be propagated.
### `get_scenario_types()`:
```python
def get_scenario_types(self) -> list[str]:
```
Indicates the scenario types specified in the `config.yaml`. For the plugin to be properly
loaded, recognized and executed, it must be implemented and must return one or more
strings matching `scenario_type` strings set in the config.
> [!WARNING]
> Multiple strings can map to a *single* `ScenarioPlugin`, but the same string cannot map
> to different plugins; an exception will be thrown for `scenario_type` redefinition.
> [!Note]
> The `scenario_type` strings must be unique across all plugins; otherwise, an exception will be thrown.
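Putting the two methods together, a minimal plugin might look like the sketch below. Note that this is an illustrative sketch, not Krkn's real code: the abstract base class here is a simplified stand-in for the one shipped in `krkn/scenario_plugins/abstract_scenario_plugin.py`, and `ExampleScenarioPlugin` / `example_scenarios` are hypothetical names.

```python
from abc import ABC, abstractmethod


# Simplified stand-in for Krkn's abstract base class (the real one, with
# fully typed krkn-lib parameters, lives in
# krkn/scenario_plugins/abstract_scenario_plugin.py).
class AbstractScenarioPlugin(ABC):
    @abstractmethod
    def run(self, run_uuid: str, scenario: str, krkn_config: dict,
            lib_telemetry, scenario_telemetry) -> int: ...

    @abstractmethod
    def get_scenario_types(self) -> list[str]: ...


class ExampleScenarioPlugin(AbstractScenarioPlugin):
    """Hypothetical plugin; the class name carries the ScenarioPlugin suffix."""

    def run(self, run_uuid, scenario, krkn_config, lib_telemetry,
            scenario_telemetry) -> int:
        try:
            # A real plugin would parse `scenario` (a YAML file path)
            # and inject the chaos from here.
            print(f"run {run_uuid}: executing {scenario}")
            return 0  # success
        except Exception:
            # Exceptions are handled inside run() and never propagated.
            return 1  # failure

    def get_scenario_types(self) -> list[str]:
        # Must match the scenario_type string(s) used in config.yaml.
        return ["example_scenarios"]
```

Under these assumptions, the plugin loader would map the `example_scenarios` type from `config.yaml` to this class and invoke `run(...)` once per configured scenario file.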
## Naming conventions:
A key requirement for developing a plugin that will be properly loaded
by the plugin loader is following the established naming conventions.
These conventions are enforced to maintain a uniform and readable codebase,
making it easier to onboard new developers from the community.
### plugin folder:
- the plugin folder must be placed in the `krkn/scenario_plugins` folder, relative to the Krkn root folder
- the plugin folder __cannot__ contain the words
- `plugin`
- `scenario`
### plugin file name and class name:
- the plugin file containing the main plugin class must be named in _snake case_ and must have the suffix `_scenario_plugin`:
- `example_scenario_plugin.py`
- the main plugin class must be named in _capital camel case_ and must have the suffix `ScenarioPlugin`:
- `ExampleScenarioPlugin`
- the file name must match the class name in the respective syntax:
- `example_scenario_plugin.py` -> `ExampleScenarioPlugin`
### scenario type:
- the scenario type __must__ be unique across all scenarios.
### logging:
If your new scenario does not adhere to the naming conventions, an error log will be generated in the Krkn standard output,
providing details about the issue:
```commandline
2024-10-03 18:06:31,136 [INFO] 📣 `ScenarioPluginFactory`: types from config.yaml mapped to respective classes for execution:
2024-10-03 18:06:31,136 [INFO] ✅ type: application_outages_scenarios ➡️ `ApplicationOutageScenarioPlugin`
2024-10-03 18:06:31,136 [INFO] ✅ types: [hog_scenarios, arcaflow_scenario] ➡️ `ArcaflowScenarioPlugin`
2024-10-03 18:06:31,136 [INFO] ✅ type: container_scenarios ➡️ `ContainerScenarioPlugin`
2024-10-03 18:06:31,136 [INFO] ✅ type: managedcluster_scenarios ➡️ `ManagedClusterScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ types: [pod_disruption_scenarios, pod_network_scenario, vmware_node_scenarios, ibmcloud_node_scenarios] ➡️ `NativeScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: network_chaos_scenarios ➡️ `NetworkChaosScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: node_scenarios ➡️ `NodeActionsScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: pvc_scenarios ➡️ `PvcScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: service_disruption_scenarios ➡️ `ServiceDisruptionScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: service_hijacking_scenarios ➡️ `ServiceHijackingScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: cluster_shut_down_scenarios ➡️ `ShutDownScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: syn_flood_scenarios ➡️ `SynFloodScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: time_scenarios ➡️ `TimeActionsScenarioPlugin`
2024-10-03 18:06:31,137 [INFO] ✅ type: zone_outages_scenarios ➡️ `ZoneOutageScenarioPlugin`
2024-09-18 14:48:41,735 [INFO] Failed to load Scenario Plugins:
2024-09-18 14:48:41,735 [ERROR] ⛔ Class: ExamplePluginScenario Module: krkn.scenario_plugins.example.example_scenario_plugin
2024-09-18 14:48:41,735 [ERROR] ⚠️ scenario plugin class name must start with a capital letter, end with `ScenarioPlugin`, and cannot be just `ScenarioPlugin`.
```
>[!NOTE]
>If you're trying to understand how the scenario types in the `config.yaml` are mapped to
> their corresponding plugins, this log will guide you!
> Each scenario plugin class mentioned can be found in the `krkn/scenario_plugins` folder:
> simply convert the camel case notation to snake case and remove the `ScenarioPlugin` suffix from the class name,
> e.g. the `ShutDownScenarioPlugin` class can be found in the `krkn/scenario_plugins/shut_down` folder.
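That class-name-to-folder mapping is mechanical, so it can be sketched in a few lines. The helper below is hypothetical (it is not part of Krkn); it simply applies the documented convention:

```python
import re


def plugin_folder_for(class_name: str) -> str:
    # Hypothetical helper: strip the ScenarioPlugin suffix, then
    # convert capital camel case to snake case.
    stem = class_name.removesuffix("ScenarioPlugin")
    return re.sub(r"(?<!^)(?=[A-Z])", "_", stem).lower()


print(plugin_folder_for("ShutDownScenarioPlugin"))      # shut_down
print(plugin_folder_for("NetworkChaosScenarioPlugin"))  # network_chaos
```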
## ExampleScenarioPlugin
The [ExampleScenarioPlugin](../krkn/tests/test_classes/example_scenario_plugin.py) class included in the tests folder can be used as scaffolding for new plugins and is considered
part of the documentation.
For any questions or further guidance, feel free to reach out to us on the
[Kubernetes workspace](https://kubernetes.slack.com/) in the `#krkn` channel.
We're happy to assist. Now, __release the Krkn!__

Binary file not shown (340 KiB).


@@ -1,84 +0,0 @@
import yaml
import logging
import time
import kraken.cerberus.setup as cerberus
from jinja2 import Template
import kraken.invoke.command as runcommand
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
# Reads the scenario config, applies and deletes a network policy to
# block the traffic for the specified duration
def run(scenarios_list, config, wait_duration,kubecli: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
failed_post_scenarios = ""
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
for app_outage_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = app_outage_config
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, app_outage_config)
if len(app_outage_config) > 1:
try:
with open(app_outage_config, "r") as f:
app_outage_config_yaml = yaml.full_load(f)
scenario_config = app_outage_config_yaml["application_outage"]
pod_selector = get_yaml_item_value(
scenario_config, "pod_selector", "{}"
)
traffic_type = get_yaml_item_value(
scenario_config, "block", "[Ingress, Egress]"
)
namespace = get_yaml_item_value(
scenario_config, "namespace", ""
)
duration = get_yaml_item_value(
scenario_config, "duration", 60
)
start_time = int(time.time())
network_policy_template = """---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kraken-deny
spec:
podSelector:
matchLabels: {{ pod_selector }}
policyTypes: {{ traffic_type }}
"""
t = Template(network_policy_template)
rendered_spec = t.render(pod_selector=pod_selector, traffic_type=traffic_type)
yaml_spec = yaml.safe_load(rendered_spec)
# Block the traffic by creating network policy
logging.info("Creating the network policy")
kubecli.create_net_policy(yaml_spec, namespace)
# wait for the specified duration
logging.info("Waiting for the specified duration in the config: %s" % (duration))
time.sleep(duration)
# unblock the traffic by deleting the network policy
logging.info("Deleting the network policy")
kubecli.delete_net_policy("kraken-deny", namespace)
logging.info("End of scenario. Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
except Exception:
scenario_telemetry.exit_status = 1
failed_scenarios.append(app_outage_config)
log_exception(app_outage_config)
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries


@@ -1,2 +0,0 @@
from .arcaflow_plugin import *
from .context_auth import ContextAuth


@@ -1,180 +0,0 @@
import time
import arcaflow
import os
import yaml
import logging
from pathlib import Path
from typing import List
from .context_auth import ContextAuth
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
def run(scenarios_list: List[str], kubeconfig_path: str, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
scenario_telemetries: list[ScenarioTelemetry] = []
failed_post_scenarios = []
for scenario in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = scenario
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry,scenario)
engine_args = build_args(scenario)
status_code = run_workflow(engine_args, kubeconfig_path)
scenario_telemetry.end_timestamp = time.time()
scenario_telemetry.exit_status = status_code
scenario_telemetries.append(scenario_telemetry)
if status_code != 0:
failed_post_scenarios.append(scenario)
return failed_post_scenarios, scenario_telemetries
def run_workflow(engine_args: arcaflow.EngineArgs, kubeconfig_path: str) -> int:
set_arca_kubeconfig(engine_args, kubeconfig_path)
exit_status = arcaflow.run(engine_args)
return exit_status
def build_args(input_file: str) -> arcaflow.EngineArgs:
"""sets the kubeconfig parsed by setArcaKubeConfig as an input to the arcaflow workflow"""
current_path = Path().resolve()
context = f"{current_path}/{Path(input_file).parent}"
workflow = f"{context}/workflow.yaml"
config = f"{context}/config.yaml"
if not os.path.exists(context):
raise Exception(
"context folder for arcaflow workflow not found: {}".format(
context)
)
if not os.path.exists(input_file):
raise Exception(
"input file for arcaflow workflow not found: {}".format(input_file))
if not os.path.exists(workflow):
raise Exception(
"workflow file for arcaflow workflow not found: {}".format(
workflow)
)
if not os.path.exists(config):
raise Exception(
"configuration file for arcaflow workflow not found: {}".format(
config)
)
engine_args = arcaflow.EngineArgs()
engine_args.context = context
engine_args.config = config
engine_args.workflow = workflow
engine_args.input = f"{current_path}/{input_file}"
return engine_args
def set_arca_kubeconfig(engine_args: arcaflow.EngineArgs, kubeconfig_path: str):
context_auth = ContextAuth()
if not os.path.exists(kubeconfig_path):
raise Exception("kubeconfig not found in {}".format(kubeconfig_path))
with open(kubeconfig_path, "r") as stream:
try:
kubeconfig = yaml.safe_load(stream)
context_auth.fetch_auth_data(kubeconfig)
except Exception as e:
logging.error("impossible to read kubeconfig file in: {}".format(
kubeconfig_path))
raise e
kubeconfig_str = set_kubeconfig_auth(kubeconfig, context_auth)
with open(engine_args.input, "r") as stream:
input_file = yaml.safe_load(stream)
if "input_list" in input_file and isinstance(input_file["input_list"],list):
for index, _ in enumerate(input_file["input_list"]):
if isinstance(input_file["input_list"][index], dict):
input_file["input_list"][index]["kubeconfig"] = kubeconfig_str
else:
input_file["kubeconfig"] = kubeconfig_str
with open(engine_args.input, "w") as stream:
yaml.safe_dump(input_file, stream)
with open(engine_args.config, "r") as stream:
config_file = yaml.safe_load(stream)
if config_file["deployers"]["image"]["deployer_name"] == "kubernetes":
kube_connection = set_kubernetes_deployer_auth(config_file["deployers"]["image"]["connection"], context_auth)
config_file["deployers"]["image"]["connection"]=kube_connection
with open(engine_args.config, "w") as stream:
yaml.safe_dump(config_file, stream, explicit_start=True, width=4096)
def set_kubernetes_deployer_auth(deployer: any, context_auth: ContextAuth) -> any:
if context_auth.clusterHost is not None :
deployer["host"] = context_auth.clusterHost
if context_auth.clientCertificateData is not None :
deployer["cert"] = context_auth.clientCertificateData
if context_auth.clientKeyData is not None:
deployer["key"] = context_auth.clientKeyData
if context_auth.clusterCertificateData is not None:
deployer["cacert"] = context_auth.clusterCertificateData
if context_auth.username is not None:
deployer["username"] = context_auth.username
if context_auth.password is not None:
deployer["password"] = context_auth.password
if context_auth.bearerToken is not None:
deployer["bearerToken"] = context_auth.bearerToken
return deployer
def set_kubeconfig_auth(kubeconfig: any, context_auth: ContextAuth) -> str:
"""
Builds an arcaflow-compatible kubeconfig representation and returns it as a string.
In order to run arcaflow plugins in kubernetes/openshift, the kubeconfig must embed the client certificate/key
and the server certificate base64 encoded in *-data fields within the kubeconfig file itself. That is not always the
case; in fact, a kubeconfig may contain filesystem paths to those files instead. This function builds an arcaflow-compatible
kubeconfig and returns it as a string that can be safely included in input.yaml
"""
if "current-context" not in kubeconfig.keys():
raise Exception(
"invalid kubeconfig file, impossible to determine current-context"
)
user_id = None
cluster_id = None
user_name = None
cluster_name = None
current_context = kubeconfig["current-context"]
for context in kubeconfig["contexts"]:
if context["name"] == current_context:
user_name = context["context"]["user"]
cluster_name = context["context"]["cluster"]
if user_name is None:
raise Exception(
"user not set for context {} in kubeconfig file".format(current_context)
)
if cluster_name is None:
raise Exception(
"cluster not set for context {} in kubeconfig file".format(current_context)
)
for index, user in enumerate(kubeconfig["users"]):
if user["name"] == user_name:
user_id = index
for index, cluster in enumerate(kubeconfig["clusters"]):
if cluster["name"] == cluster_name:
cluster_id = index
if user_id is None:
raise Exception(
"no user {} found in kubeconfig users".format(user_name)
)
if cluster_id is None:
raise Exception(
"no cluster {} found in kubeconfig clusters".format(cluster_name)
)
if "client-certificate" in kubeconfig["users"][user_id]["user"]:
kubeconfig["users"][user_id]["user"]["client-certificate-data"] = context_auth.clientCertificateDataBase64
del kubeconfig["users"][user_id]["user"]["client-certificate"]
if "client-key" in kubeconfig["users"][user_id]["user"]:
kubeconfig["users"][user_id]["user"]["client-key-data"] = context_auth.clientKeyDataBase64
del kubeconfig["users"][user_id]["user"]["client-key"]
if "certificate-authority" in kubeconfig["clusters"][cluster_id]["cluster"]:
kubeconfig["clusters"][cluster_id]["cluster"]["certificate-authority-data"] = context_auth.clusterCertificateDataBase64
del kubeconfig["clusters"][cluster_id]["cluster"]["certificate-authority"]
kubeconfig_str = yaml.dump(kubeconfig)
return kubeconfig_str


@@ -1,144 +0,0 @@
import logging
from prometheus_api_client import PrometheusConnect
import pandas as pd
import urllib3
saved_metrics_path = "./utilisation.txt"
def convert_data_to_dataframe(data, label):
df = pd.DataFrame()
df['service'] = [item['metric']['pod'] for item in data]
df[label] = [item['value'][1] for item in data]
return df
def convert_data(data, service):
result = {}
for entry in data:
pod_name = entry['metric']['pod']
value = entry['value'][1]
result[pod_name] = value
return result.get(service)
def convert_data_limits(data, node_data, service, prometheus):
result = {}
for entry in data:
pod_name = entry['metric']['pod']
value = entry['value'][1]
result[pod_name] = value
return result.get(service, get_node_capacity(node_data, service, prometheus))  # pods without defined limits can consume unbounded resources, so fall back to the node's capacity
def get_node_capacity(node_data, pod_name, prometheus ):
# Get the node name on which the pod is running
query = f'kube_pod_info{{pod="{pod_name}"}}'
result = prometheus.custom_query(query)
if not result:
return None
node_name = result[0]['metric']['node']
for item in node_data:
if item['metric']['node'] == node_name:
return item['value'][1]
return '1000000000'
def save_utilization_to_file(utilization, filename, prometheus):
merged_df = pd.DataFrame(columns=['namespace', 'service', 'CPU', 'CPU_LIMITS', 'MEM', 'MEM_LIMITS', 'NETWORK'])
for namespace in utilization:
# Loading utilization_data[] for namespace
# indexes -- 0 CPU, 1 CPU limits, 2 mem, 3 mem limits, 4 network
utilization_data = utilization[namespace]
df_cpu = convert_data_to_dataframe(utilization_data[0], "CPU")
services = df_cpu.service.unique()
logging.info(f"Services for namespace {namespace}: {services}")
for s in services:
new_row_df = pd.DataFrame({
"namespace": namespace, "service": s,
"CPU": convert_data(utilization_data[0], s),
"CPU_LIMITS": convert_data_limits(utilization_data[1],utilization_data[5], s, prometheus),
"MEM": convert_data(utilization_data[2], s),
"MEM_LIMITS": convert_data_limits(utilization_data[3], utilization_data[6], s, prometheus),
"NETWORK": convert_data(utilization_data[4], s)}, index=[0])
merged_df = pd.concat([merged_df, new_row_df], ignore_index=True)
# Convert columns to string
merged_df['CPU'] = merged_df['CPU'].astype(str)
merged_df['MEM'] = merged_df['MEM'].astype(str)
merged_df['CPU_LIMITS'] = merged_df['CPU_LIMITS'].astype(str)
merged_df['MEM_LIMITS'] = merged_df['MEM_LIMITS'].astype(str)
merged_df['NETWORK'] = merged_df['NETWORK'].astype(str)
# Extract integer part before the decimal point
#merged_df['CPU'] = merged_df['CPU'].str.split('.').str[0]
#merged_df['MEM'] = merged_df['MEM'].str.split('.').str[0]
#merged_df['CPU_LIMITS'] = merged_df['CPU_LIMITS'].str.split('.').str[0]
#merged_df['MEM_LIMITS'] = merged_df['MEM_LIMITS'].str.split('.').str[0]
#merged_df['NETWORK'] = merged_df['NETWORK'].str.split('.').str[0]
merged_df.to_csv(filename, sep='\t', index=False)
def fetch_utilization_from_prometheus(prometheus_endpoint, auth_token,
namespaces, scrape_duration):
urllib3.disable_warnings()
prometheus = PrometheusConnect(url=prometheus_endpoint, headers={
'Authorization':'Bearer {}'.format(auth_token)}, disable_ssl=True)
# Dicts for saving utilisation and queries -- key is namespace
utilization = {}
queries = {}
logging.info("Fetching utilization...")
for namespace in namespaces:
# Fetch CPU utilization
cpu_query = 'sum (rate (container_cpu_usage_seconds_total{image!="", namespace="%s"}[%s])) by (pod) *1000' % (namespace,scrape_duration)
cpu_result = prometheus.custom_query(cpu_query)
cpu_limits_query = '(sum by (pod) (kube_pod_container_resource_limits{resource="cpu", namespace="%s"}))*1000' %(namespace)
cpu_limits_result = prometheus.custom_query(cpu_limits_query)
node_cpu_limits_query = 'kube_node_status_capacity{resource="cpu", unit="core"}*1000'
node_cpu_limits_result = prometheus.custom_query(node_cpu_limits_query)
mem_query = 'sum by (pod) (avg_over_time(container_memory_usage_bytes{image!="", namespace="%s"}[%s]))' % (namespace, scrape_duration)
mem_result = prometheus.custom_query(mem_query)
mem_limits_query = 'sum by (pod) (kube_pod_container_resource_limits{resource="memory", namespace="%s"}) ' %(namespace)
mem_limits_result = prometheus.custom_query(mem_limits_query)
node_mem_limits_query = 'kube_node_status_capacity{resource="memory", unit="byte"}'
node_mem_limits_result = prometheus.custom_query(node_mem_limits_query)
network_query = 'sum by (pod) ((avg_over_time(container_network_transmit_bytes_total{namespace="%s"}[%s])) + \
(avg_over_time(container_network_receive_bytes_total{namespace="%s"}[%s])))' % (namespace, scrape_duration, namespace, scrape_duration)
network_result = prometheus.custom_query(network_query)
utilization[namespace] = [cpu_result, cpu_limits_result, mem_result, mem_limits_result, network_result, node_cpu_limits_result, node_mem_limits_result ]
queries[namespace] = json_queries(cpu_query, cpu_limits_query, mem_query, mem_limits_query, network_query)
save_utilization_to_file(utilization, saved_metrics_path, prometheus)
return saved_metrics_path, queries
def json_queries(cpu_query, cpu_limits_query, mem_query, mem_limits_query, network_query):
queries = {
"cpu_query": cpu_query,
"cpu_limit_query": cpu_limits_query,
"memory_query": mem_query,
"memory_limit_query": mem_limits_query,
"network_query": network_query
}
return queries


@@ -1,68 +0,0 @@
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
namespace: {{managedcluster_name}}
name: managedcluster-scenarios-template
spec:
workload:
manifests:
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: scale-deploy
namespace: open-cluster-management
rules:
- apiGroups: ["apps"]
resources: ["deployments/scale"]
verbs: ["patch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get"]
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management-agent
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: v1
kind: ServiceAccount
metadata:
name: internal-kubectl
namespace: open-cluster-management
- apiVersion: batch/v1
kind: Job
metadata:
name: managedcluster-scenarios-template
namespace: open-cluster-management
spec:
template:
spec:
serviceAccountName: internal-kubectl
containers:
- name: kubectl
image: quay.io/sighup/kubectl-kustomize:1.21.6_3.9.1
command: ["/bin/sh", "-c"]
args:
- {{args}}
restartPolicy: Never
backoffLimit: 0


@@ -1,78 +0,0 @@
import yaml
import logging
import time
from kraken.managedcluster_scenarios.managedcluster_scenarios import managedcluster_scenarios
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions
import kraken.cerberus.setup as cerberus
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.utils.functions import get_yaml_item_value
# Get the managedcluster scenarios object of the specified cloud type
# krkn_lib
def get_managedcluster_scenario_object(managedcluster_scenario, kubecli: KrknKubernetes):
return managedcluster_scenarios(kubecli)
# Run defined scenarios
# krkn_lib
def run(scenarios_list, config, wait_duration, kubecli: KrknKubernetes):
for managedcluster_scenario_config in scenarios_list:
with open(managedcluster_scenario_config, "r") as f:
managedcluster_scenario_config = yaml.full_load(f)
for managedcluster_scenario in managedcluster_scenario_config["managedcluster_scenarios"]:
managedcluster_scenario_object = get_managedcluster_scenario_object(managedcluster_scenario, kubecli)
if managedcluster_scenario["actions"]:
for action in managedcluster_scenario["actions"]:
start_time = int(time.time())
inject_managedcluster_scenario(action, managedcluster_scenario, managedcluster_scenario_object, kubecli)
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.get_status(config, start_time, end_time)
logging.info("")
# Inject the specified managedcluster scenario
# krkn_lib
def inject_managedcluster_scenario(action, managedcluster_scenario, managedcluster_scenario_object, kubecli: KrknKubernetes):
# Get the managedcluster scenario configurations
run_kill_count = get_yaml_item_value(
managedcluster_scenario, "runs", 1
)
instance_kill_count = get_yaml_item_value(
managedcluster_scenario, "instance_count", 1
)
managedcluster_name = get_yaml_item_value(
managedcluster_scenario, "managedcluster_name", ""
)
label_selector = get_yaml_item_value(
managedcluster_scenario, "label_selector", ""
)
timeout = get_yaml_item_value(managedcluster_scenario, "timeout", 120)
# Get the managedcluster to apply the scenario
if managedcluster_name:
managedcluster_name_list = managedcluster_name.split(",")
else:
managedcluster_name_list = [managedcluster_name]
for single_managedcluster_name in managedcluster_name_list:
managedclusters = common_managedcluster_functions.get_managedcluster(single_managedcluster_name, label_selector, instance_kill_count, kubecli)
for single_managedcluster in managedclusters:
if action == "managedcluster_start_scenario":
managedcluster_scenario_object.managedcluster_start_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_stop_scenario":
managedcluster_scenario_object.managedcluster_stop_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_stop_start_scenario":
managedcluster_scenario_object.managedcluster_stop_start_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_termination_scenario":
managedcluster_scenario_object.managedcluster_termination_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_reboot_scenario":
managedcluster_scenario_object.managedcluster_reboot_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "stop_start_klusterlet_scenario":
managedcluster_scenario_object.stop_start_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "start_klusterlet_scenario":
managedcluster_scenario_object.stop_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "stop_klusterlet_scenario":
managedcluster_scenario_object.stop_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_crash_scenario":
managedcluster_scenario_object.managedcluster_crash_scenario(run_kill_count, single_managedcluster, timeout)
else:
logging.info("There is no managedcluster action that matches %s, skipping scenario" % action)


@@ -1,210 +0,0 @@
import yaml
import logging
import time
import os
import random
import kraken.cerberus.setup as cerberus
import kraken.node_actions.common_node_functions as common_node_functions
from jinja2 import Environment, FileSystemLoader
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
# krkn_lib
# Reads the scenario config and introduces traffic variations in Node's host network interface.
def run(scenarios_list, config, wait_duration, kubecli: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
failed_post_scenarios = ""
logging.info("Runing the Network Chaos tests")
failed_post_scenarios = ""
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
for net_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = net_config
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, net_config)
try:
with open(net_config, "r") as file:
param_lst = ["latency", "loss", "bandwidth"]
test_config = yaml.safe_load(file)
test_dict = test_config["network_chaos"]
test_duration = int(
get_yaml_item_value(test_dict, "duration", 300)
)
test_interface = get_yaml_item_value(
test_dict, "interfaces", []
)
test_node = get_yaml_item_value(test_dict, "node_name", "")
test_node_label = get_yaml_item_value(
test_dict, "label_selector",
"node-role.kubernetes.io/master"
)
test_execution = get_yaml_item_value(
test_dict, "execution", "serial"
)
test_instance_count = get_yaml_item_value(
test_dict, "instance_count", 1
)
test_egress = get_yaml_item_value(
test_dict, "egress", {"bandwidth": "100mbit"}
)
if test_node:
node_name_list = test_node.split(",")
else:
node_name_list = [test_node]
nodelst = []
for single_node_name in node_name_list:
nodelst.extend(common_node_functions.get_node(single_node_name, test_node_label, test_instance_count, kubecli))
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=True)
pod_template = env.get_template("pod.j2")
test_interface = verify_interface(test_interface, nodelst, pod_template, kubecli)
joblst = []
egress_lst = [i for i in param_lst if i in test_egress]
chaos_config = {
"network_chaos": {
"duration": test_duration,
"interfaces": test_interface,
"node_name": ",".join(nodelst),
"execution": test_execution,
"instance_count": test_instance_count,
"egress": test_egress,
}
}
logging.info("Executing network chaos with config \n %s" % yaml.dump(chaos_config))
job_template = env.get_template("job.j2")
try:
for i in egress_lst:
for node in nodelst:
exec_cmd = get_egress_cmd(
test_execution, test_interface, i, test_dict["egress"], duration=test_duration
)
logging.info("Executing %s on node %s" % (exec_cmd, node))
job_body = yaml.safe_load(
job_template.render(jobname=i + str(hash(node))[:5], nodename=node, cmd=exec_cmd)
)
joblst.append(job_body["metadata"]["name"])
api_response = kubecli.create_job(job_body)
if api_response is None:
raise Exception("Error creating job")
if test_execution == "serial":
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
wait_for_job(joblst[:], kubecli, test_duration + 300)
logging.info("Waiting for wait_duration %s" % wait_duration)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
if test_execution == "parallel":
break
if test_execution == "parallel":
logging.info("Waiting for parallel job to finish")
start_time = int(time.time())
wait_for_job(joblst[:], kubecli, test_duration + 300)
logging.info("Waiting for wait_duration %s" % wait_duration)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
except Exception as e:
logging.error("Network Chaos exiting due to Exception %s" % e)
raise RuntimeError()
finally:
logging.info("Deleting jobs")
delete_job(joblst[:], kubecli)
except (RuntimeError, Exception):
scenario_telemetry.exit_status = 1
failed_scenarios.append(net_config)
log_exception(net_config)
else:
scenario_telemetry.exit_status = 0
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries
# krkn_lib
def verify_interface(test_interface, nodelst, template, kubecli: KrknKubernetes):
pod_index = random.randint(0, len(nodelst) - 1)
pod_body = yaml.safe_load(template.render(nodename=nodelst[pod_index]))
logging.info("Creating pod to query interface on node %s" % nodelst[pod_index])
kubecli.create_pod(pod_body, "default", 300)
try:
if test_interface == []:
cmd = "ip r | grep default | awk '/default/ {print $5}'"
output = kubecli.exec_cmd_in_pod(cmd, "fedtools", "default")
test_interface = [output.replace("\n", "")]
else:
cmd = "ip -br addr show|awk -v ORS=',' '{print $1}'"
output = kubecli.exec_cmd_in_pod(cmd, "fedtools", "default")
interface_lst = output[:-1].split(",")
for interface in test_interface:
if interface not in interface_lst:
logging.error("Interface %s not found in node %s interface list %s" % (interface, nodelst[pod_index], interface_lst))
#sys.exit(1)
raise RuntimeError()
return test_interface
finally:
logging.info("Deleteing pod to query interface on node")
kubecli.delete_pod("fedtools", "default")
# krkn_lib
def get_job_pods(api_response, kubecli: KrknKubernetes):
controllerUid = api_response.metadata.labels["controller-uid"]
pod_label_selector = "controller-uid=" + controllerUid
pods_list = kubecli.list_pods(label_selector=pod_label_selector, namespace="default")
return pods_list[0]
# krkn_lib
def wait_for_job(joblst, kubecli: KrknKubernetes, timeout=300):
waittime = time.time() + timeout
count = 0
joblen = len(joblst)
while count != joblen:
for jobname in joblst:
try:
api_response = kubecli.get_job_status(jobname, namespace="default")
if api_response.status.succeeded is not None or api_response.status.failed is not None:
count += 1
joblst.remove(jobname)
except Exception:
logging.warning("Exception in getting job status")
if time.time() > waittime:
raise Exception("Starting pod failed")
time.sleep(5)
# krkn_lib
def delete_job(joblst, kubecli: KrknKubernetes):
for jobname in joblst:
try:
api_response = kubecli.get_job_status(jobname, namespace="default")
if api_response.status.failed is not None:
pod_name = get_job_pods(api_response, kubecli)
pod_stat = kubecli.read_pod(name=pod_name, namespace="default")
logging.error(pod_stat.status.container_statuses)
pod_log_response = kubecli.get_pod_log(name=pod_name, namespace="default")
pod_log = pod_log_response.data.decode("utf-8")
logging.error(pod_log)
except Exception:
logging.warning("Exception in getting job status")
kubecli.delete_job(name=jobname, namespace="default")
def get_egress_cmd(execution, test_interface, mod, vallst, duration=30):
tc_set = tc_unset = tc_ls = ""
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
for i in test_interface:
tc_set = "{0} tc qdisc add dev {1} root netem".format(tc_set, i)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, i)
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, i)
if execution == "parallel":
for val in vallst.keys():
tc_set += " {0} {1} ".format(param_map[val], vallst[val])
tc_set += ";"
else:
tc_set += " {0} {1} ;".format(param_map[mod], vallst[mod])
exec_cmd = "{0} {1} sleep {2};{3} sleep 20;{4}".format(tc_set, tc_ls, duration, tc_unset, tc_ls)
return exec_cmd


@@ -1,154 +0,0 @@
import yaml
import logging
import sys
import time
from kraken.node_actions.aws_node_scenarios import aws_node_scenarios
from kraken.node_actions.general_cloud_node_scenarios import general_node_scenarios
from kraken.node_actions.az_node_scenarios import azure_node_scenarios
from kraken.node_actions.gcp_node_scenarios import gcp_node_scenarios
from kraken.node_actions.openstack_node_scenarios import openstack_node_scenarios
from kraken.node_actions.alibaba_node_scenarios import alibaba_node_scenarios
from kraken.node_actions.bm_node_scenarios import bm_node_scenarios
from kraken.node_actions.docker_node_scenarios import docker_node_scenarios
import kraken.node_actions.common_node_functions as common_node_functions
import kraken.cerberus.setup as cerberus
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
node_general = False
# Get the node scenarios object of the specified cloud type
# krkn_lib
def get_node_scenario_object(node_scenario, kubecli: KrknKubernetes):
if "cloud_type" not in node_scenario.keys() or node_scenario["cloud_type"] == "generic":
global node_general
node_general = True
return general_node_scenarios(kubecli)
if node_scenario["cloud_type"] == "aws":
return aws_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "gcp":
return gcp_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "openstack":
return openstack_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "azure" or node_scenario["cloud_type"] == "az":
return azure_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "alibaba" or node_scenario["cloud_type"] == "alicloud":
return alibaba_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "bm":
return bm_node_scenarios(
node_scenario.get("bmc_info"), node_scenario.get("bmc_user", None), node_scenario.get("bmc_password", None),
kubecli
)
elif node_scenario["cloud_type"] == "docker":
return docker_node_scenarios(kubecli)
else:
logging.error(
"Cloud type " + node_scenario["cloud_type"] + " is not currently supported; "
"try using 'generic' if you want to stop/start the kubelet or run a fork bomb on any "
"cluster"
)
sys.exit(1)
# Run defined scenarios
# krkn_lib
def run(scenarios_list, config, wait_duration, kubecli: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
for node_scenario_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = node_scenario_config
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, node_scenario_config)
with open(node_scenario_config, "r") as f:
node_scenario_config = yaml.full_load(f)
for node_scenario in node_scenario_config["node_scenarios"]:
node_scenario_object = get_node_scenario_object(node_scenario, kubecli)
if node_scenario["actions"]:
for action in node_scenario["actions"]:
start_time = int(time.time())
try:
inject_node_scenario(action, node_scenario, node_scenario_object, kubecli)
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.get_status(config, start_time, end_time)
logging.info("")
except (RuntimeError, Exception) as e:
scenario_telemetry.exit_status = 1
failed_scenarios.append(node_scenario_config)
log_exception(node_scenario_config)
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries
# Inject the specified node scenario
def inject_node_scenario(action, node_scenario, node_scenario_object, kubecli: KrknKubernetes):
generic_cloud_scenarios = ("stop_kubelet_scenario", "node_crash_scenario")
# Get the node scenario configurations
run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)
instance_kill_count = get_yaml_item_value(
node_scenario, "instance_count", 1
)
node_name = get_yaml_item_value(node_scenario, "node_name", "")
label_selector = get_yaml_item_value(node_scenario, "label_selector", "")
if action == "node_stop_start_scenario":
duration = get_yaml_item_value(node_scenario, "duration", 120)
timeout = get_yaml_item_value(node_scenario, "timeout", 120)
service = get_yaml_item_value(node_scenario, "service", "")
ssh_private_key = get_yaml_item_value(
node_scenario, "ssh_private_key", "~/.ssh/id_rsa"
)
# Get the node to apply the scenario
if node_name:
node_name_list = node_name.split(",")
else:
node_name_list = [node_name]
for single_node_name in node_name_list:
nodes = common_node_functions.get_node(single_node_name, label_selector, instance_kill_count, kubecli)
for single_node in nodes:
if node_general and action not in generic_cloud_scenarios:
logging.info("Scenario: " + action + " is not set up for generic cloud type, skipping action")
else:
if action == "node_start_scenario":
node_scenario_object.node_start_scenario(run_kill_count, single_node, timeout)
elif action == "node_stop_scenario":
node_scenario_object.node_stop_scenario(run_kill_count, single_node, timeout)
elif action == "node_stop_start_scenario":
node_scenario_object.node_stop_start_scenario(run_kill_count, single_node, timeout, duration)
elif action == "node_termination_scenario":
node_scenario_object.node_termination_scenario(run_kill_count, single_node, timeout)
elif action == "node_reboot_scenario":
node_scenario_object.node_reboot_scenario(run_kill_count, single_node, timeout)
elif action == "stop_start_kubelet_scenario":
node_scenario_object.stop_start_kubelet_scenario(run_kill_count, single_node, timeout)
elif action == "stop_kubelet_scenario":
node_scenario_object.stop_kubelet_scenario(run_kill_count, single_node, timeout)
elif action == "node_crash_scenario":
node_scenario_object.node_crash_scenario(run_kill_count, single_node, timeout)
elif action == "stop_start_helper_node_scenario":
if node_scenario["cloud_type"] != "openstack":
logging.error(
"Scenario: " + action + " is not supported for "
"cloud type " + node_scenario["cloud_type"] + ", skipping action"
)
else:
if not node_scenario["helper_node_ip"]:
logging.error("Helper node IP address is not provided")
sys.exit(1)
node_scenario_object.helper_node_stop_start_scenario(
run_kill_count, node_scenario["helper_node_ip"], timeout
)
node_scenario_object.helper_node_service_status(
node_scenario["helper_node_ip"], service, ssh_private_key, timeout
)
else:
logging.info("There is no node action that matches %s, skipping scenario" % action)
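The long if/elif chain above dispatches purely on the action name. A minimal sketch of the same pattern as a dispatch table, using stand-in callables (the handler names mirror the scenario methods but are illustrative, not the real node-scenario API):

```python
# Hypothetical sketch: action-name dispatch via a dict lookup instead of
# an if/elif chain. Handlers here are stand-ins, not the real
# node_scenario_object methods.
def dispatch_action(handlers, action, *args):
    handler = handlers.get(action)
    if handler is None:
        return "There is no node action that matches %s, skipping scenario" % action
    return handler(*args)

handlers = {
    "node_start_scenario": lambda node, timeout: ("start", node, timeout),
    "node_stop_scenario": lambda node, timeout: ("stop", node, timeout),
    "node_reboot_scenario": lambda node, timeout: ("reboot", node, timeout),
}
```

A lookup either runs the matching handler or falls through to the same "skipping scenario" message as the chain above.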


@@ -1,320 +0,0 @@
import dataclasses
import json
import logging
from os.path import abspath
from typing import List, Dict, Any
import time
from arcaflow_plugin_sdk import schema, serialization, jsonschema
from arcaflow_plugin_kill_pod import kill_pods, wait_for_pods
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
import kraken.plugins.node_scenarios.vmware_plugin as vmware_plugin
import kraken.plugins.node_scenarios.ibmcloud_plugin as ibmcloud_plugin
from kraken.plugins.run_python_plugin import run_python_file
from kraken.plugins.network.ingress_shaping import network_chaos
from kraken.plugins.pod_network_outage.pod_network_outage_plugin import pod_outage
from kraken.plugins.pod_network_outage.pod_network_outage_plugin import pod_egress_shaping
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from kraken.plugins.pod_network_outage.pod_network_outage_plugin import pod_ingress_shaping
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import log_exception
@dataclasses.dataclass
class PluginStep:
schema: schema.StepSchema
error_output_ids: List[str]
def render_output(self, output_id: str, output_data) -> str:
return json.dumps({
"output_id": output_id,
"output_data": self.schema.outputs[output_id].serialize(output_data),
}, indent='\t')
class Plugins:
"""
Plugins is a class that can run plugins sequentially. The output is rendered to the standard output and the process
is aborted if a step fails.
"""
steps_by_id: Dict[str, PluginStep]
def __init__(self, steps: List[PluginStep]):
self.steps_by_id = dict()
for step in steps:
if step.schema.id in self.steps_by_id:
raise Exception(
"Duplicate step ID: {}".format(step.schema.id)
)
self.steps_by_id[step.schema.id] = step
def unserialize_scenario(self, file: str) -> Any:
return serialization.load_from_file(abspath(file))
def run(self, file: str, kubeconfig_path: str, kraken_config: str, run_uuid:str):
"""
Run executes a series of steps
"""
data = self.unserialize_scenario(abspath(file))
if not isinstance(data, list):
raise Exception(
"Invalid scenario configuration file: {} expected list, found {}".format(file, type(data).__name__)
)
i = 0
for entry in data:
if not isinstance(entry, dict):
raise Exception(
"Invalid scenario configuration file: {} expected a list of dicts, found {} on step {}".format(
file,
type(entry).__name__,
i
)
)
if "id" not in entry:
raise Exception(
"Invalid scenario configuration file: {} missing 'id' field on step {}".format(
file,
i,
)
)
if "config" not in entry:
raise Exception(
"Invalid scenario configuration file: {} missing 'config' field on step {}".format(
file,
i,
)
)
if entry["id"] not in self.steps_by_id:
raise Exception(
"Invalid step {} in {} ID: {} expected one of: {}".format(
i,
file,
entry["id"],
', '.join(self.steps_by_id.keys())
)
)
step = self.steps_by_id[entry["id"]]
unserialized_input = step.schema.input.unserialize(entry["config"])
if "kubeconfig_path" in step.schema.input.properties:
unserialized_input.kubeconfig_path = kubeconfig_path
if "kraken_config" in step.schema.input.properties:
unserialized_input.kraken_config = kraken_config
output_id, output_data = step.schema(params=unserialized_input, run_id=run_uuid)
logging.info(step.render_output(output_id, output_data) + "\n")
if output_id in step.error_output_ids:
raise Exception(
"Step {} in {} ({}) failed".format(i, file, step.schema.id)
)
i = i + 1
def json_schema(self):
"""
This function generates a JSON schema document and renders it from the steps passed.
"""
result = {
"$id": "https://github.com/redhat-chaos/krkn/",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Kraken Arcaflow scenarios",
"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
"type": "array",
"minContains": 1,
"items": {
"oneOf": [
]
}
}
for step_id in self.steps_by_id.keys():
step = self.steps_by_id[step_id]
step_input = jsonschema.step_input(step.schema)
del step_input["$id"]
del step_input["$schema"]
del step_input["title"]
del step_input["description"]
result["items"]["oneOf"].append({
"type": "object",
"properties": {
"id": {
"type": "string",
"const": step_id,
},
"config": step_input,
},
"required": [
"id",
"config",
]
})
return json.dumps(result, indent="\t")
PLUGINS = Plugins(
[
PluginStep(
kill_pods,
[
"error",
]
),
PluginStep(
wait_for_pods,
[
"error"
]
),
PluginStep(
run_python_file,
[
"error"
]
),
PluginStep(
vmware_plugin.node_start,
[
"error"
]
),
PluginStep(
vmware_plugin.node_stop,
[
"error"
]
),
PluginStep(
vmware_plugin.node_reboot,
[
"error"
]
),
PluginStep(
vmware_plugin.node_terminate,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_start,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_stop,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_reboot,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_terminate,
[
"error"
]
),
PluginStep(
network_chaos,
[
"error"
]
),
PluginStep(
pod_outage,
[
"error"
]
),
PluginStep(
pod_egress_shaping,
[
"error"
]
),
PluginStep(
pod_ingress_shaping,
[
"error"
]
)
]
)
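`Plugins.__init__` above refuses to register two steps with the same schema ID. A standalone sketch of that guard, using plain strings in place of `StepSchema` objects (illustrative only):

```python
# Sketch of the duplicate-ID guard in Plugins.__init__, with strings
# standing in for StepSchema instances.
def index_steps(step_ids):
    steps_by_id = {}
    for step_id in step_ids:
        if step_id in steps_by_id:
            raise Exception("Duplicate step ID: {}".format(step_id))
        steps_by_id[step_id] = step_id
    return steps_by_id
```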
def run(scenarios: List[str],
kubeconfig_path: str,
kraken_config: str,
failed_post_scenarios: List[str],
wait_duration: int,
telemetry: KrknTelemetryKubernetes,
kubecli: KrknKubernetes,
run_uuid: str
) -> (List[str], list[ScenarioTelemetry]):
scenario_telemetries: list[ScenarioTelemetry] = []
for scenario in scenarios:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = scenario
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, scenario)
logging.info('scenario ' + str(scenario))
pool = PodsMonitorPool(kubecli)
kill_scenarios = [kill_scenario for kill_scenario in PLUGINS.unserialize_scenario(scenario) if kill_scenario["id"] == "kill-pods"]
try:
start_monitoring(pool, kill_scenarios)
PLUGINS.run(scenario, kubeconfig_path, kraken_config, run_uuid)
result = pool.join()
scenario_telemetry.affected_pods = result
if result.error:
raise Exception(f"unrecovered pods: {result.error}")
except Exception as e:
logging.error(f"scenario exception: {str(e)}")
scenario_telemetry.exit_status = 1
pool.cancel()
failed_post_scenarios.append(scenario)
log_exception(scenario)
else:
scenario_telemetry.exit_status = 0
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
scenario_telemetries.append(scenario_telemetry)
scenario_telemetry.end_timestamp = time.time()
return failed_post_scenarios, scenario_telemetries
def start_monitoring(pool: PodsMonitorPool, scenarios: list[Any]):
for kill_scenario in scenarios:
recovery_time = kill_scenario["config"]["krkn_pod_recovery_time"]
if ("namespace_pattern" in kill_scenario["config"] and
"label_selector" in kill_scenario["config"]):
namespace_pattern = kill_scenario["config"]["namespace_pattern"]
label_selector = kill_scenario["config"]["label_selector"]
pool.select_and_monitor_by_namespace_pattern_and_label(
namespace_pattern=namespace_pattern,
label_selector=label_selector,
max_timeout=recovery_time)
logging.info(
f"waiting {recovery_time} seconds for pod recovery, "
f"pod label selector: {label_selector} namespace pattern: {namespace_pattern}")
elif ("namespace_pattern" in kill_scenario["config"] and
"name_pattern" in kill_scenario["config"]):
namespace_pattern = kill_scenario["config"]["namespace_pattern"]
name_pattern = kill_scenario["config"]["name_pattern"]
pool.select_and_monitor_by_name_pattern_and_namespace_pattern(pod_name_pattern=name_pattern,
namespace_pattern=namespace_pattern,
max_timeout=recovery_time)
logging.info(f"waiting {recovery_time} seconds for pod recovery, "
f"pod name pattern: {name_pattern} namespace pattern: {namespace_pattern}")
else:
raise Exception(f"impossible to determine monitor parameters, check {kill_scenario} configuration")
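`start_monitoring` above selects a monitor strategy from which keys appear in the kill-pods config: `namespace_pattern` must be paired with either `label_selector` or `name_pattern`. A minimal sketch of that branching, returning a mode label instead of starting a monitor:

```python
# Sketch of the key-based branching in start_monitoring: a kill-pods
# config needs namespace_pattern plus either label_selector or
# name_pattern; anything else is rejected.
def monitor_mode(config):
    if "namespace_pattern" in config and "label_selector" in config:
        return "namespace_and_label"
    if "namespace_pattern" in config and "name_pattern" in config:
        return "namespace_and_name"
    raise Exception(
        f"impossible to determine monitor parameters, check {config} configuration"
    )
```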


@@ -1,4 +0,0 @@
from kraken.plugins import PLUGINS
if __name__ == "__main__":
print(PLUGINS.json_schema())


@@ -1,256 +0,0 @@
import logging
import time
from typing import Any
import yaml
import sys
import random
import arcaflow_plugin_kill_pod
from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
import kraken.cerberus.setup as cerberus
import kraken.post_actions.actions as post_actions
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from arcaflow_plugin_sdk import serialization
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
# Run pod based scenarios
def run(kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_duration):
# Loop to run the scenarios starts here
for pod_scenario in scenarios_list:
if len(pod_scenario) > 1:
pre_action_output = post_actions.run(kubeconfig_path, pod_scenario[1])
else:
pre_action_output = ""
try:
# capture start time
start_time = int(time.time())
scenario_input = serialization.load_from_file(pod_scenario)
s = arcaflow_plugin_kill_pod.get_schema()
input_data: arcaflow_plugin_kill_pod.KillPodConfig = s.unserialize_input("pod", scenario_input)
if kubeconfig_path is not None:
input_data.kubeconfig_path = kubeconfig_path
output_id, output_data = s.call_step("pod", input_data)
if output_id == "error":
data: arcaflow_plugin_kill_pod.PodErrorOutput = output_data
logging.error("Failed to run pod scenario: {}".format(data.error))
else:
data: arcaflow_plugin_kill_pod.PodSuccessOutput = output_data
for pod in data.pods:
print("Deleted pod {} in namespace {}\n".format(pod.pod_name, pod.pod_namespace))
except Exception as e:
logging.error(
"Failed to run scenario: %s. Encountered the following " "exception: %s" % (pod_scenario[0], e)
)
sys.exit(1)
logging.info("Scenario: %s has been successfully injected!" % (pod_scenario[0]))
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
try:
failed_post_scenarios = post_actions.check_recovery(
kubeconfig_path, pod_scenario, failed_post_scenarios, pre_action_output
)
except Exception as e:
logging.error("Failed to run post action checks: %s" % e)
sys.exit(1)
# capture end time
end_time = int(time.time())
# publish cerberus status
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
return failed_post_scenarios
# krkn_lib
def container_run(kubeconfig_path,
scenarios_list,
config,
failed_post_scenarios,
wait_duration,
kubecli: KrknKubernetes,
telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
failed_scenarios = []
scenario_telemetries: list[ScenarioTelemetry] = []
pool = PodsMonitorPool(kubecli)
for container_scenario_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = container_scenario_config[0]
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, container_scenario_config[0])
if len(container_scenario_config) > 1:
pre_action_output = post_actions.run(kubeconfig_path, container_scenario_config[1])
else:
pre_action_output = ""
with open(container_scenario_config[0], "r") as f:
cont_scenario_config = yaml.full_load(f)
start_monitoring(kill_scenarios=cont_scenario_config["scenarios"], pool=pool)
for cont_scenario in cont_scenario_config["scenarios"]:
# capture start time
start_time = int(time.time())
try:
killed_containers = container_killing_in_pod(cont_scenario, kubecli)
logging.info(f"killed containers: {str(killed_containers)}")
result = pool.join()
if result.error:
raise Exception(f"pods failed to recover: {result.error}")
scenario_telemetry.affected_pods = result
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
# capture end time
end_time = int(time.time())
# publish cerberus status
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
except Exception:
pool.cancel()
failed_scenarios.append(container_scenario_config[0])
log_exception(container_scenario_config[0])
scenario_telemetry.exit_status = 1
# removed_exit
# sys.exit(1)
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries
def start_monitoring(kill_scenarios: list[Any], pool: PodsMonitorPool):
for kill_scenario in kill_scenarios:
namespace_pattern = f"^{kill_scenario['namespace']}$"
label_selector = kill_scenario["label_selector"]
recovery_time = kill_scenario["expected_recovery_time"]
pool.select_and_monitor_by_namespace_pattern_and_label(
namespace_pattern=namespace_pattern,
label_selector=label_selector,
max_timeout=recovery_time)
def container_killing_in_pod(cont_scenario, kubecli: KrknKubernetes):
scenario_name = get_yaml_item_value(cont_scenario, "name", "")
namespace = get_yaml_item_value(cont_scenario, "namespace", "*")
label_selector = get_yaml_item_value(cont_scenario, "label_selector", None)
pod_names = get_yaml_item_value(cont_scenario, "pod_names", [])
container_name = get_yaml_item_value(cont_scenario, "container_name", "")
kill_action = get_yaml_item_value(cont_scenario, "action", 1)
kill_count = get_yaml_item_value(cont_scenario, "count", 1)
if not isinstance(kill_action, int):
logging.error("Please make sure the action parameter defined in the "
"config is an integer")
raise RuntimeError()
if (kill_action < 1) or (kill_action > 15):
logging.error("Only 1-15 kill signals are supported.")
raise RuntimeError()
kill_action = "kill " + str(kill_action)
if not isinstance(pod_names, list):
logging.error("Please make sure your pod_names are in a list format")
# removed_exit
# sys.exit(1)
raise RuntimeError()
if len(pod_names) == 0:
if namespace == "*":
# returns double array of pod name and namespace
pods = kubecli.get_all_pods(label_selector)
else:
# Only returns pod names
pods = kubecli.list_pods(namespace, label_selector)
else:
if namespace == "*":
logging.error("You must specify the namespace to kill a container in a specific pod")
logging.error("Scenario " + scenario_name + " failed")
# removed_exit
# sys.exit(1)
raise RuntimeError()
pods = pod_names
# get container and pod name
container_pod_list = []
for pod in pods:
if isinstance(pod, list):
pod_output = kubecli.get_pod_info(pod[0], pod[1])
container_names = [container.name for container in pod_output.containers]
container_pod_list.append([pod[0], pod[1], container_names])
else:
pod_output = kubecli.get_pod_info(pod, namespace)
container_names = [container.name for container in pod_output.containers]
container_pod_list.append([pod, namespace, container_names])
killed_count = 0
killed_container_list = []
while killed_count < kill_count:
if len(container_pod_list) == 0:
logging.error("Trying to kill more containers than were found, try lowering kill count")
logging.error("Scenario " + scenario_name + " failed")
# removed_exit
# sys.exit(1)
raise RuntimeError()
selected_container_pod = container_pod_list[random.randint(0, len(container_pod_list) - 1)]
for c_name in selected_container_pod[2]:
if container_name != "":
if c_name == container_name:
killed_container_list.append([selected_container_pod[0], selected_container_pod[1], c_name])
retry_container_killing(kill_action, selected_container_pod[0], selected_container_pod[1], c_name, kubecli)
break
else:
killed_container_list.append([selected_container_pod[0], selected_container_pod[1], c_name])
retry_container_killing(kill_action, selected_container_pod[0], selected_container_pod[1], c_name, kubecli)
break
container_pod_list.remove(selected_container_pod)
killed_count += 1
logging.info("Scenario " + scenario_name + " successfully injected")
return killed_container_list
def retry_container_killing(kill_action, podname, namespace, container_name, kubecli: KrknKubernetes):
i = 0
while i < 5:
logging.info("Killing container %s in pod %s (ns %s)" % (str(container_name), str(podname), str(namespace)))
response = kubecli.exec_cmd_in_pod(kill_action, podname, namespace, container_name)
i += 1
# Blank response means it is done
if not response:
break
elif "unauthorized" in response.lower() or "authorization" in response.lower():
time.sleep(2)
continue
else:
logging.warning(response)
continue
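`retry_container_killing` above retries the kill command up to five times, treating an empty response as success and backing off briefly on authorization errors. A hypothetical helper factoring out that policy (names and the fixed attempt count are assumptions for illustration):

```python
import time

# Hypothetical retry helper mirroring retry_container_killing: up to
# `attempts` tries, a falsy response means success, and an
# authorization error sleeps `delay` seconds before the next try.
def retry_until_empty(fn, attempts=5, delay=0):
    for _ in range(attempts):
        response = fn()
        if not response:
            return True
        if "unauthorized" in response.lower() or "authorization" in response.lower():
            time.sleep(delay)
    return False
```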
def check_failed_containers(killed_container_list, wait_time, kubecli: KrknKubernetes):
container_ready = []
timer = 0
while timer <= wait_time:
for killed_container in killed_container_list:
# pod namespace contain name
pod_output = kubecli.get_pod_info(killed_container[0], killed_container[1])
for container in pod_output.containers:
if container.name == killed_container[2]:
if container.ready:
container_ready.append(killed_container)
if len(container_ready) != 0:
for item in container_ready:
killed_container_list.remove(item)
if len(killed_container_list) == 0:
return []
timer += 5
logging.info("Waiting 5 seconds for containers to become ready")
time.sleep(5)
return killed_container_list


@@ -1,48 +0,0 @@
import logging
import kraken.invoke.command as runcommand
def run(kubeconfig_path, scenario, pre_action_output=""):
if scenario.endswith(".yaml") or scenario.endswith(".yml"):
logging.error("Powerfulseal support has recently been removed. Please switch to using plugins instead.")
elif scenario.endswith(".py"):
action_output = runcommand.invoke("python3 " + scenario).strip()
if pre_action_output:
if pre_action_output == action_output:
logging.info(scenario + " post action checks passed")
else:
logging.info(scenario + " post action response did not match pre check output")
logging.info("Pre action output: " + str(pre_action_output) + "\n")
logging.info("Post action output: " + str(action_output))
return False
elif scenario != "":
# invoke custom bash script
action_output = runcommand.invoke(scenario).strip()
if pre_action_output:
if pre_action_output == action_output:
logging.info(scenario + " post action checks passed")
else:
logging.info(scenario + " post action response did not match pre check output")
return False
return action_output
# Perform the post scenario actions to see if components recovered
def check_recovery(kubeconfig_path, scenario, failed_post_scenarios, pre_action_output):
if failed_post_scenarios:
for failed_scenario in failed_post_scenarios:
post_action_output = run(kubeconfig_path, failed_scenario[0], failed_scenario[1])
if post_action_output is not False:
failed_post_scenarios.remove(failed_scenario)
else:
logging.info("Post action scenario " + str(failed_scenario) + " is still failing")
# check post actions
if len(scenario) > 1:
post_action_output = run(kubeconfig_path, scenario[1], pre_action_output)
if post_action_output is False:
failed_post_scenarios.append([scenario[1], pre_action_output])
return failed_post_scenarios


@@ -1,88 +0,0 @@
import datetime
import os.path
from typing import Optional
import urllib3
import logging
import sys
import yaml
from krkn_lib.models.krkn import ChaosRunAlertSummary, ChaosRunAlert
from krkn_lib.prometheus.krkn_prometheus import KrknPrometheus
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
def alerts(prom_cli: KrknPrometheus, start_time, end_time, alert_profile):
if alert_profile is None or os.path.exists(alert_profile) is False:
logging.error(f"{alert_profile} alert profile does not exist")
sys.exit(1)
with open(alert_profile) as profile:
profile_yaml = yaml.safe_load(profile)
if not isinstance(profile_yaml, list):
logging.error(f"{alert_profile} wrong file format, alert profile must be "
f"a valid yaml file containing a list of items with 3 properties: "
f"expr, description, severity" )
sys.exit(1)
for alert in profile_yaml:
if sorted(alert.keys()) != sorted(["expr", "description", "severity"]):
logging.error(f"wrong alert {alert}, skipping")
continue
prom_cli.process_alert(alert,
datetime.datetime.fromtimestamp(start_time),
datetime.datetime.fromtimestamp(end_time))
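The key check in `alerts` above is easy to get wrong because `list.sort()` sorts in place and returns `None`, so comparing its result can never detect a malformed entry. A set-based sketch of the same validation:

```python
# Sketch of the alert-profile entry validation: an entry must carry
# exactly expr, description and severity. Set comparison avoids the
# list.sort()-returns-None pitfall.
REQUIRED_ALERT_KEYS = {"expr", "description", "severity"}

def valid_alert_entry(alert):
    return set(alert.keys()) == REQUIRED_ALERT_KEYS
```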
def critical_alerts(prom_cli: KrknPrometheus,
summary: ChaosRunAlertSummary,
run_id,
scenario,
start_time,
end_time):
summary.scenario = scenario
summary.run_id = run_id
query = r"""ALERTS{severity="critical"}"""
logging.info("Checking for critical alerts firing post chaos")
during_critical_alerts = prom_cli.process_prom_query_in_range(
query,
start_time=datetime.datetime.fromtimestamp(start_time),
end_time=datetime.datetime.fromtimestamp(end_time)
)
for alert in during_critical_alerts:
if "metric" in alert:
alertname = alert["metric"]["alertname"] if "alertname" in alert["metric"] else "none"
alertstate = alert["metric"]["alertstate"] if "alertstate" in alert["metric"] else "none"
namespace = alert["metric"]["namespace"] if "namespace" in alert["metric"] else "none"
severity = alert["metric"]["severity"] if "severity" in alert["metric"] else "none"
alert = ChaosRunAlert(alertname, alertstate, namespace, severity)
summary.chaos_alerts.append(alert)
post_critical_alerts = prom_cli.process_query(
query
)
for alert in post_critical_alerts:
if "metric" in alert:
alertname = alert["metric"]["alertname"] if "alertname" in alert["metric"] else "none"
alertstate = alert["metric"]["alertstate"] if "alertstate" in alert["metric"] else "none"
namespace = alert["metric"]["namespace"] if "namespace" in alert["metric"] else "none"
severity = alert["metric"]["severity"] if "severity" in alert["metric"] else "none"
alert = ChaosRunAlert(alertname, alertstate, namespace, severity)
summary.post_chaos_alerts.append(alert)
during_critical_alerts_count = len(during_critical_alerts)
post_critical_alerts_count = len(post_critical_alerts)
firing_alerts = False
if during_critical_alerts_count > 0:
firing_alerts = True
if post_critical_alerts_count > 0:
firing_alerts = True
if not firing_alerts:
logging.info("No critical alerts are firing!!")


@@ -1,374 +0,0 @@
import logging
import random
import re
import time
import yaml
from ..cerberus import setup as cerberus
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
# krkn_lib
def run(scenarios_list, config, wait_duration, kubecli: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
"""
Reads the scenario config and creates a temp file to fill up the PVC
"""
failed_post_scenarios = ""
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
for app_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = app_config
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, app_config)
try:
if len(app_config) > 1:
with open(app_config, "r") as f:
config_yaml = yaml.full_load(f)
scenario_config = config_yaml["pvc_scenario"]
pvc_name = get_yaml_item_value(
scenario_config, "pvc_name", ""
)
pod_name = get_yaml_item_value(
scenario_config, "pod_name", ""
)
namespace = get_yaml_item_value(
scenario_config, "namespace", ""
)
target_fill_percentage = get_yaml_item_value(
scenario_config, "fill_percentage", "50"
)
duration = get_yaml_item_value(
scenario_config, "duration", 60
)
logging.info(
"Input params:\n"
"pvc_name: '%s'\n"
"pod_name: '%s'\n"
"namespace: '%s'\n"
"target_fill_percentage: '%s%%'\nduration: '%ss'"
% (
str(pvc_name),
str(pod_name),
str(namespace),
str(target_fill_percentage),
str(duration)
)
)
# Check input params
if namespace is None:
logging.error(
"You must specify the namespace where the PVC is"
)
#sys.exit(1)
raise RuntimeError()
if pvc_name is None and pod_name is None:
logging.error(
"You must specify the pvc_name or the pod_name"
)
# sys.exit(1)
raise RuntimeError()
if pvc_name and pod_name:
logging.info(
"pod_name will be ignored; the pod used will be "
"retrieved from the pods mounting the pvc_name PVC"
)
# Get pod name
if pvc_name:
if pod_name:
logging.info(
"pod_name '%s' will be overridden with one of "
"the pods mounted in the PVC" % (str(pod_name))
)
pvc = kubecli.get_pvc_info(pvc_name, namespace)
try:
# random generator not used for
# security/cryptographic purposes.
pod_name = random.choice(pvc.podNames) # nosec
logging.info("Pod name: %s" % pod_name)
except Exception:
logging.error(
"Pod associated with %s PVC, on namespace %s, "
"not found" % (str(pvc_name), str(namespace))
)
# sys.exit(1)
raise RuntimeError()
# Get volume name
pod = kubecli.get_pod_info(name=pod_name, namespace=namespace)
if pod is None:
logging.error(
"Exiting as pod '%s' doesn't exist "
"in namespace '%s'" % (
str(pod_name),
str(namespace)
)
)
# sys.exit(1)
raise RuntimeError()
for volume in pod.volumes:
if volume.pvcName is not None:
volume_name = volume.name
pvc_name = volume.pvcName
pvc = kubecli.get_pvc_info(pvc_name, namespace)
break
if 'pvc' not in locals():
logging.error(
"Pod '%s' in namespace '%s' does not use a pvc" % (
str(pod_name),
str(namespace)
)
)
# sys.exit(1)
raise RuntimeError()
logging.info("Volume name: %s" % volume_name)
logging.info("PVC name: %s" % pvc_name)
# Get container name and mount path
for container in pod.containers:
for vol in container.volumeMounts:
if vol.name == volume_name:
mount_path = vol.mountPath
container_name = container.name
break
logging.info("Container path: %s" % container_name)
logging.info("Mount path: %s" % mount_path)
# Get PVC capacity and used bytes
command = "df %s -B 1024 | sed 1d" % (str(mount_path))
command_output = (
kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name
)
).split()
pvc_used_kb = int(command_output[2])
pvc_capacity_kb = pvc_used_kb + int(command_output[3])
logging.info("PVC used: %s KB" % pvc_used_kb)
logging.info("PVC capacity: %s KB" % pvc_capacity_kb)
# Check valid fill percentage
current_fill_percentage = pvc_used_kb / pvc_capacity_kb
if not (
current_fill_percentage * 100
< float(target_fill_percentage)
<= 99
):
logging.error(
"Target fill percentage (%.2f%%) is lower than "
"current fill percentage (%.2f%%) "
"or higher than 99%%" % (
target_fill_percentage,
current_fill_percentage * 100
)
)
# sys.exit(1)
raise RuntimeError()
# Calculate file size
file_size_kb = int(
(
float(
target_fill_percentage / 100
) * float(pvc_capacity_kb)
) - float(pvc_used_kb)
)
logging.debug("File size: %s KB" % file_size_kb)
file_name = "kraken.tmp"
logging.info(
"Creating %s file, %s KB size, in pod %s at %s (ns %s)"
% (
str(file_name),
str(file_size_kb),
str(pod_name),
str(mount_path),
str(namespace)
)
)
start_time = int(time.time())
# Create temp file in the PVC
full_path = "%s/%s" % (str(mount_path), str(file_name))
command = "fallocate -l $((%s*1024)) %s" % (
str(file_size_kb),
str(full_path)
)
logging.debug(
"Create temp file in the PVC command:\n %s" % command
)
kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name,
)
# Check if file is created
command = "ls -lh %s" % (str(mount_path))
logging.debug("Check file is created command:\n %s" % command)
response = kubecli.exec_cmd_in_pod(
command, pod_name, namespace, container_name
)
logging.info("\n" + str(response))
if str(file_name).lower() in str(response).lower():
logging.info(
"%s file successfully created" % (str(full_path))
)
else:
logging.error(
"Failed to create tmp file with %s size" % (
str(file_size_kb)
)
)
remove_temp_file(
file_name,
full_path,
pod_name,
namespace,
container_name,
mount_path,
file_size_kb,
kubecli
)
# sys.exit(1)
raise RuntimeError()
if str(file_name).lower() in str(response).lower():
logging.info(
"Waiting for the specified duration in the config: %ss" % (
duration
)
)
time.sleep(duration)
logging.info("Finish waiting")
remove_temp_file(
file_name,
full_path,
pod_name,
namespace,
container_name,
mount_path,
file_size_kb,
kubecli
)
logging.info("End of scenario. Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
except Exception:
scenario_telemetry.exit_status = 1
failed_scenarios.append(app_config)
log_exception(app_config)
else:
scenario_telemetry.exit_status = 0
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries
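The scenario above sizes the temporary file as the difference between the target fill level and current usage. A standalone sketch of that arithmetic (function name is illustrative):

```python
# Sketch of the fill-size arithmetic used in the PVC scenario: the file
# size in KB that brings usage up to target_fill_percentage of capacity.
def fill_file_size_kb(target_fill_percentage, pvc_capacity_kb, pvc_used_kb):
    return int(
        (float(target_fill_percentage) / 100) * float(pvc_capacity_kb)
        - float(pvc_used_kb)
    )
```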
# krkn_lib
def remove_temp_file(
file_name,
full_path,
pod_name,
namespace,
container_name,
mount_path,
file_size_kb,
kubecli: KrknKubernetes
):
command = "rm -f %s" % (str(full_path))
logging.debug("Remove temp file from the PVC command:\n %s" % command)
kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name)
command = "ls -lh %s" % (str(mount_path))
logging.debug("Check temp file is removed command:\n %s" % command)
response = kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name
)
logging.info("\n" + str(response))
if not (str(file_name).lower() in str(response).lower()):
logging.info("Temp file successfully removed")
else:
logging.error(
"Failed to delete tmp file with %s size" % (str(file_size_kb))
)
raise RuntimeError()
def toKbytes(value):
if not re.match("^[0-9]+[KMGT]i$", value):
logging.error(
"PVC capacity %s does not match "
"regexp '^[0-9]+[KMGT]i$'" % value
)
raise RuntimeError()
unit = {"K": 0, "M": 1, "G": 2, "T": 3}
base = 1024 if ("i" in value) else 1000
exp = unit[value[-2:-1]]
res = int(value[:-2]) * (base**exp)
return res
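For reference, the conversion done by `toKbytes` can be sketched as a standalone helper; this is a minimal rewrite (renamed `to_kbytes` to avoid clashing with the original) that maps a Kubernetes capacity string to its value in KiB:

```python
import re

def to_kbytes(value: str) -> int:
    """Convert a Kubernetes capacity string (e.g. '512Mi') to kibibytes."""
    if not re.match(r"^[0-9]+[KMGT]i$", value):
        raise ValueError(f"capacity {value!r} does not match '^[0-9]+[KMGT]i$'")
    # value[-2] is the unit letter; the trailing 'i' marks binary (base-1024) units
    exponent = {"K": 0, "M": 1, "G": 2, "T": 3}[value[-2]]
    return int(value[:-2]) * (1024 ** exponent)

# '512Mi' is 512 * 1024 KiB
print(to_kbytes("512Mi"))  # 524288
```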


@@ -1,325 +0,0 @@
import time
import random
import logging
import kraken.cerberus.setup as cerberus
import kraken.post_actions.actions as post_actions
import yaml
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import get_yaml_item_value, log_exception
def delete_objects(kubecli, namespace):
services = delete_all_services_namespace(kubecli, namespace)
daemonsets = delete_all_daemonset_namespace(kubecli, namespace)
statefulsets = delete_all_statefulsets_namespace(kubecli, namespace)
replicasets = delete_all_replicaset_namespace(kubecli, namespace)
deployments = delete_all_deployment_namespace(kubecli, namespace)
objects = { "daemonsets": daemonsets,
"deployments": deployments,
"replicasets": replicasets,
"statefulsets": statefulsets,
"services": services
}
return objects
def get_list_running_pods(kubecli: KrknKubernetes, namespace: str):
running_pods = []
pods = kubecli.list_pods(namespace)
for pod in pods:
pod_status = kubecli.get_pod_info(pod, namespace)
if pod_status and pod_status.status == "Running":
running_pods.append(pod)
logging.info('all running pods ' + str(running_pods))
return running_pods
def delete_all_deployment_namespace(kubecli: KrknKubernetes, namespace: str):
"""
Delete all the deployments in the specified namespace
:param kubecli: krkn kubernetes python package
:param namespace: namespace
"""
try:
deployments = kubecli.get_deployment_ns(namespace)
for deployment in deployments:
logging.info("Deleting deployment" + deployment)
kubecli.delete_deployment(deployment, namespace)
except Exception as e:
logging.error(
"Exception when calling delete_all_deployment_namespace: %s\n",
str(e),
)
raise e
return deployments
def delete_all_daemonset_namespace(kubecli: KrknKubernetes, namespace: str):
"""
Delete all the daemonset in the specified namespace
:param kubecli: krkn kubernetes python package
:param namespace: namespace
"""
try:
daemonsets = kubecli.get_daemonset(namespace)
for daemonset in daemonsets:
logging.info("Deleting daemonset" + daemonset)
kubecli.delete_daemonset(daemonset, namespace)
except Exception as e:
logging.error(
"Exception when calling delete_all_daemonset_namespace: %s\n",
str(e),
)
raise e
return daemonsets
def delete_all_statefulsets_namespace(kubecli: KrknKubernetes, namespace: str):
"""
Delete all the statefulsets in the specified namespace
:param kubecli: krkn kubernetes python package
:param namespace: namespace
"""
try:
statefulsets = kubecli.get_all_statefulset(namespace)
for statefulset in statefulsets:
logging.info("Deleting statefulsets" + statefulsets)
kubecli.delete_statefulset(statefulset, namespace)
except Exception as e:
logging.error(
"Exception when calling delete_all_statefulsets_namespace: %s\n",
str(e),
)
raise e
return statefulsets
def delete_all_replicaset_namespace(kubecli: KrknKubernetes, namespace: str):
"""
Delete all the replicasets in the specified namespace
:param kubecli: krkn kubernetes python package
:param namespace: namespace
"""
try:
replicasets = kubecli.get_all_replicasets(namespace)
for replicaset in replicasets:
logging.info("Deleting replicaset" + replicaset)
kubecli.delete_replicaset(replicaset, namespace)
except Exception as e:
logging.error(
"Exception when calling delete_all_replicaset_namespace: %s\n",
str(e),
)
raise e
return replicasets
def delete_all_services_namespace(kubecli: KrknKubernetes, namespace: str):
"""
Delete all the services in the specified namespace
:param kubecli: krkn kubernetes python package
:param namespace: namespace
"""
try:
services = kubecli.get_all_services(namespace)
for service in services:
logging.info("Deleting services" + service)
kubecli.delete_services(service, namespace)
except Exception as e:
logging.error(
"Exception when calling delete_all_services_namespace: %s\n",
str(e),
)
raise e
return services
# krkn_lib
def run(
scenarios_list,
config,
wait_duration,
failed_post_scenarios,
kubeconfig_path,
kubecli: KrknKubernetes,
telemetry: KrknTelemetryKubernetes
) -> (list[str], list[ScenarioTelemetry]):
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
for scenario_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = scenario_config[0]
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, scenario_config[0])
try:
if len(scenario_config) > 1:
pre_action_output = post_actions.run(kubeconfig_path, scenario_config[1])
else:
pre_action_output = ""
with open(scenario_config[0], "r") as f:
scenario_config_yaml = yaml.full_load(f)
for scenario in scenario_config_yaml["scenarios"]:
scenario_namespace = get_yaml_item_value(
scenario, "namespace", ""
)
scenario_label = get_yaml_item_value(
scenario, "label_selector", ""
)
if scenario_namespace is not None and scenario_namespace.strip() != "":
if scenario_label is not None and scenario_label.strip() != "":
logging.error("You can only have namespace or label set in your namespace scenario")
logging.error(
"Current scenario config has namespace '%s' and label selector '%s'"
% (scenario_namespace, scenario_label)
)
logging.error(
"Please set either namespace to blank ('') or label_selector to blank ('') to continue"
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
delete_count = get_yaml_item_value(
scenario, "delete_count", 1
)
run_count = get_yaml_item_value(scenario, "runs", 1)
run_sleep = get_yaml_item_value(scenario, "sleep", 10)
wait_time = get_yaml_item_value(scenario, "wait_time", 30)
logging.info(
"namespace: %s, label_selector: %s, delete_count: %s, "
"runs: %s, sleep: %s, wait_time: %s" % (
scenario_namespace, scenario_label, delete_count,
run_count, run_sleep, wait_time
)
)
logging.info("done")
start_time = int(time.time())
for i in range(run_count):
killed_namespaces = {}
namespaces = kubecli.check_namespaces([scenario_namespace], scenario_label)
for j in range(delete_count):
if len(namespaces) == 0:
logging.error(
"Couldn't delete %s namespaces, not enough namespaces matching %s with label %s"
% (str(delete_count), scenario_namespace, str(scenario_label))
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
selected_namespace = namespaces[random.randint(0, len(namespaces) - 1)]
logging.info('Delete objects in selected namespace: ' + selected_namespace)
try:
# delete all pods in namespace
objects = delete_objects(kubecli, selected_namespace)
killed_namespaces[selected_namespace] = objects
logging.info("Deleted all objects in namespace %s was successful" % str(selected_namespace))
except Exception as e:
logging.info("Delete all objects in namespace %s was unsuccessful" % str(selected_namespace))
logging.info("Namespace action error: " + str(e))
raise RuntimeError()
namespaces.remove(selected_namespace)
logging.info("Waiting %s seconds between namespace deletions" % str(run_sleep))
time.sleep(run_sleep)
logging.info("Waiting for the specified duration: %s" % wait_duration)
time.sleep(wait_duration)
if len(scenario_config) > 1:
try:
failed_post_scenarios = post_actions.check_recovery(
kubeconfig_path, scenario_config, failed_post_scenarios, pre_action_output
)
except Exception as e:
logging.error("Failed to run post action checks: %s" % e)
# removed_exit
# sys.exit(1)
raise RuntimeError()
else:
failed_post_scenarios = check_all_running_deployment(killed_namespaces, wait_time, kubecli)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
except (Exception, RuntimeError):
scenario_telemetry.exit_status = 1
failed_scenarios.append(scenario_config[0])
log_exception(scenario_config[0])
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries
def check_all_running_pods(kubecli: KrknKubernetes, namespace_name, wait_time):
timer = 0
while timer < wait_time:
pod_list = kubecli.list_pods(namespace_name)
pods_running = 0
for pod in pod_list:
pod_info = kubecli.get_pod_info(pod, namespace_name)
if pod_info.status != "Running" and pod_info.status != "Succeeded":
logging.info("Pods %s still not running or completed" % pod_info.name)
break
pods_running += 1
if len(pod_list) == pods_running:
break
timer += 5
time.sleep(5)
logging.info("Waiting 5 seconds for pods to become active")
# krkn_lib
def check_all_running_deployment(killed_namespaces, wait_time, kubecli: KrknKubernetes):
timer = 0
while timer < wait_time and killed_namespaces:
still_missing_ns = killed_namespaces.copy()
for namespace_name, objects in killed_namespaces.items():
still_missing_obj = objects.copy()
for obj_name, obj_list in objects.items():
if "deployments" == obj_name:
deployments = kubecli.get_deployment_ns(namespace_name)
if len(obj_list) == len(deployments):
still_missing_obj.pop(obj_name)
elif "replicasets" == obj_name:
replicasets = kubecli.get_all_replicasets(namespace_name)
if len(obj_list) == len(replicasets):
still_missing_obj.pop(obj_name)
elif "statefulsets" == obj_name:
statefulsets = kubecli.get_all_statefulset(namespace_name)
if len(obj_list) == len(statefulsets):
still_missing_obj.pop(obj_name)
elif "services" == obj_name:
services = kubecli.get_all_services(namespace_name)
if len(obj_list) == len(services):
still_missing_obj.pop(obj_name)
elif "daemonsets" == obj_name:
daemonsets = kubecli.get_daemonset(namespace_name)
if len(obj_list) == len(daemonsets):
still_missing_obj.pop(obj_name)
logging.info("Still missing objects " + str(still_missing_obj))
killed_namespaces[namespace_name] = still_missing_obj.copy()
if len(killed_namespaces[namespace_name].keys()) == 0:
logging.info("Wait for pods to become running for namespace: " + namespace_name)
check_all_running_pods(kubecli, namespace_name, wait_time)
still_missing_ns.pop(namespace_name)
killed_namespaces = still_missing_ns
if len(killed_namespaces.keys()) == 0:
return []
timer += 10
time.sleep(10)
logging.info("Waiting 10 seconds for objects in namespaces to become active")
logging.error("Objects are still not ready after waiting " + str(wait_time) + "seconds")
logging.error("Non active namespaces " + str(killed_namespaces))
return killed_namespaces
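The recovery check above follows a common pattern: poll each namespace, drop it from the pending set once its object counts are back, and report what never recovered when the timeout elapses. A minimal generic sketch of that pattern (`wait_for_counts` and `current_count` are hypothetical names for illustration, not part of krkn):

```python
import time

def wait_for_counts(expected, current_count, wait_time, poll=0.1):
    """Poll until current_count(ns) reaches expected[ns] for every namespace,
    or until wait_time seconds elapse. Returns the namespaces that never
    recovered; an empty list signals full recovery."""
    pending = dict(expected)
    elapsed = 0.0
    while pending and elapsed < wait_time:
        for ns in list(pending):
            if current_count(ns) >= pending[ns]:
                pending.pop(ns)  # this namespace is back to its expected count
        if not pending:
            return []
        time.sleep(poll)
        elapsed += poll
    return list(pending)

counts = {"team-a": 0}
def fake_count(ns):
    counts[ns] += 1          # each poll "recovers" one more object
    return counts[ns]

print(wait_for_counts({"team-a": 3}, fake_count, wait_time=2))       # []
print(wait_for_counts({"team-b": 1}, lambda ns: 0, wait_time=0.3))   # ['team-b']
```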


@@ -1,90 +0,0 @@
import logging
import time
import yaml
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
def run(scenarios_list: list[str], wait_duration: int, krkn_lib: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
scenario_telemetries: list[ScenarioTelemetry] = []
failed_post_scenarios = []
for scenario in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = scenario
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, scenario)
with open(scenario) as stream:
scenario_config = yaml.safe_load(stream)
service_name = scenario_config['service_name']
service_namespace = scenario_config['service_namespace']
plan = scenario_config["plan"]
image = scenario_config["image"]
target_port = scenario_config["service_target_port"]
chaos_duration = scenario_config["chaos_duration"]
logging.info(f"checking service {service_name} in namespace: {service_namespace}")
if not krkn_lib.service_exists(service_name, service_namespace):
logging.error(f"service: {service_name} not found in namespace: {service_namespace}, failed to run scenario.")
fail(scenario_telemetry, scenario_telemetries)
failed_post_scenarios.append(scenario)
break
try:
logging.info(f"service: {service_name} found in namespace: {service_namespace}")
logging.info(f"creating webservice and initializing test plan...")
# both named ports and port numbers can be used
if isinstance(target_port, int):
logging.info(f"webservice will listen on port {target_port}")
webservice = krkn_lib.deploy_service_hijacking(service_namespace, plan, image, port_number=target_port)
else:
logging.info(f"traffic will be redirected to named port: {target_port}")
webservice = krkn_lib.deploy_service_hijacking(service_namespace, plan, image, port_name=target_port)
logging.info(f"successfully deployed pod: {webservice.pod_name} "
f"in namespace:{service_namespace} with selector {webservice.selector}!"
)
logging.info(f"patching service: {service_name} to hijack traffic towards: {webservice.pod_name}")
original_service = krkn_lib.replace_service_selector([webservice.selector], service_name, service_namespace)
if original_service is None:
logging.error(f"failed to patch service: {service_name}, namespace: {service_namespace} with selector {webservice.selector}")
fail(scenario_telemetry, scenario_telemetries)
failed_post_scenarios.append(scenario)
break
logging.info(f"service: {service_name} successfully patched!")
logging.info(f"original service manifest:\n\n{yaml.dump(original_service)}")
logging.info(f"waiting {chaos_duration} before restoring the service")
time.sleep(chaos_duration)
selectors = ["=".join([key, original_service["spec"]["selector"][key]]) for key in original_service["spec"]["selector"].keys()]
logging.info(f"restoring the service selectors {selectors}")
original_service = krkn_lib.replace_service_selector(selectors, service_name, service_namespace)
if original_service is None:
logging.error(f"failed to restore original service: {service_name}, namespace: {service_namespace} with selectors: {selectors}")
fail(scenario_telemetry, scenario_telemetries)
failed_post_scenarios.append(scenario)
break
logging.info("selectors successfully restored")
logging.info("undeploying service-hijacking resources...")
krkn_lib.undeploy_service_hijacking(webservice)
logging.info("End of scenario. Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
logging.info("success")
except Exception as e:
logging.error(f"scenario {scenario} failed with exception: {e}")
fail(scenario_telemetry, scenario_telemetries)
failed_post_scenarios.append(scenario)
return failed_post_scenarios, scenario_telemetries
def fail(scenario_telemetry: ScenarioTelemetry, scenario_telemetries: list[ScenarioTelemetry]):
scenario_telemetry.exit_status = 1
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)


@@ -1,191 +0,0 @@
#!/usr/bin/env python
import yaml
import logging
import time
from multiprocessing.pool import ThreadPool
from ..cerberus import setup as cerberus
from ..post_actions import actions as post_actions
from ..node_actions.aws_node_scenarios import AWS
from ..node_actions.openstack_node_scenarios import OPENSTACKCLOUD
from ..node_actions.az_node_scenarios import Azure
from ..node_actions.gcp_node_scenarios import GCP
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import log_exception
def multiprocess_nodes(cloud_object_function, nodes, processes=0):
try:
# pool object with number of element
if processes == 0:
pool = ThreadPool(processes=len(nodes))
else:
pool = ThreadPool(processes=processes)
logging.info("nodes type " + str(type(nodes[0])))
if type(nodes[0]) is tuple:
node_id = []
node_info = []
for node in nodes:
node_id.append(node[0])
node_info.append(node[1])
logging.info("node id " + str(node_id))
logging.info("node info" + str(node_info))
pool.starmap(cloud_object_function, zip(node_info, node_id))
else:
logging.info("pool type" + str(type(nodes)))
pool.map(cloud_object_function, nodes)
pool.close()
except Exception as e:
logging.info("Error on pool multiprocessing: " + str(e))
# Inject the cluster shut down scenario
# krkn_lib
def cluster_shut_down(shut_down_config, kubecli: KrknKubernetes):
runs = shut_down_config["runs"]
shut_down_duration = shut_down_config["shut_down_duration"]
cloud_type = shut_down_config["cloud_type"]
timeout = shut_down_config["timeout"]
processes = 0
if cloud_type.lower() == "aws":
cloud_object = AWS()
elif cloud_type.lower() == "gcp":
cloud_object = GCP()
processes = 1
elif cloud_type.lower() == "openstack":
cloud_object = OPENSTACKCLOUD()
elif cloud_type.lower() in ["azure", "az"]:
cloud_object = Azure()
else:
logging.error(
"Cloud type %s is not currently supported for cluster shut down" %
cloud_type
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
nodes = kubecli.list_nodes()
node_id = []
for node in nodes:
instance_id = cloud_object.get_instance_id(node)
node_id.append(instance_id)
logging.info("node id list " + str(node_id))
for _ in range(runs):
logging.info("Starting cluster_shut_down scenario injection")
stopping_nodes = set(node_id)
multiprocess_nodes(cloud_object.stop_instances, node_id, processes)
stopped_nodes = stopping_nodes.copy()
while len(stopping_nodes) > 0:
for node in stopping_nodes:
if type(node) is tuple:
node_status = cloud_object.wait_until_stopped(
node[1],
node[0],
timeout
)
else:
node_status = cloud_object.wait_until_stopped(
node,
timeout
)
# Only want to remove node from stopping list
# when fully stopped/no error
if node_status:
stopped_nodes.remove(node)
stopping_nodes = stopped_nodes.copy()
logging.info(
"Shutting down the cluster for the specified duration: %s" %
(shut_down_duration)
)
time.sleep(shut_down_duration)
logging.info("Restarting the nodes")
restarted_nodes = set(node_id)
multiprocess_nodes(cloud_object.start_instances, node_id, processes)
logging.info("Wait for each node to be running again")
not_running_nodes = restarted_nodes.copy()
while len(not_running_nodes) > 0:
for node in not_running_nodes:
if type(node) is tuple:
node_status = cloud_object.wait_until_running(
node[1],
node[0],
timeout
)
else:
node_status = cloud_object.wait_until_running(
node,
timeout
)
if node_status:
restarted_nodes.remove(node)
not_running_nodes = restarted_nodes.copy()
logging.info(
"Waiting for 150s to allow cluster component initialization"
)
time.sleep(150)
logging.info("Successfully injected cluster_shut_down scenario!")
# krkn_lib
def run(scenarios_list, config, wait_duration, kubecli: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
failed_post_scenarios = []
failed_scenarios = []
scenario_telemetries: list[ScenarioTelemetry] = []
for shut_down_config in scenarios_list:
config_path = shut_down_config
pre_action_output = ""
if isinstance(shut_down_config, list):
if len(shut_down_config) == 0:
raise Exception("bad config file format for shutdown scenario")
config_path = shut_down_config[0]
if len(shut_down_config) > 1:
pre_action_output = post_actions.run("", shut_down_config[1])
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = config_path
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, config_path)
with open(config_path, "r") as f:
shut_down_config_yaml = yaml.full_load(f)
shut_down_config_scenario = \
shut_down_config_yaml["cluster_shut_down_scenario"]
start_time = int(time.time())
try:
cluster_shut_down(shut_down_config_scenario, kubecli)
logging.info(
"Waiting for the specified duration: %s" % (wait_duration)
)
time.sleep(wait_duration)
failed_post_scenarios = post_actions.check_recovery(
"", shut_down_config, failed_post_scenarios, pre_action_output
)
end_time = int(time.time())
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
except (RuntimeError, Exception):
log_exception(config_path)
failed_scenarios.append(config_path)
scenario_telemetry.exit_status = 1
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries


@@ -1 +0,0 @@
from .syn_flood import *


@@ -1,132 +0,0 @@
import logging
import os.path
import time
from typing import List
import krkn_lib.utils
import yaml
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
def run(scenarios_list: list[str], krkn_kubernetes: KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
scenario_telemetries: list[ScenarioTelemetry] = []
failed_post_scenarios = []
for scenario in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = scenario
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, scenario)
try:
pod_names = []
config = parse_config(scenario)
if config["target-service-label"]:
target_services = krkn_kubernetes.select_service_by_label(config["namespace"], config["target-service-label"])
else:
target_services = [config["target-service"]]
for target in target_services:
if not krkn_kubernetes.service_exists(target, config["namespace"]):
raise Exception(f"{target} service not found")
for i in range(config["number-of-pods"]):
pod_name = "syn-flood-" + krkn_lib.utils.get_random_string(10)
krkn_kubernetes.deploy_syn_flood(pod_name,
config["namespace"],
config["image"],
target,
config["target-port"],
config["packet-size"],
config["window-size"],
config["duration"],
config["attacker-nodes"]
)
pod_names.append(pod_name)
logging.info("waiting all the attackers to finish:")
did_finish = False
finished_pods = []
while not did_finish:
for pod_name in pod_names:
if not krkn_kubernetes.is_pod_running(pod_name, config["namespace"]):
finished_pods.append(pod_name)
if set(pod_names) == set(finished_pods):
did_finish = True
time.sleep(1)
except Exception as e:
logging.error(f"Failed to run syn flood scenario {scenario}: {e}")
failed_post_scenarios.append(scenario)
scenario_telemetry.exit_status = 1
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_post_scenarios, scenario_telemetries
def parse_config(scenario_file: str) -> dict[str,any]:
if not os.path.exists(scenario_file):
raise Exception(f"failed to load scenario file {scenario_file}")
try:
with open(scenario_file) as stream:
config = yaml.safe_load(stream)
except Exception:
raise Exception(f"{scenario_file} is not a valid yaml file")
missing = []
if not check_key_value(config, "packet-size"):
missing.append("packet-size")
if not check_key_value(config, "window-size"):
missing.append("window-size")
if not check_key_value(config, "duration"):
missing.append("duration")
if not check_key_value(config, "namespace"):
missing.append("namespace")
if not check_key_value(config, "number-of-pods"):
missing.append("number-of-pods")
if not check_key_value(config, "target-port"):
missing.append("target-port")
if not check_key_value(config, "image"):
missing.append("image")
if "target-service" not in config.keys():
missing.append("target-service")
if "target-service-label" not in config.keys():
missing.append("target-service-label")
if len(missing) > 0:
raise Exception(f"{', '.join(missing)} parameter(s) are missing")
if not config["target-service"] and not config["target-service-label"]:
raise Exception("you have either to set a target service or a label")
if config["target-service"] and config["target-service-label"]:
raise Exception("you cannot select both target-service and target-service-label")
if config.get("attacker-nodes") is not None and not is_node_affinity_correct(config["attacker-nodes"]):
raise Exception("attacker-nodes format is not correct")
return config
def check_key_value(dictionary, key):
if key in dictionary:
value = dictionary[key]
if value is not None and value != '':
return True
return False
def is_node_affinity_correct(obj) -> bool:
if not isinstance(obj, dict):
return False
for key in obj.keys():
if not isinstance(key, str):
return False
if not isinstance(obj[key], list):
return False
return True
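The two validators above are small enough to exercise in isolation; this is a minimal self-contained sketch of their semantics (condensed but behaviorally equivalent: a key counts only when present with a non-empty, non-None value, and `attacker-nodes` must map string label keys to lists):

```python
def check_key_value(dictionary, key):
    """True only when the key exists and its value is neither None nor ''."""
    value = dictionary.get(key)
    return value is not None and value != ""

def is_node_affinity_correct(obj):
    """attacker-nodes must be a dict of node-label keys (str) to lists."""
    if not isinstance(obj, dict):
        return False
    return all(isinstance(k, str) and isinstance(v, list) for k, v in obj.items())

config = {"duration": 120, "namespace": "", "image": None}
print(check_key_value(config, "duration"))   # True
print(check_key_value(config, "namespace"))  # False: empty string
print(check_key_value(config, "image"))      # False: None
print(is_node_affinity_correct({"kubernetes.io/hostname": ["node-1"]}))  # True
print(is_node_affinity_correct({"zone": "a"}))                           # False
```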


@@ -1,388 +0,0 @@
import datetime
import time
import logging
import re
import yaml
import random
from krkn_lib import utils
from kubernetes.client import ApiException
from ..cerberus import setup as cerberus
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import get_yaml_item_value, log_exception, get_random_string
# krkn_lib
def pod_exec(pod_name, command, namespace, container_name, kubecli:KrknKubernetes):
for i in range(5):
response = kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name
)
if not response:
time.sleep(2)
continue
elif (
"unauthorized" in response.lower() or
"authorization" in response.lower()
):
time.sleep(2)
continue
else:
break
return response
# krkn_lib
def get_container_name(pod_name, namespace, kubecli:KrknKubernetes, container_name=""):
container_names = kubecli.get_containers_in_pod(pod_name, namespace)
if container_name != "":
if container_name in container_names:
return container_name
else:
logging.error(
"Container name %s not an existing container in pod %s" % (
container_name,
pod_name
)
)
else:
container_name = container_names[
# random module here is not used for security/cryptographic
# purposes
random.randint(0, len(container_names) - 1) # nosec
]
return container_name
def skew_node(node_name: str, action: str, kubecli: KrknKubernetes):
pod_namespace = "default"
status_pod_name = f"time-skew-pod-{get_random_string(5)}"
skew_pod_name = f"time-skew-pod-{get_random_string(5)}"
ntp_enabled = True
logging.info(f'Creating pod to skew {"time" if action == "skew_time" else "date"} on node {node_name}')
status_command = ["timedatectl"]
param = "2001-01-01"
skew_command = ["timedatectl", "set-time"]
if action == "skew_time":
skew_command.append("01:01:01")
else:
skew_command.append("2001-01-01")
try:
status_response = kubecli.exec_command_on_node(node_name, status_command, status_pod_name, pod_namespace)
if "Network time on: no" in status_response:
ntp_enabled = False
logging.warning(f'ntp inactive on node {node_name}, skewing {"time" if action == "skew_time" else "date"} to {param}')
pod_exec(skew_pod_name, skew_command, pod_namespace, None, kubecli)
else:
logging.info(f'ntp active on node {node_name}, {"time" if action == "skew_time" else "date"} skewing will have no effect, skipping')
except ApiException:
pass
except Exception as e:
logging.error(f"failed to execute skew command in pod: {e}")
finally:
kubecli.delete_pod(status_pod_name, pod_namespace)
if not ntp_enabled:
kubecli.delete_pod(skew_pod_name, pod_namespace)
# krkn_lib
def skew_time(scenario, kubecli:KrknKubernetes):
if scenario["action"] not in ["skew_date","skew_time"]:
raise RuntimeError(f'{scenario["action"]} is not a valid time skew action')
if "node" in scenario["object_type"]:
node_names = []
if "object_name" in scenario.keys() and scenario["object_name"]:
node_names = scenario["object_name"]
elif (
"label_selector" in scenario.keys() and
scenario["label_selector"]
):
node_names = kubecli.list_nodes(scenario["label_selector"])
for node in node_names:
skew_node(node, scenario["action"], kubecli)
logging.info("Reset date/time on node " + str(node))
return "node", node_names
elif "pod" in scenario["object_type"]:
skew_command = "date --date "
if scenario["action"] == "skew_date":
skewed_date = "00-01-01"
skew_command += skewed_date
elif scenario["action"] == "skew_time":
skewed_time = "01:01:01"
skew_command += skewed_time
container_name = get_yaml_item_value(scenario, "container_name", "")
pod_names = []
if "object_name" in scenario.keys() and scenario["object_name"]:
for name in scenario["object_name"]:
if "namespace" not in scenario.keys():
logging.error("Need to set namespace when using pod name")
# removed_exit
# sys.exit(1)
raise RuntimeError()
pod_names.append([name, scenario["namespace"]])
elif "namespace" in scenario.keys() and scenario["namespace"]:
if "label_selector" not in scenario.keys():
logging.info(
"label_selector key not found, querying for all the pods "
"in namespace: %s" % (scenario["namespace"])
)
pod_names = kubecli.list_pods(scenario["namespace"])
else:
logging.info(
"Querying for the pods matching the %s label_selector "
"in namespace %s"
% (scenario["label_selector"], scenario["namespace"])
)
pod_names = kubecli.list_pods(
scenario["namespace"],
scenario["label_selector"]
)
counter = 0
for pod_name in pod_names:
pod_names[counter] = [pod_name, scenario["namespace"]]
counter += 1
elif (
"label_selector" in scenario.keys() and
scenario["label_selector"]
):
pod_names = kubecli.get_all_pods(scenario["label_selector"])
if len(pod_names) == 0:
logging.info(
"Cannot find pods matching the namespace/label_selector, "
"please check"
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
pod_counter = 0
for pod in pod_names:
if len(pod) > 1:
selected_container_name = get_container_name(
pod[0],
pod[1],
kubecli,
container_name,
)
pod_exec_response = pod_exec(
pod[0],
skew_command,
pod[1],
selected_container_name,
kubecli,
)
if pod_exec_response is False:
logging.error(
"Couldn't reset time on container %s "
"in pod %s in namespace %s"
% (selected_container_name, pod[0], pod[1])
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
pod_names[pod_counter].append(selected_container_name)
else:
selected_container_name = get_container_name(
pod,
scenario["namespace"],
kubecli,
container_name
)
pod_exec_response = pod_exec(
pod,
skew_command,
scenario["namespace"],
selected_container_name,
kubecli
)
if pod_exec_response is False:
logging.error(
"Couldn't reset time on container "
"%s in pod %s in namespace %s"
% (
selected_container_name,
pod,
scenario["namespace"]
)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
pod_names[pod_counter].append(selected_container_name)
logging.info("Reset date/time on pod " + str(pod[0]))
pod_counter += 1
return "pod", pod_names
# From kubectl/oc command get time output
def parse_string_date(obj_datetime):
try:
logging.info("Obj_date time " + str(obj_datetime))
obj_datetime = re.sub(r"\s\s+", " ", obj_datetime).strip()
logging.info("Obj_date sub time " + str(obj_datetime))
date_line = re.match(
r"[\s\S\n]*\w{3} \w{3} \d{1,} \d{2}:\d{2}:\d{2} \w{3} \d{4}[\s\S\n]*", # noqa
obj_datetime
)
if date_line is not None:
search_response = date_line.group().strip()
logging.info("Search response: " + str(search_response))
return search_response
else:
return ""
except Exception as e:
logging.info(
"Exception %s when trying to parse string to date" % str(e)
)
return ""
# Get date and time from string returned from OC
def string_to_date(obj_datetime):
obj_datetime = parse_string_date(obj_datetime)
try:
date_time_obj = datetime.datetime.strptime(
obj_datetime,
"%a %b %d %H:%M:%S %Z %Y"
)
return date_time_obj
except Exception:
logging.info("Couldn't parse string to datetime object")
return datetime.datetime(datetime.MINYEAR, 1, 1)
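The parsing path above expects the default output of the `date` command, e.g. `Mon Jan  1 01:01:01 UTC 2001`, collapses repeated whitespace, and falls back to a `MINYEAR` sentinel when parsing fails. A condensed, self-contained sketch of that behavior:

```python
import datetime
import re

def string_to_date(obj_datetime):
    """Parse the default `date` output (e.g. 'Mon Jan  1 01:01:01 UTC 2001')."""
    obj_datetime = re.sub(r"\s\s+", " ", obj_datetime).strip()
    try:
        return datetime.datetime.strptime(obj_datetime, "%a %b %d %H:%M:%S %Z %Y")
    except ValueError:
        # same fallback sentinel as the original: a date far in the past
        return datetime.datetime(datetime.MINYEAR, 1, 1)

print(string_to_date("Mon Jan  1 01:01:01 UTC 2001"))  # 2001-01-01 01:01:01
print(string_to_date("garbage"))                        # 0001-01-01 00:00:00
```

The sentinel matters downstream: `check_date_time` compares the parsed value against a window of real timestamps, so an unparseable response simply fails the comparison and triggers a retry.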
# krkn_lib
def check_date_time(object_type, names, kubecli:KrknKubernetes):
skew_command = "date"
not_reset = []
max_retries = 30
if object_type == "node":
for node_name in names:
first_date_time = datetime.datetime.utcnow()
check_pod_name = f"time-skew-pod-{get_random_string(5)}"
node_datetime_string = kubecli.exec_command_on_node(node_name, [skew_command], check_pod_name)
node_datetime = string_to_date(node_datetime_string)
counter = 0
while not (
first_date_time < node_datetime < datetime.datetime.utcnow()
):
time.sleep(10)
logging.info(
"Date/time on node %s still not reset, "
"waiting 10 seconds and retrying" % node_name
)
node_datetime_string = kubecli.exec_cmd_in_pod([skew_command], check_pod_name, "default")
node_datetime = string_to_date(node_datetime_string)
counter += 1
if counter > max_retries:
logging.error(
"Date and time in node %s didn't reset properly" %
node_name
)
not_reset.append(node_name)
break
if counter < max_retries:
logging.info(
"Date in node " + str(node_name) + " reset properly"
)
kubecli.delete_pod(check_pod_name)
elif object_type == "pod":
for pod_name in names:
first_date_time = datetime.datetime.utcnow()
counter = 0
pod_datetime_string = pod_exec(
pod_name[0],
skew_command,
pod_name[1],
pod_name[2],
kubecli
)
pod_datetime = string_to_date(pod_datetime_string)
while not (
first_date_time < pod_datetime < datetime.datetime.utcnow()
):
time.sleep(10)
logging.info(
"Date/time on pod %s still not reset, "
"waiting 10 seconds and retrying" % pod_name[0]
)
pod_datetime = pod_exec(
pod_name[0],
skew_command,
pod_name[1],
pod_name[2],
kubecli
)
pod_datetime = string_to_date(pod_datetime)
counter += 1
if counter > max_retries:
logging.error(
"Date and time in pod %s didn't reset properly" %
pod_name[0]
)
not_reset.append(pod_name[0])
break
if counter < max_retries:
logging.info(
"Date in pod " + str(pod_name[0]) + " reset properly"
)
return not_reset
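The wait loops above follow a bounded-retry pattern: poll a condition, sleep between attempts, and give up after a maximum number of retries. A minimal standalone sketch of that pattern (generic names, not part of krkn):

```python
import time

def wait_until(condition, retries=30, delay=0.01):
    """Poll condition() until it is truthy or `retries` attempts are exhausted.

    Returns True when the condition was met, False on exhaustion."""
    for _ in range(retries):
        if condition():
            return True
        time.sleep(delay)
    return False

# A condition that becomes true on the third check, like a clock
# slowly converging back after a time-skew scenario.
state = {"calls": 0}

def reset_done():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until(reset_done))  # True
```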
# krkn_lib
def run(scenarios_list, config, wait_duration, kubecli:KrknKubernetes, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]):
failed_scenarios = []
scenario_telemetries: list[ScenarioTelemetry] = []
for time_scenario_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = time_scenario_config
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, time_scenario_config)
try:
with open(time_scenario_config, "r") as f:
scenario_config = yaml.full_load(f)
for time_scenario in scenario_config["time_scenarios"]:
start_time = int(time.time())
object_type, object_names = skew_time(time_scenario, kubecli)
not_reset = check_date_time(object_type, object_names, kubecli)
if len(not_reset) > 0:
logging.info("Object times were not reset")
logging.info(
"Waiting for the specified duration: %s" % (wait_duration)
)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(
config,
not_reset,
start_time,
end_time
)
except (RuntimeError, Exception):
scenario_telemetry.exit_status = 1
log_exception(time_scenario_config)
failed_scenarios.append(time_scenario_config)
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries

View File

@@ -1,121 +0,0 @@
import yaml
import logging
import time
from ..node_actions.aws_node_scenarios import AWS
from ..cerberus import setup as cerberus
from krkn_lib.telemetry.k8s import KrknTelemetryKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.utils.functions import log_exception
def run(scenarios_list, config, wait_duration, telemetry: KrknTelemetryKubernetes) -> (list[str], list[ScenarioTelemetry]) :
"""
filters the subnet of interest and applies the network acl
to create zone outage
"""
failed_post_scenarios = ""
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
for zone_outage_config in scenarios_list:
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = zone_outage_config
scenario_telemetry.start_timestamp = time.time()
telemetry.set_parameters_base64(scenario_telemetry, zone_outage_config)
try:
if len(zone_outage_config) > 1:
with open(zone_outage_config, "r") as f:
zone_outage_config_yaml = yaml.full_load(f)
scenario_config = zone_outage_config_yaml["zone_outage"]
vpc_id = scenario_config["vpc_id"]
subnet_ids = scenario_config["subnet_id"]
duration = scenario_config["duration"]
cloud_type = scenario_config["cloud_type"]
ids = {}
acl_ids_created = []
if cloud_type.lower() == "aws":
cloud_object = AWS()
else:
logging.error(
"Cloud type %s is not currently supported for "
"zone outage scenarios"
% cloud_type
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
start_time = int(time.time())
for subnet_id in subnet_ids:
logging.info("Targeting subnet_id %s" % subnet_id)

network_association_ids = []
associations, original_acl_id = \
cloud_object.describe_network_acls(vpc_id, subnet_id)
for entry in associations:
if entry["SubnetId"] == subnet_id:
network_association_ids.append(
entry["NetworkAclAssociationId"]
)
logging.info(
"Network association ids associated with "
"the subnet %s: %s"
% (subnet_id, network_association_ids)
)
acl_id = cloud_object.create_default_network_acl(vpc_id)
new_association_id = \
cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
# capture the original_acl_id, created_acl_id and
# new association_id to use during the recovery
ids[new_association_id] = original_acl_id
acl_ids_created.append(acl_id)
# wait for the specified duration
logging.info(
"Waiting for the specified duration "
"in the config: %s" % (duration)
)
time.sleep(duration)
# replace the applied acl with the previous acl in use
for new_association_id, original_acl_id in ids.items():
cloud_object.replace_network_acl_association(
new_association_id,
original_acl_id
)
logging.info(
"Waiting for 60 seconds to make sure "
"the changes are in place"
)
time.sleep(60)
# delete the network acl created for the run
for acl_id in acl_ids_created:
cloud_object.delete_network_acl(acl_id)
logging.info(
"End of scenario. "
"Waiting for the specified duration: %s" % (wait_duration)
)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
except (RuntimeError, Exception):
scenario_telemetry.exit_status = 1
failed_scenarios.append(zone_outage_config)
log_exception(zone_outage_config)
else:
scenario_telemetry.exit_status = 0
scenario_telemetry.end_timestamp = time.time()
scenario_telemetries.append(scenario_telemetry)
return failed_scenarios, scenario_telemetries
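The recovery bookkeeping in the zone-outage scenario, saving the original ACL per association, swapping in the deny ACL, then restoring from the saved mapping, can be sketched with a stub cloud object (names hypothetical, not the real AWS helper):

```python
class FakeCloud:
    """Stand-in for the AWS helper: tracks which ACL each association uses."""
    def __init__(self):
        self.acl_for_association = {"assoc-1": "acl-orig"}

    def replace_network_acl_association(self, association_id, acl_id):
        self.acl_for_association[association_id] = acl_id
        return association_id  # the real API returns a new association id

cloud = FakeCloud()
ids = {}  # new_association_id -> original_acl_id, kept for recovery

# Outage: swap in the deny-all ACL, remembering the original.
original = cloud.acl_for_association["assoc-1"]
new_assoc = cloud.replace_network_acl_association("assoc-1", "acl-deny")
ids[new_assoc] = original

# Recovery: restore every association to its original ACL.
for assoc, orig_acl in ids.items():
    cloud.replace_network_acl_association(assoc, orig_acl)

print(cloud.acl_for_association)  # {'assoc-1': 'acl-orig'}
```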

View File

@@ -0,0 +1 @@
from .setup import *

View File

@@ -1,7 +1,6 @@
import logging
import pandas as pd
import kraken.chaos_recommender.kraken_tests as kraken_tests
import time
KRAKEN_TESTS_PATH = "./kraken_chaos_tests.txt"
@@ -23,7 +22,9 @@ def calculate_zscores(data):
zscores["Service"] = data["service"]
zscores["CPU"] = (data["CPU"] - data["CPU"].mean()) / data["CPU"].std()
zscores["Memory"] = (data["MEM"] - data["MEM"].mean()) / data["MEM"].std()
zscores["Network"] = (data["NETWORK"] - data["NETWORK"].mean()) / data["NETWORK"].std()
zscores["Network"] = (data["NETWORK"] - data["NETWORK"].mean()) / data[
"NETWORK"
].std()
return zscores
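The z-score computed above is (x − mean) / std, where pandas' `.std()` defaults to the sample standard deviation (ddof=1). A stdlib sketch of the same formula using `statistics`, whose `stdev` also uses the sample definition:

```python
import statistics

def zscores(values):
    """(x - mean) / sample standard deviation, mirroring pandas' .std() default."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), like pandas
    return [(v - mean) / std for v in values]

# One service clearly hotter than the rest stands out as a large z-score.
cpu = [10.0, 12.0, 11.0, 30.0]
print([round(z, 2) for z in zscores(cpu)])
```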
@@ -37,18 +38,28 @@ def identify_outliers(data, threshold):
def get_services_above_heatmap_threshold(dataframe, cpu_threshold, mem_threshold):
# Filter the DataFrame based on CPU_HEATMAP and MEM_HEATMAP thresholds
filtered_df = dataframe[((dataframe['CPU']/dataframe['CPU_LIMITS']) > cpu_threshold)]
filtered_df = dataframe[
((dataframe["CPU"] / dataframe["CPU_LIMITS"]) > cpu_threshold)
]
# Get the lists of services
cpu_services = filtered_df['service'].tolist()
cpu_services = filtered_df["service"].tolist()
filtered_df = dataframe[((dataframe['MEM']/dataframe['MEM_LIMITS']) > mem_threshold)]
mem_services = filtered_df['service'].tolist()
filtered_df = dataframe[
((dataframe["MEM"] / dataframe["MEM_LIMITS"]) > mem_threshold)
]
mem_services = filtered_df["service"].tolist()
return cpu_services, mem_services
def analysis(file_path, namespaces, chaos_tests_config, threshold,
heatmap_cpu_threshold, heatmap_mem_threshold):
def analysis(
file_path,
namespaces,
chaos_tests_config,
threshold,
heatmap_cpu_threshold,
heatmap_mem_threshold,
):
# Load the telemetry data from file
logging.info("Fetching the Telemetry data...")
data = load_telemetry_data(file_path)
@@ -66,29 +77,43 @@ def analysis(file_path, namespaces, chaos_tests_config, threshold,
namespace_zscores = zscores.loc[zscores["Namespace"] == namespace]
namespace_data = data.loc[data["namespace"] == namespace]
outliers_cpu, outliers_memory, outliers_network = identify_outliers(
namespace_zscores, threshold)
namespace_zscores, threshold
)
cpu_services, mem_services = get_services_above_heatmap_threshold(
namespace_data, heatmap_cpu_threshold, heatmap_mem_threshold)
namespace_data, heatmap_cpu_threshold, heatmap_mem_threshold
)
analysis_data[namespace] = analysis_json(outliers_cpu, outliers_memory,
outliers_network,
cpu_services, mem_services,
chaos_tests_config)
analysis_data[namespace] = analysis_json(
outliers_cpu,
outliers_memory,
outliers_network,
cpu_services,
mem_services,
chaos_tests_config,
)
if cpu_services:
logging.info(f"These services use significant CPU compared to "
f"their assigned limits: {cpu_services}")
logging.info(
f"These services use significant CPU compared to "
f"their assigned limits: {cpu_services}"
)
else:
logging.info("There are no services that are using significant "
"CPU compared to their assigned limits "
"(infinite in case no limits are set).")
logging.info(
"There are no services that are using significant "
"CPU compared to their assigned limits "
"(infinite in case no limits are set)."
)
if mem_services:
logging.info(f"These services use significant MEMORY compared to "
f"their assigned limits: {mem_services}")
logging.info(
f"These services use significant MEMORY compared to "
f"their assigned limits: {mem_services}"
)
else:
logging.info("There are no services that are using significant "
"MEMORY compared to their assigned limits "
"(infinite in case no limits are set).")
logging.info(
"There are no services that are using significant "
"MEMORY compared to their assigned limits "
"(infinite in case no limits are set)."
)
time.sleep(2)
logging.info("Please check data in utilisation.txt for further analysis")
@@ -96,36 +121,41 @@ def analysis(file_path, namespaces, chaos_tests_config, threshold,
return analysis_data
def analysis_json(outliers_cpu, outliers_memory, outliers_network,
cpu_services, mem_services, chaos_tests_config):
def analysis_json(
outliers_cpu,
outliers_memory,
outliers_network,
cpu_services,
mem_services,
chaos_tests_config,
):
profiling = {
"cpu_outliers": outliers_cpu,
"memory_outliers": outliers_memory,
"network_outliers": outliers_network
"network_outliers": outliers_network,
}
heatmap = {
"services_with_cpu_heatmap_above_threshold": cpu_services,
"services_with_mem_heatmap_above_threshold": mem_services
"services_with_mem_heatmap_above_threshold": mem_services,
}
recommendations = {}
if cpu_services:
cpu_recommend = {"services": cpu_services,
"tests": chaos_tests_config['CPU']}
cpu_recommend = {"services": cpu_services, "tests": chaos_tests_config["CPU"]}
recommendations["cpu_services_recommendations"] = cpu_recommend
if mem_services:
mem_recommend = {"services": mem_services,
"tests": chaos_tests_config['MEM']}
mem_recommend = {"services": mem_services, "tests": chaos_tests_config["MEM"]}
recommendations["mem_services_recommendations"] = mem_recommend
if outliers_network:
outliers_network_recommend = {"outliers_networks": outliers_network,
"tests": chaos_tests_config['NETWORK']}
recommendations["outliers_network_recommendations"] = (
outliers_network_recommend)
outliers_network_recommend = {
"outliers_networks": outliers_network,
"tests": chaos_tests_config["NETWORK"],
}
recommendations["outliers_network_recommendations"] = outliers_network_recommend
return [profiling, heatmap, recommendations]

View File

@@ -1,13 +1,13 @@
def get_entries_by_category(filename, category):
# Read the file
with open(filename, 'r') as file:
with open(filename, "r") as file:
content = file.read()
# Split the content into sections based on the square brackets
sections = content.split('\n\n')
sections = content.split("\n\n")
# Define the categories
valid_categories = ['CPU', 'NETWORK', 'MEM', 'GENERIC']
valid_categories = ["CPU", "NETWORK", "MEM", "GENERIC"]
# Validate the provided category
if category not in valid_categories:
@@ -25,6 +25,10 @@ def get_entries_by_category(filename, category):
return []
# Extract the entries from the category section
entries = [entry.strip() for entry in target_section.split('\n') if entry and not entry.startswith('[')]
entries = [
entry.strip()
for entry in target_section.split("\n")
if entry and not entry.startswith("[")
]
return entries
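The file format assumed above is a series of `[CATEGORY]` headers, each followed by one entry per line, with sections separated by blank lines. A self-contained sketch of the same parsing over an in-memory string (simplified helper, not the repo's function):

```python
def entries_for(content, category):
    """Return the entries listed under a [CATEGORY] section."""
    for section in content.split("\n\n"):
        lines = section.strip().split("\n")
        if lines and lines[0].strip() == f"[{category}]":
            return [ln.strip() for ln in lines[1:] if ln.strip()]
    return []

sample = "[CPU]\ncpu-hog\n\n[MEM]\nmem-hog\nmem-fill\n"
print(entries_for(sample, "MEM"))  # ['mem-hog', 'mem-fill']
```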

View File

@@ -0,0 +1,203 @@
import logging
from prometheus_api_client import PrometheusConnect
import pandas as pd
import urllib3
saved_metrics_path = "./utilisation.txt"
def convert_data_to_dataframe(data, label):
df = pd.DataFrame()
df["service"] = [item["metric"]["pod"] for item in data]
df[label] = [item["value"][1] for item in data]
return df
def convert_data(data, service):
result = {}
for entry in data:
pod_name = entry["metric"]["pod"]
value = entry["value"][1]
result[pod_name] = value
return result.get(
service
) # returns None for pods that have no recorded value for this metric
def convert_data_limits(data, node_data, service, prometheus):
result = {}
for entry in data:
pod_name = entry["metric"]["pod"]
value = entry["value"][1]
result[pod_name] = value
return result.get(
service, get_node_capacity(node_data, service, prometheus)
) # pods without defined limits can consume as much as the node provides, so fall back to the node capacity
def get_node_capacity(node_data, pod_name, prometheus):
# Get the node name on which the pod is running
query = f'kube_pod_info{{pod="{pod_name}"}}'
result = prometheus.custom_query(query)
if not result:
return None
node_name = result[0]["metric"]["node"]
for item in node_data:
if item["metric"]["node"] == node_name:
return item["value"][1]
return "1000000000"
def save_utilization_to_file(utilization, filename, prometheus):
merged_df = pd.DataFrame(
columns=[
"namespace",
"service",
"CPU",
"CPU_LIMITS",
"MEM",
"MEM_LIMITS",
"NETWORK",
]
)
for namespace in utilization:
# Loading utilization_data[] for namespace
# indexes -- 0 CPU, 1 CPU limits, 2 mem, 3 mem limits, 4 network
utilization_data = utilization[namespace]
df_cpu = convert_data_to_dataframe(utilization_data[0], "CPU")
services = df_cpu.service.unique()
logging.info(f"Services for namespace {namespace}: {services}")
for s in services:
new_row_df = pd.DataFrame(
{
"namespace": namespace,
"service": s,
"CPU": convert_data(utilization_data[0], s),
"CPU_LIMITS": convert_data_limits(
utilization_data[1], utilization_data[5], s, prometheus
),
"MEM": convert_data(utilization_data[2], s),
"MEM_LIMITS": convert_data_limits(
utilization_data[3], utilization_data[6], s, prometheus
),
"NETWORK": convert_data(utilization_data[4], s),
},
index=[0],
)
merged_df = pd.concat([merged_df, new_row_df], ignore_index=True)
# Convert columns to string
merged_df["CPU"] = merged_df["CPU"].astype(str)
merged_df["MEM"] = merged_df["MEM"].astype(str)
merged_df["CPU_LIMITS"] = merged_df["CPU_LIMITS"].astype(str)
merged_df["MEM_LIMITS"] = merged_df["MEM_LIMITS"].astype(str)
merged_df["NETWORK"] = merged_df["NETWORK"].astype(str)
# Extract integer part before the decimal point
# merged_df['CPU'] = merged_df['CPU'].str.split('.').str[0]
# merged_df['MEM'] = merged_df['MEM'].str.split('.').str[0]
# merged_df['CPU_LIMITS'] = merged_df['CPU_LIMITS'].str.split('.').str[0]
# merged_df['MEM_LIMITS'] = merged_df['MEM_LIMITS'].str.split('.').str[0]
# merged_df['NETWORK'] = merged_df['NETWORK'].str.split('.').str[0]
merged_df.to_csv(filename, sep="\t", index=False)
def fetch_utilization_from_prometheus(
prometheus_endpoint, auth_token, namespaces, scrape_duration
):
urllib3.disable_warnings()
prometheus = PrometheusConnect(
url=prometheus_endpoint,
headers={"Authorization": "Bearer {}".format(auth_token)},
disable_ssl=True,
)
# Dicts for saving utilisation and queries -- key is namespace
utilization = {}
queries = {}
logging.info("Fetching utilization...")
for namespace in namespaces:
# Fetch CPU utilization
cpu_query = (
'sum (rate (container_cpu_usage_seconds_total{image!="", namespace="%s"}[%s])) by (pod) *1000'
% (namespace, scrape_duration)
)
cpu_result = prometheus.custom_query(cpu_query)
cpu_limits_query = (
'(sum by (pod) (kube_pod_container_resource_limits{resource="cpu", namespace="%s"}))*1000'
% (namespace)
)
cpu_limits_result = prometheus.custom_query(cpu_limits_query)
node_cpu_limits_query = (
'kube_node_status_capacity{resource="cpu", unit="core"}*1000'
)
node_cpu_limits_result = prometheus.custom_query(node_cpu_limits_query)
mem_query = (
'sum by (pod) (avg_over_time(container_memory_usage_bytes{image!="", namespace="%s"}[%s]))'
% (namespace, scrape_duration)
)
mem_result = prometheus.custom_query(mem_query)
mem_limits_query = (
'sum by (pod) (kube_pod_container_resource_limits{resource="memory", namespace="%s"}) '
% (namespace)
)
mem_limits_result = prometheus.custom_query(mem_limits_query)
node_mem_limits_query = (
'kube_node_status_capacity{resource="memory", unit="byte"}'
)
node_mem_limits_result = prometheus.custom_query(node_mem_limits_query)
network_query = (
'sum by (pod) ((avg_over_time(container_network_transmit_bytes_total{namespace="%s"}[%s])) + \
(avg_over_time(container_network_receive_bytes_total{namespace="%s"}[%s])))'
% (namespace, scrape_duration, namespace, scrape_duration)
)
network_result = prometheus.custom_query(network_query)
utilization[namespace] = [
cpu_result,
cpu_limits_result,
mem_result,
mem_limits_result,
network_result,
node_cpu_limits_result,
node_mem_limits_result,
]
queries[namespace] = json_queries(
cpu_query, cpu_limits_query, mem_query, mem_limits_query, network_query
)
save_utilization_to_file(utilization, saved_metrics_path, prometheus)
return saved_metrics_path, queries
def json_queries(
cpu_query, cpu_limits_query, mem_query, mem_limits_query, network_query
):
queries = {
"cpu_query": cpu_query,
"cpu_limit_query": cpu_limits_query,
"memory_query": mem_query,
"memory_limit_query": mem_limits_query,
"network_query": network_query,
}
return queries

View File

@@ -14,7 +14,9 @@ def setup(repo, distribution):
logging.error("Provided distribution: %s is not supported" % (distribution))
sys.exit(1)
delete_repo = "rm -rf performance-dashboards || exit 0"
logging.info("Cloning, installing mutable grafana on the cluster and loading the dashboards")
logging.info(
"Cloning, installing mutable grafana on the cluster and loading the dashboards"
)
try:
# delete repo to clone the latest copy if exists
subprocess.run(delete_repo, shell=True, universal_newlines=True, timeout=45)

krkn/prometheus/client.py Normal file
View File

@@ -0,0 +1,205 @@
from __future__ import annotations
import datetime
import os.path
from typing import Optional, List, Dict, Any
import urllib3
import logging
import sys
import yaml
from krkn_lib.elastic.krkn_elastic import KrknElastic
from krkn_lib.models.elastic.models import ElasticAlert
from krkn_lib.models.krkn import ChaosRunAlertSummary, ChaosRunAlert
from krkn_lib.prometheus.krkn_prometheus import KrknPrometheus
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
def alerts(
prom_cli: KrknPrometheus,
elastic: KrknElastic,
run_uuid,
start_time,
end_time,
alert_profile,
elastic_collect_alerts,
elastic_alerts_index,
):
if alert_profile is None or os.path.exists(alert_profile) is False:
logging.error(f"{alert_profile} alert profile does not exist")
sys.exit(1)
with open(alert_profile) as profile:
profile_yaml = yaml.safe_load(profile)
if not isinstance(profile_yaml, list):
logging.error(
f"{alert_profile} wrong file format, alert profile must be "
f"a valid yaml file containing a list of items with at least 3 properties: "
f"expr, description, severity"
)
sys.exit(1)
for alert in profile_yaml:
if sorted(alert.keys()) != sorted(["expr", "description", "severity"]):
logging.error(f"wrong alert {alert}, skipping")
continue
processed_alert = prom_cli.process_alert(
alert,
datetime.datetime.fromtimestamp(start_time),
datetime.datetime.fromtimestamp(end_time),
)
if (
processed_alert[0]
and processed_alert[1]
and elastic
and elastic_collect_alerts
):
elastic_alert = ElasticAlert(
run_uuid=run_uuid,
severity=alert["severity"],
alert=processed_alert[1],
created_at=datetime.datetime.fromtimestamp(processed_alert[0]),
)
result = elastic.push_alert(elastic_alert, elastic_alerts_index)
if result == -1:
logging.error("failed to save alert on ElasticSearch")
pass
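A note on the key validation in `alerts`: `list.sort()` sorts in place and returns `None`, so an expression like `a.sort() != b.sort()` always compares `None` with `None` and never fires. `sorted()` returns a new list and performs the intended comparison:

```python
required = ["expr", "description", "severity"]
bad_alert = {"expr": "up == 0"}  # malformed: missing keys

# Broken: list.sort() returns None, so this check can never fire.
assert (list(bad_alert.keys()).sort() != required.sort()) is False

# Correct: sorted() returns a new list, so malformed alerts are caught.
assert sorted(bad_alert.keys()) != sorted(required)
```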
def critical_alerts(
prom_cli: KrknPrometheus,
summary: ChaosRunAlertSummary,
run_id,
scenario,
start_time,
end_time,
):
summary.scenario = scenario
summary.run_id = run_id
query = r"""ALERTS{severity="critical"}"""
logging.info("Checking for critical alerts firing post chaos")
during_critical_alerts = prom_cli.process_prom_query_in_range(
query,
start_time=datetime.datetime.fromtimestamp(start_time),
end_time=datetime.datetime.fromtimestamp(end_time),
)
for alert in during_critical_alerts:
if "metric" in alert:
alertname = (
alert["metric"]["alertname"]
if "alertname" in alert["metric"]
else "none"
)
alertstate = (
alert["metric"]["alertstate"]
if "alertstate" in alert["metric"]
else "none"
)
namespace = (
alert["metric"]["namespace"]
if "namespace" in alert["metric"]
else "none"
)
severity = (
alert["metric"]["severity"] if "severity" in alert["metric"] else "none"
)
alert = ChaosRunAlert(alertname, alertstate, namespace, severity)
summary.chaos_alerts.append(alert)
post_critical_alerts = prom_cli.process_query(query)
for alert in post_critical_alerts:
if "metric" in alert:
alertname = (
alert["metric"]["alertname"]
if "alertname" in alert["metric"]
else "none"
)
alertstate = (
alert["metric"]["alertstate"]
if "alertstate" in alert["metric"]
else "none"
)
namespace = (
alert["metric"]["namespace"]
if "namespace" in alert["metric"]
else "none"
)
severity = (
alert["metric"]["severity"] if "severity" in alert["metric"] else "none"
)
alert = ChaosRunAlert(alertname, alertstate, namespace, severity)
summary.post_chaos_alerts.append(alert)
during_critical_alerts_count = len(during_critical_alerts)
post_critical_alerts_count = len(post_critical_alerts)
firing_alerts = False
if during_critical_alerts_count > 0:
firing_alerts = True
if post_critical_alerts_count > 0:
firing_alerts = True
if not firing_alerts:
logging.info("No critical alerts are firing!!")
def metrics(
prom_cli: KrknPrometheus,
elastic: KrknElastic,
run_uuid,
start_time,
end_time,
metrics_profile,
elastic_collect_metrics,
elastic_metrics_index,
) -> list[dict[str, list[(int, float)] | str]]:
metrics_list: list[dict[str, list[(int, float)] | str]] = []
if metrics_profile is None or os.path.exists(metrics_profile) is False:
logging.error(f"{metrics_profile} metrics profile does not exist")
sys.exit(1)
with open(metrics_profile) as profile:
profile_yaml = yaml.safe_load(profile)
if not profile_yaml["metrics"] or not isinstance(profile_yaml["metrics"], list):
logging.error(
f"{metrics_profile} wrong file format, the metrics profile must be "
f"a valid yaml file containing a list of items with 3 properties: "
f"query, metricName, instant"
)
sys.exit(1)
for metric_query in profile_yaml["metrics"]:
if sorted(metric_query.keys()) != sorted(["query", "metricName", "instant"]):
logging.error(f"wrong metric query {metric_query}, skipping")
continue
metrics_result = prom_cli.process_prom_query_in_range(
metric_query["query"],
start_time=datetime.datetime.fromtimestamp(start_time),
end_time=datetime.datetime.fromtimestamp(end_time),
)
metric = {"name": metric_query["metricName"], "values": []}
for returned_metric in metrics_result:
if "values" in returned_metric:
for value in returned_metric["values"]:
try:
metric["values"].append((value[0], float(value[1])))
except ValueError:
pass
metrics_list.append(metric)
if elastic_collect_metrics and elastic:
result = elastic.upload_metrics_to_elasticsearch(
run_uuid=run_uuid, index=elastic_metrics_index, raw_data=metrics_list
)
if result == -1:
logging.error("failed to save metrics on ElasticSearch")
return metrics_list

View File

@@ -0,0 +1,115 @@
import logging
import time
from abc import ABC, abstractmethod
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn import utils
class AbstractScenarioPlugin(ABC):
@abstractmethod
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
"""
This method is the entry point of a ScenarioPlugin. To make the plugin loadable,
extend the AbstractScenarioPlugin class and implement this method.
No exception must propagate outside of this method.
:param run_uuid: the uuid of the chaos run, generated by krkn for every single run
:param scenario: the config file of the scenario currently being executed
:param krkn_config: the full dictionary representation of the `config.yaml`
:param lib_telemetry: a composite object exposing all the
krkn-lib objects and methods a krkn plugin needs to run.
:param scenario_telemetry: the `ScenarioTelemetry` object of the scenario currently being executed
:return: 0 if the scenario succeeded, 1 if it failed
"""
pass
@abstractmethod
def get_scenario_types(self) -> list[str]:
"""
Returns the scenario types handled by this plugin, as specified in the `config.yaml`.
For the plugin to be properly loaded, recognized and executed, this method must be implemented
and must return the matching `scenario_type` strings. One plugin can be mapped to one or many
strings, which must be unique across all plugins; otherwise an exception will be thrown.
:return: the corresponding scenario_type as a list of strings
"""
pass
def run_scenarios(
self,
run_uuid: str,
scenarios_list: list[str],
krkn_config: dict[str, any],
telemetry: KrknTelemetryOpenshift,
) -> tuple[list[str], list[ScenarioTelemetry]]:
scenario_telemetries: list[ScenarioTelemetry] = []
failed_scenarios = []
wait_duration = krkn_config["tunings"]["wait_duration"]
for scenario_config in scenarios_list:
if isinstance(scenario_config, list):
logging.error(
"post scenarios have been deprecated, please "
"remove sub-lists from `scenarios` in config.yaml"
)
failed_scenarios.append(scenario_config)
break
scenario_telemetry = ScenarioTelemetry()
scenario_telemetry.scenario = scenario_config
scenario_telemetry.start_timestamp = time.time()
parsed_scenario_config = telemetry.set_parameters_base64(
scenario_telemetry, scenario_config
)
try:
logging.info(
f"Running {self.__class__.__name__}: {self.get_scenario_types()} -> {scenario_config}"
)
return_value = self.run(
run_uuid,
scenario_config,
krkn_config,
telemetry,
scenario_telemetry,
)
except Exception as e:
logging.error(
f"uncaught exception on scenario `run()` method: {e} "
f"please report an issue on https://github.com/krkn-chaos/krkn"
)
return_value = 1
scenario_telemetry.exit_status = return_value
scenario_telemetry.end_timestamp = time.time()
utils.collect_and_put_ocp_logs(
telemetry,
parsed_scenario_config,
telemetry.get_telemetry_request_id(),
int(scenario_telemetry.start_timestamp),
int(scenario_telemetry.end_timestamp),
)
utils.populate_cluster_events(
scenario_telemetry,
parsed_scenario_config,
telemetry.get_lib_kubernetes(),
int(scenario_telemetry.start_timestamp),
int(scenario_telemetry.end_timestamp),
)
if scenario_telemetry.exit_status != 0:
failed_scenarios.append(scenario_config)
scenario_telemetries.append(scenario_telemetry)
logging.info(f"waiting {wait_duration} seconds before running the next scenario")
time.sleep(wait_duration)
return failed_scenarios, scenario_telemetries
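The plugin contract can be summarized with a simplified, self-contained sketch (stdlib only; real plugins extend krkn's `AbstractScenarioPlugin` and receive krkn-lib telemetry objects rather than these reduced signatures):

```python
from abc import ABC, abstractmethod

class MiniScenarioPlugin(ABC):
    """Stripped-down stand-in for krkn's AbstractScenarioPlugin."""

    @abstractmethod
    def run(self, run_uuid: str, scenario: str) -> int:
        """Return 0 on success, 1 on failure; never let exceptions escape."""

    @abstractmethod
    def get_scenario_types(self) -> list[str]:
        """scenario_type strings this plugin handles (unique across plugins)."""

class DummyOutagePlugin(MiniScenarioPlugin):
    def run(self, run_uuid: str, scenario: str) -> int:
        try:
            # a real plugin would load the `scenario` YAML and inject chaos here
            return 0
        except Exception:
            return 1

    def get_scenario_types(self) -> list[str]:
        return ["dummy_outage_scenarios"]

plugin = DummyOutagePlugin()
print(plugin.get_scenario_types(), plugin.run("uuid-1", "scenario.yaml"))
```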

View File

@@ -0,0 +1,88 @@
import logging
import time
import yaml
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn_lib.utils import get_yaml_item_value
from jinja2 import Template
from krkn import cerberus
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
class ApplicationOutageScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
wait_duration = krkn_config["tunings"]["wait_duration"]
try:
with open(scenario, "r") as f:
app_outage_config_yaml = yaml.full_load(f)
scenario_config = app_outage_config_yaml["application_outage"]
pod_selector = get_yaml_item_value(
scenario_config, "pod_selector", "{}"
)
traffic_type = get_yaml_item_value(
scenario_config, "block", "[Ingress, Egress]"
)
namespace = get_yaml_item_value(scenario_config, "namespace", "")
duration = get_yaml_item_value(scenario_config, "duration", 60)
start_time = int(time.time())
network_policy_template = """---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kraken-deny
spec:
podSelector:
matchLabels: {{ pod_selector }}
policyTypes: {{ traffic_type }}
"""
t = Template(network_policy_template)
rendered_spec = t.render(
pod_selector=pod_selector, traffic_type=traffic_type
)
yaml_spec = yaml.safe_load(rendered_spec)
# Block the traffic by creating network policy
logging.info("Creating the network policy")
lib_telemetry.get_lib_kubernetes().create_net_policy(
yaml_spec, namespace
)
# wait for the specified duration
logging.info(
"Waiting for the specified duration in the config: %s" % duration
)
time.sleep(duration)
# unblock the traffic by deleting the network policy
logging.info("Deleting the network policy")
lib_telemetry.get_lib_kubernetes().delete_net_policy(
"kraken-deny", namespace
)
logging.info(
"End of scenario. Waiting for the specified duration: %s"
% wait_duration
)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(krkn_config, [], start_time, end_time)
except Exception as e:
logging.error(
"ApplicationOutageScenarioPlugin exiting due to Exception %s" % e
)
return 1
else:
return 0
def get_scenario_types(self) -> list[str]:
return ["application_outages_scenarios"]
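The NetworkPolicy manifest above is rendered from a Jinja2 template with two substitutions. The same substitution can be sketched with plain string replacement to see the resulting manifest (a simplified stand-in for `jinja2.Template(...).render(...)`, with no Jinja2/PyYAML dependency):

```python
template = """apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kraken-deny
spec:
  podSelector:
    matchLabels: {{ pod_selector }}
  policyTypes: {{ traffic_type }}
"""

def render(pod_selector, traffic_type):
    # stand-in for jinja2.Template(template).render(...)
    return (template
            .replace("{{ pod_selector }}", pod_selector)
            .replace("{{ traffic_type }}", traffic_type))

manifest = render("{app: web}", "[Ingress, Egress]")
print(manifest)
```

Selecting all pods (`{}` selector) with both `Ingress` and `Egress` policy types and no allow rules is what makes the rendered policy a deny-all for the target namespace.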

View File

@@ -0,0 +1,197 @@
import logging
import os
from pathlib import Path
import arcaflow
import yaml
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
from krkn.scenario_plugins.arcaflow.context_auth import ContextAuth
class ArcaflowScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
try:
engine_args = self.build_args(scenario)
status_code = self.run_workflow(
engine_args, lib_telemetry.get_lib_kubernetes().get_kubeconfig_path()
)
return status_code
except Exception as e:
logging.error("ArcaflowScenarioPlugin exiting due to Exception %s" % e)
return 1
def get_scenario_types(self) -> list[str]:
return ["hog_scenarios", "arcaflow_scenario"]
def run_workflow(
self, engine_args: arcaflow.EngineArgs, kubeconfig_path: str
) -> int:
self.set_arca_kubeconfig(engine_args, kubeconfig_path)
exit_status = arcaflow.run(engine_args)
return exit_status
def build_args(self, input_file: str) -> arcaflow.EngineArgs:
"""sets the kubeconfig parsed by setArcaKubeConfig as an input to the arcaflow workflow"""
current_path = Path().resolve()
context = f"{current_path}/{Path(input_file).parent}"
workflow = f"{context}/workflow.yaml"
config = f"{context}/config.yaml"
if not os.path.exists(context):
raise Exception(
"context folder for arcaflow workflow not found: {}".format(context)
)
if not os.path.exists(input_file):
raise Exception(
"input file for arcaflow workflow not found: {}".format(input_file)
)
if not os.path.exists(workflow):
raise Exception(
"workflow file for arcaflow workflow not found: {}".format(workflow)
)
if not os.path.exists(config):
raise Exception(
"configuration file for arcaflow workflow not found: {}".format(config)
)
engine_args = arcaflow.EngineArgs()
engine_args.context = context
engine_args.config = config
engine_args.workflow = workflow
engine_args.input = f"{current_path}/{input_file}"
return engine_args
def set_arca_kubeconfig(
self, engine_args: arcaflow.EngineArgs, kubeconfig_path: str
):
context_auth = ContextAuth()
if not os.path.exists(kubeconfig_path):
raise Exception("kubeconfig not found in {}".format(kubeconfig_path))
with open(kubeconfig_path, "r") as stream:
try:
kubeconfig = yaml.safe_load(stream)
context_auth.fetch_auth_data(kubeconfig)
except Exception as e:
logging.error(
"impossible to read kubeconfig file in: {}".format(kubeconfig_path)
)
raise e
kubeconfig_str = self.set_kubeconfig_auth(kubeconfig, context_auth)
with open(engine_args.input, "r") as stream:
input_file = yaml.safe_load(stream)
if "input_list" in input_file and isinstance(
input_file["input_list"], list
):
for index, _ in enumerate(input_file["input_list"]):
if isinstance(input_file["input_list"][index], dict):
input_file["input_list"][index]["kubeconfig"] = kubeconfig_str
else:
input_file["kubeconfig"] = kubeconfig_str
with open(engine_args.input, "w") as stream:
yaml.safe_dump(input_file, stream)
with open(engine_args.config, "r") as stream:
config_file = yaml.safe_load(stream)
if config_file["deployers"]["image"]["deployer_name"] == "kubernetes":
kube_connection = self.set_kubernetes_deployer_auth(
config_file["deployers"]["image"]["connection"], context_auth
)
config_file["deployers"]["image"]["connection"] = kube_connection
with open(engine_args.config, "w") as stream:
yaml.safe_dump(config_file, stream, explicit_start=True, width=4096)
def set_kubernetes_deployer_auth(
self, deployer: any, context_auth: ContextAuth
) -> any:
if context_auth.clusterHost is not None:
deployer["host"] = context_auth.clusterHost
if context_auth.clientCertificateData is not None:
deployer["cert"] = context_auth.clientCertificateData
if context_auth.clientKeyData is not None:
deployer["key"] = context_auth.clientKeyData
if context_auth.clusterCertificateData is not None:
deployer["cacert"] = context_auth.clusterCertificateData
if context_auth.username is not None:
deployer["username"] = context_auth.username
if context_auth.password is not None:
deployer["password"] = context_auth.password
if context_auth.bearerToken is not None:
deployer["bearerToken"] = context_auth.bearerToken
return deployer
def set_kubeconfig_auth(self, kubeconfig: any, context_auth: ContextAuth) -> str:
"""
Builds an arcaflow-compatible kubeconfig representation and returns it as a string.
In order to run arcaflow plugins in kubernetes/openshift, the kubeconfig must contain the client certificate/key
and the server certificate base64-encoded within the kubeconfig file itself in *-data fields. That is not always
the case; in fact, the kubeconfig may contain filesystem paths to those files. This function builds an
arcaflow-compatible kubeconfig and returns it as a string that can be safely included in input.yaml.
"""
if "current-context" not in kubeconfig.keys():
raise Exception(
"invalid kubeconfig file, impossible to determine current-context"
)
user_id = None
cluster_id = None
user_name = None
cluster_name = None
current_context = kubeconfig["current-context"]
for context in kubeconfig["contexts"]:
if context["name"] == current_context:
user_name = context["context"]["user"]
cluster_name = context["context"]["cluster"]
if user_name is None:
raise Exception(
"user not set for context {} in kubeconfig file".format(current_context)
)
if cluster_name is None:
raise Exception(
"cluster not set for context {} in kubeconfig file".format(
current_context
)
)
for index, user in enumerate(kubeconfig["users"]):
if user["name"] == user_name:
user_id = index
for index, cluster in enumerate(kubeconfig["clusters"]):
if cluster["name"] == cluster_name:
cluster_id = index
if user_id is None:
raise Exception(
"no user {} found in kubeconfig users".format(user_name)
)
if cluster_id is None:
raise Exception(
"no cluster {} found in kubeconfig clusters".format(cluster_name)
)
if "client-certificate" in kubeconfig["users"][user_id]["user"]:
kubeconfig["users"][user_id]["user"][
"client-certificate-data"
] = context_auth.clientCertificateDataBase64
del kubeconfig["users"][user_id]["user"]["client-certificate"]
if "client-key" in kubeconfig["users"][user_id]["user"]:
kubeconfig["users"][user_id]["user"][
"client-key-data"
] = context_auth.clientKeyDataBase64
del kubeconfig["users"][user_id]["user"]["client-key"]
if "certificate-authority" in kubeconfig["clusters"][cluster_id]["cluster"]:
kubeconfig["clusters"][cluster_id]["cluster"][
"certificate-authority-data"
] = context_auth.clusterCertificateDataBase64
del kubeconfig["clusters"][cluster_id]["cluster"]["certificate-authority"]
kubeconfig_str = yaml.dump(kubeconfig)
return kubeconfig_str
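The conversion performed by `set_kubeconfig_auth` can be sketched in isolation; `inline_client_certificate` is a hypothetical helper handling a single user entry, not the plugin's actual API:

```python
import base64

import yaml


def inline_client_certificate(kubeconfig: dict) -> str:
    # Read the certificate file referenced by "client-certificate",
    # store it base64-encoded in "client-certificate-data", and drop
    # the path so the resulting kubeconfig string is self-contained.
    user = kubeconfig["users"][0]["user"]
    if "client-certificate" in user:
        with open(user["client-certificate"], "rb") as f:
            user["client-certificate-data"] = base64.b64encode(f.read()).decode(
                "ascii"
            )
        del user["client-certificate"]
    return yaml.dump(kubeconfig)
```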

View File

@@ -1,4 +1,3 @@
import yaml
import os
import base64
@@ -20,23 +19,25 @@ class ContextAuth:
@property
def clusterCertificateDataBase64(self):
if self.clusterCertificateData is not None:
return base64.b64encode(bytes(self.clusterCertificateData,'utf8')).decode("ascii")
return base64.b64encode(bytes(self.clusterCertificateData, "utf8")).decode(
"ascii"
)
return
@property
def clientCertificateDataBase64(self):
if self.clientCertificateData is not None:
return base64.b64encode(bytes(self.clientCertificateData,'utf8')).decode("ascii")
return base64.b64encode(bytes(self.clientCertificateData, "utf8")).decode(
"ascii"
)
return
@property
def clientKeyDataBase64(self):
if self.clientKeyData is not None:
return base64.b64encode(bytes(self.clientKeyData,"utf-8")).decode("ascii")
return base64.b64encode(bytes(self.clientKeyData, "utf-8")).decode("ascii")
return
def fetch_auth_data(self, kubeconfig: any):
context_username = None
current_context = kubeconfig["current-context"]
@@ -56,8 +57,10 @@ class ContextAuth:
for index, user in enumerate(kubeconfig["users"]):
if user["name"] == context_username:
user_id = index
if user_id is None :
raise Exception("user {0} not found in kubeconfig users".format(context_username))
if user_id is None:
raise Exception(
"user {0} not found in kubeconfig users".format(context_username)
)
for index, cluster in enumerate(kubeconfig["clusters"]):
if cluster["name"] == self.clusterName:
@@ -83,7 +86,9 @@ class ContextAuth:
if "client-key-data" in user:
try:
self.clientKeyData = base64.b64decode(user["client-key-data"]).decode('utf-8')
self.clientKeyData = base64.b64decode(user["client-key-data"]).decode(
"utf-8"
)
except Exception as e:
raise Exception("impossible to decode client-key-data")
@@ -96,7 +101,9 @@ class ContextAuth:
if "client-certificate-data" in user:
try:
self.clientCertificateData = base64.b64decode(user["client-certificate-data"]).decode('utf-8')
self.clientCertificateData = base64.b64decode(
user["client-certificate-data"]
).decode("utf-8")
except Exception as e:
raise Exception("impossible to decode client-certificate-data")
@@ -105,13 +112,17 @@ class ContextAuth:
if "certificate-authority" in cluster:
try:
self.clusterCertificate = cluster["certificate-authority"]
self.clusterCertificateData = self.read_file(cluster["certificate-authority"])
self.clusterCertificateData = self.read_file(
cluster["certificate-authority"]
)
except Exception as e:
raise e
if "certificate-authority-data" in cluster:
try:
self.clusterCertificateData = base64.b64decode(cluster["certificate-authority-data"]).decode('utf-8')
self.clusterCertificateData = base64.b64decode(
cluster["certificate-authority-data"]
).decode("utf-8")
except Exception as e:
raise Exception("impossible to decode certificate-authority-data")
@@ -124,19 +135,8 @@ class ContextAuth:
if "token" in user:
self.bearerToken = user["token"]
def read_file(self, filename:str) -> str:
def read_file(self, filename: str) -> str:
if not os.path.exists(filename):
raise Exception("file not found {0}".format(filename))
with open(filename, "rb") as file_stream:
return file_stream.read().decode('utf-8')
return file_stream.read().decode("utf-8")

View File

@@ -1,7 +1,9 @@
import os
import unittest
from context_auth import ContextAuth
import yaml
from .context_auth import ContextAuth
class TestCurrentContext(unittest.TestCase):
@@ -9,7 +11,7 @@ class TestCurrentContext(unittest.TestCase):
def get_kubeconfig_with_data(self) -> str:
"""
This function returns a test kubeconfig file as a string.
:return: a test kubeconfig file in string format (for unit testing purposes)
""" # NOQA
return """apiVersion: v1
@@ -71,7 +73,8 @@ users:
def test_current_context(self):
cwd = os.getcwd()
current_context_data = ContextAuth()
current_context_data.fetch_auth_data(self.get_kubeconfig_with_data())
data = yaml.safe_load(self.get_kubeconfig_with_data())
current_context_data.fetch_auth_data(data)
self.assertIsNotNone(current_context_data.clusterCertificateData)
self.assertIsNotNone(current_context_data.clientCertificateData)
self.assertIsNotNone(current_context_data.clientKeyData)
@@ -81,7 +84,8 @@ users:
self.assertIsNotNone(current_context_data.clusterHost)
current_context_no_data = ContextAuth()
current_context_no_data.fetch_auth_data(self.get_kubeconfig_with_paths())
data = yaml.safe_load(self.get_kubeconfig_with_paths())
current_context_no_data.fetch_auth_data(data)
self.assertIsNotNone(current_context_no_data.clusterCertificate)
self.assertIsNotNone(current_context_no_data.clusterCertificateData)
self.assertIsNotNone(current_context_no_data.clientCertificate)
@@ -92,9 +96,3 @@ users:
self.assertIsNotNone(current_context_no_data.password)
self.assertIsNotNone(current_context_no_data.bearerToken)
self.assertIsNotNone(current_context_data.clusterHost)

View File

@@ -0,0 +1,232 @@
import logging
import random
import time
import yaml
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn_lib.utils import get_yaml_item_value
from krkn import cerberus
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
class ContainerScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
start_time = int(time.time())
pool = PodsMonitorPool(lib_telemetry.get_lib_kubernetes())
wait_duration = krkn_config["tunings"]["wait_duration"]
try:
with open(scenario, "r") as f:
cont_scenario_config = yaml.full_load(f)
self.start_monitoring(
kill_scenarios=cont_scenario_config["scenarios"], pool=pool
)
killed_containers = []
for kill_scenario in cont_scenario_config["scenarios"]:
killed_containers.extend(
self.container_killing_in_pod(
kill_scenario, lib_telemetry.get_lib_kubernetes()
)
)
logging.info(f"killed containers: {str(killed_containers)}")
result = pool.join()
if result.error:
logging.error(
f"ContainerScenarioPlugin pods failed to recover: {result.error}"
)
return 1
scenario_telemetry.affected_pods = result
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
# capture end time
end_time = int(time.time())
# publish cerberus status
cerberus.publish_kraken_status(krkn_config, [], start_time, end_time)
except Exception as e:
logging.error("ContainerScenarioPlugin exiting due to Exception %s" % e)
return 1
else:
return 0
def get_scenario_types(self) -> list[str]:
return ["container_scenarios"]
def start_monitoring(self, kill_scenarios: list[any], pool: PodsMonitorPool):
for kill_scenario in kill_scenarios:
namespace_pattern = f"^{kill_scenario['namespace']}$"
label_selector = kill_scenario["label_selector"]
recovery_time = kill_scenario["expected_recovery_time"]
pool.select_and_monitor_by_namespace_pattern_and_label(
namespace_pattern=namespace_pattern,
label_selector=label_selector,
max_timeout=recovery_time,
)
def container_killing_in_pod(self, cont_scenario, kubecli: KrknKubernetes):
scenario_name = get_yaml_item_value(cont_scenario, "name", "")
namespace = get_yaml_item_value(cont_scenario, "namespace", "*")
label_selector = get_yaml_item_value(cont_scenario, "label_selector", None)
pod_names = get_yaml_item_value(cont_scenario, "pod_names", [])
container_name = get_yaml_item_value(cont_scenario, "container_name", "")
kill_action = get_yaml_item_value(cont_scenario, "action", 1)
kill_count = get_yaml_item_value(cont_scenario, "count", 1)
if not isinstance(kill_action, int):
logging.error(
"Please make sure the action parameter defined in the "
"config is an integer"
)
raise RuntimeError()
if (kill_action < 1) or (kill_action > 15):
logging.error("Only 1-15 kill signals are supported.")
raise RuntimeError()
kill_action = "kill " + str(kill_action)
if not isinstance(pod_names, list):
logging.error("Please make sure your pod_names are in a list format")
raise RuntimeError()
if len(pod_names) == 0:
if namespace == "*":
# returns double array of pod name and namespace
pods = kubecli.get_all_pods(label_selector)
else:
# Only returns pod names
pods = kubecli.list_pods(namespace, label_selector)
else:
if namespace == "*":
logging.error(
"You must specify the namespace to kill a container in a specific pod"
)
logging.error("Scenario " + scenario_name + " failed")
raise RuntimeError()
pods = pod_names
# get container and pod name
container_pod_list = []
for pod in pods:
if isinstance(pod, list):
pod_output = kubecli.get_pod_info(pod[0], pod[1])
container_names = [
container.name for container in pod_output.containers
]
container_pod_list.append([pod[0], pod[1], container_names])
else:
pod_output = kubecli.get_pod_info(pod, namespace)
container_names = [
container.name for container in pod_output.containers
]
container_pod_list.append([pod, namespace, container_names])
killed_count = 0
killed_container_list = []
while killed_count < kill_count:
if len(container_pod_list) == 0:
logging.error(
"Trying to kill more containers than were found, try lowering kill count"
)
logging.error("Scenario " + scenario_name + " failed")
raise RuntimeError()
selected_container_pod = container_pod_list[
random.randint(0, len(container_pod_list) - 1)
]
for c_name in selected_container_pod[2]:
if container_name != "":
if c_name == container_name:
killed_container_list.append(
[
selected_container_pod[0],
selected_container_pod[1],
c_name,
]
)
self.retry_container_killing(
kill_action,
selected_container_pod[0],
selected_container_pod[1],
c_name,
kubecli,
)
break
else:
killed_container_list.append(
[selected_container_pod[0], selected_container_pod[1], c_name]
)
self.retry_container_killing(
kill_action,
selected_container_pod[0],
selected_container_pod[1],
c_name,
kubecli,
)
break
container_pod_list.remove(selected_container_pod)
killed_count += 1
logging.info("Scenario " + scenario_name + " successfully injected")
return killed_container_list
def retry_container_killing(
self, kill_action, podname, namespace, container_name, kubecli: KrknKubernetes
):
i = 0
while i < 5:
logging.info(
"Killing container %s in pod %s (ns %s)"
% (str(container_name), str(podname), str(namespace))
)
response = kubecli.exec_cmd_in_pod(
kill_action, podname, namespace, container_name
)
i += 1
# Blank response means it is done
if not response:
break
elif (
"unauthorized" in response.lower()
or "authorization" in response.lower()
):
time.sleep(2)
continue
else:
logging.warning(response)
continue
def check_failed_containers(
self, killed_container_list, wait_time, kubecli: KrknKubernetes
):
container_ready = []
timer = 0
while timer <= wait_time:
for killed_container in killed_container_list:
# killed_container is [pod_name, namespace, container_name]
pod_output = kubecli.get_pod_info(
killed_container[0], killed_container[1]
)
for container in pod_output.containers:
if container.name == killed_container[2]:
if container.ready:
container_ready.append(killed_container)
if len(container_ready) != 0:
for item in container_ready:
killed_container_list.remove(item)
container_ready = []
if len(killed_container_list) == 0:
return []
timer += 5
logging.info("Waiting 5 seconds for containers to become ready")
time.sleep(5)
return killed_container_list
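The kill-count loop in `container_killing_in_pod` above selects containers at random without replacement; a minimal sketch of that selection strategy (`pick_targets` is a hypothetical helper, not part of the plugin):

```python
import random


def pick_targets(candidates: list, kill_count: int) -> list:
    # Pick kill_count distinct entries at random, removing each pick
    # from the pool so it cannot be chosen twice.
    pool = list(candidates)
    if kill_count > len(pool):
        raise RuntimeError("kill count exceeds the number of candidates")
    picked = []
    for _ in range(kill_count):
        choice = pool[random.randint(0, len(pool) - 1)]
        picked.append(choice)
        pool.remove(choice)
    return picked
```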

View File

@@ -2,28 +2,37 @@ import random
import logging
from krkn_lib.k8s import KrknKubernetes
# krkn_lib
# Pick a random managedcluster with specified label selector
def get_managedcluster(
managedcluster_name,
label_selector,
instance_kill_count,
kubecli: KrknKubernetes):
managedcluster_name, label_selector, instance_kill_count, kubecli: KrknKubernetes
):
if managedcluster_name in kubecli.list_killable_managedclusters():
return [managedcluster_name]
elif managedcluster_name:
logging.info("managedcluster with provided managedcluster_name does not exist or the managedcluster might " "be in unavailable state.")
logging.info(
"managedcluster with provided managedcluster_name does not exist or the managedcluster might "
"be in unavailable state."
)
managedclusters = kubecli.list_killable_managedclusters(label_selector)
if not managedclusters:
raise Exception("Available managedclusters with the provided label selector do not exist")
logging.info("Available managedclusters with the label selector %s: %s" % (label_selector, managedclusters))
raise Exception(
"Available managedclusters with the provided label selector do not exist"
)
logging.info(
"Available managedclusters with the label selector %s: %s"
% (label_selector, managedclusters)
)
number_of_managedclusters = len(managedclusters)
if instance_kill_count == number_of_managedclusters:
return managedclusters
managedclusters_to_return = []
for i in range(instance_kill_count):
managedcluster_to_add = managedclusters[random.randint(0, len(managedclusters) - 1)]
managedcluster_to_add = managedclusters[
random.randint(0, len(managedclusters) - 1)
]
managedclusters_to_return.append(managedcluster_to_add)
managedclusters.remove(managedcluster_to_add)
return managedclusters_to_return

View File

@@ -0,0 +1,127 @@
import logging
import time
import yaml
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn_lib.utils import get_yaml_item_value
from krkn import cerberus, utils
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
from krkn.scenario_plugins.managed_cluster.common_functions import get_managedcluster
from krkn.scenario_plugins.managed_cluster.scenarios import Scenarios
class ManagedClusterScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
with open(scenario, "r") as f:
scenario = yaml.full_load(f)
for managedcluster_scenario in scenario["managedcluster_scenarios"]:
managedcluster_scenario_object = Scenarios(
lib_telemetry.get_lib_kubernetes()
)
if managedcluster_scenario["actions"]:
for action in managedcluster_scenario["actions"]:
start_time = int(time.time())
try:
self.inject_managedcluster_scenario(
action,
managedcluster_scenario,
managedcluster_scenario_object,
lib_telemetry.get_lib_kubernetes(),
)
end_time = int(time.time())
cerberus.get_status(krkn_config, start_time, end_time)
except Exception as e:
logging.error(
"ManagedClusterScenarioPlugin exiting due to Exception %s"
% e
)
return 1
else:
return 0
def inject_managedcluster_scenario(
self,
action,
managedcluster_scenario,
managedcluster_scenario_object,
kubecli: KrknKubernetes,
):
# Get the managedcluster scenario configurations
run_kill_count = get_yaml_item_value(managedcluster_scenario, "runs", 1)
instance_kill_count = get_yaml_item_value(
managedcluster_scenario, "instance_count", 1
)
managedcluster_name = get_yaml_item_value(
managedcluster_scenario, "managedcluster_name", ""
)
label_selector = get_yaml_item_value(
managedcluster_scenario, "label_selector", ""
)
timeout = get_yaml_item_value(managedcluster_scenario, "timeout", 120)
# Get the managedcluster to apply the scenario
if managedcluster_name:
managedcluster_name_list = managedcluster_name.split(",")
else:
managedcluster_name_list = [managedcluster_name]
for single_managedcluster_name in managedcluster_name_list:
managedclusters = get_managedcluster(
single_managedcluster_name, label_selector, instance_kill_count, kubecli
)
for single_managedcluster in managedclusters:
if action == "managedcluster_start_scenario":
managedcluster_scenario_object.managedcluster_start_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "managedcluster_stop_scenario":
managedcluster_scenario_object.managedcluster_stop_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "managedcluster_stop_start_scenario":
managedcluster_scenario_object.managedcluster_stop_start_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "managedcluster_termination_scenario":
managedcluster_scenario_object.managedcluster_termination_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "managedcluster_reboot_scenario":
managedcluster_scenario_object.managedcluster_reboot_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "stop_start_klusterlet_scenario":
managedcluster_scenario_object.stop_start_klusterlet_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "start_klusterlet_scenario":
managedcluster_scenario_object.start_klusterlet_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "stop_klusterlet_scenario":
managedcluster_scenario_object.stop_klusterlet_scenario(
run_kill_count, single_managedcluster, timeout
)
elif action == "managedcluster_crash_scenario":
managedcluster_scenario_object.managedcluster_crash_scenario(
run_kill_count, single_managedcluster, timeout
)
else:
logging.info(
"There is no managedcluster action that matches %s, skipping scenario"
% action
)
def get_managedcluster_scenario_object(self, kubecli: KrknKubernetes):
return Scenarios(kubecli)
def get_scenario_types(self) -> list[str]:
return ["managedcluster_scenarios"]

View File

@@ -2,104 +2,148 @@ from jinja2 import Environment, FileSystemLoader
import os
import time
import logging
import sys
import yaml
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions
import krkn.scenario_plugins.managed_cluster.common_functions as common_managedcluster_functions
from krkn_lib.k8s import KrknKubernetes
class GENERAL:
def __init__(self):
pass
# krkn_lib
class managedcluster_scenarios():
class Scenarios:
kubecli: KrknKubernetes
def __init__(self, kubecli: KrknKubernetes):
self.kubecli = kubecli
self.general = GENERAL()
# managedcluster scenario to start the managedcluster
def managedcluster_start_scenario(self, instance_kill_count, managedcluster, timeout):
def managedcluster_start_scenario(
self, instance_kill_count, managedcluster, timeout
):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_start_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__))
)
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
template.render(
managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3 &
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 1 -n open-cluster-management-agent""")
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 1 -n open-cluster-management-agent""",
)
)
self.kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_start_scenario has been successfully injected!")
logging.info(
"managedcluster_start_scenario has been successfully injected!"
)
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_available_status(managedcluster, timeout, self.kubecli)
common_managedcluster_functions.wait_for_available_status(
managedcluster, timeout, self.kubecli
)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
raise e
finally:
logging.info("Deleting manifestworks")
self.kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop the managedcluster
def managedcluster_stop_scenario(self, instance_kill_count, managedcluster, timeout):
def managedcluster_stop_scenario(
self, instance_kill_count, managedcluster, timeout
):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_stop_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)),encoding='utf-8')
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__)), encoding="utf-8"
)
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
template.render(
managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0 &&
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 0 -n open-cluster-management-agent""")
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 0 -n open-cluster-management-agent""",
)
)
self.kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_stop_scenario has been successfully injected!")
logging.info(
"managedcluster_stop_scenario has been successfully injected!"
)
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_unavailable_status(managedcluster, timeout, self.kubecli)
common_managedcluster_functions.wait_for_unavailable_status(
managedcluster, timeout, self.kubecli
)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
raise e
finally:
logging.info("Deleting manifestworks")
self.kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop and then start the managedcluster
def managedcluster_stop_start_scenario(self, instance_kill_count, managedcluster, timeout):
def managedcluster_stop_start_scenario(
self, instance_kill_count, managedcluster, timeout
):
logging.info("Starting managedcluster_stop_start_scenario injection")
self.managedcluster_stop_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.managedcluster_start_scenario(instance_kill_count, managedcluster, timeout)
logging.info("managedcluster_stop_start_scenario has been successfully injected!")
logging.info(
"managedcluster_stop_start_scenario has been successfully injected!"
)
# managedcluster scenario to terminate the managedcluster
def managedcluster_termination_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster termination is not implemented, " "no action is going to be taken")
def managedcluster_termination_scenario(
self, instance_kill_count, managedcluster, timeout
):
logging.info(
"managedcluster termination is not implemented, "
"no action is going to be taken"
)
# managedcluster scenario to reboot the managedcluster
def managedcluster_reboot_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster reboot is not implemented," " no action is going to be taken")
def managedcluster_reboot_scenario(
self, instance_kill_count, managedcluster, timeout
):
logging.info(
"managedcluster reboot is not implemented,"
" no action is going to be taken"
)
# managedcluster scenario to start the klusterlet
def start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting start_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__))
)
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3""")
template.render(
managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3""",
)
)
self.kubecli.create_manifestwork(body, managedcluster)
logging.info("start_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
logging.info(
"start_klusterlet_scenario has been successfully injected!"
)
time.sleep(
30
) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
raise e
finally:
logging.info("Deleting manifestworks")
self.kubecli.delete_manifestwork(managedcluster)
@@ -109,25 +153,33 @@ class managedcluster_scenarios():
for _ in range(instance_kill_count):
try:
logging.info("Starting stop_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__))
)
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0""")
template.render(
managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0""",
)
)
self.kubecli.create_manifestwork(body, managedcluster)
logging.info("stop_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
time.sleep(
30
) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
raise e
finally:
logging.info("Deleting manifestworks")
self.kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop and start the klusterlet
def stop_start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
def stop_start_klusterlet_scenario(
self, instance_kill_count, managedcluster, timeout
):
logging.info("Starting stop_start_klusterlet_scenario injection")
self.stop_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
@@ -135,6 +187,10 @@ class managedcluster_scenarios():
logging.info("stop_start_klusterlet_scenario has been successfully injected!")
# managedcluster scenario to crash the managedcluster
def managedcluster_crash_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster crash scenario is not implemented, " "no action is going to be taken")
def managedcluster_crash_scenario(
self, instance_kill_count, managedcluster, timeout
):
logging.info(
"managedcluster crash scenario is not implemented, "
"no action is going to be taken"
)
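The manifestwork scenarios above all render a Jinja2 template and parse the result with `yaml.safe_load`; a minimal standalone sketch of that flow (the inline `TEMPLATE` is hypothetical, since `manifestwork.j2` is not shown in this diff):

```python
import yaml
from jinja2 import BaseLoader, Environment

# Hypothetical stand-in for manifestwork.j2; only the render/parse
# flow matches the scenario methods above.
TEMPLATE = """\
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: chaos-{{ managedcluster_name }}
spec:
  command: "{{ args }}"
"""


def render_manifestwork(managedcluster: str, args: str) -> dict:
    env = Environment(loader=BaseLoader(), autoescape=False)
    body = env.from_string(TEMPLATE).render(
        managedcluster_name=managedcluster, args=args
    )
    # The rendered YAML body is what would be passed to create_manifestwork.
    return yaml.safe_load(body)
```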

View File

@@ -0,0 +1,93 @@
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
from krkn.scenario_plugins.native.plugins import PLUGINS
from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from typing import Any
import logging
class NativeScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
pool = PodsMonitorPool(lib_telemetry.get_lib_kubernetes())
kill_scenarios = [
kill_scenario
for kill_scenario in PLUGINS.unserialize_scenario(scenario)
if kill_scenario["id"] == "kill-pods"
]
try:
self.start_monitoring(pool, kill_scenarios)
PLUGINS.run(
scenario,
lib_telemetry.get_lib_kubernetes().get_kubeconfig_path(),
krkn_config,
run_uuid,
)
result = pool.join()
scenario_telemetry.affected_pods = result
if result.error:
logging.error(f"NativeScenarioPlugin unrecovered pods: {result.error}")
return 1
except Exception as e:
logging.error("NativeScenarioPlugin exiting due to Exception %s" % e)
pool.cancel()
return 1
else:
return 0
def get_scenario_types(self) -> list[str]:
return [
"pod_disruption_scenarios",
"pod_network_scenario",
"vmware_node_scenarios",
"ibmcloud_node_scenarios",
]
def start_monitoring(self, pool: PodsMonitorPool, scenarios: list[Any]):
for kill_scenario in scenarios:
recovery_time = kill_scenario["config"]["krkn_pod_recovery_time"]
if (
"namespace_pattern" in kill_scenario["config"]
and "label_selector" in kill_scenario["config"]
):
namespace_pattern = kill_scenario["config"]["namespace_pattern"]
label_selector = kill_scenario["config"]["label_selector"]
pool.select_and_monitor_by_namespace_pattern_and_label(
namespace_pattern=namespace_pattern,
label_selector=label_selector,
max_timeout=recovery_time,
)
logging.info(
f"waiting {recovery_time} seconds for pod recovery, "
f"pod label selector: {label_selector} namespace pattern: {namespace_pattern}"
)
elif (
"namespace_pattern" in kill_scenario["config"]
and "name_pattern" in kill_scenario["config"]
):
namespace_pattern = kill_scenario["config"]["namespace_pattern"]
name_pattern = kill_scenario["config"]["name_pattern"]
pool.select_and_monitor_by_name_pattern_and_namespace_pattern(
pod_name_pattern=name_pattern,
namespace_pattern=namespace_pattern,
max_timeout=recovery_time,
)
logging.info(
f"waiting {recovery_time} seconds for pod recovery, "
f"pod name pattern: {name_pattern} namespace pattern: {namespace_pattern}"
)
else:
raise Exception(
f"impossible to determine monitor parameters, check {kill_scenario} configuration"
)
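The `start_monitoring` method above requires each `kill-pods` config to pair `namespace_pattern` with either `label_selector` or `name_pattern`, plus a `krkn_pod_recovery_time`. A minimal standalone sketch of that selection logic (the helper name is illustrative, not part of the Krkn API):

```python
def choose_monitoring_mode(config: dict):
    # Mirrors the branching in NativeScenarioPlugin.start_monitoring:
    # pick the PodsMonitorPool selection mode from the scenario config.
    recovery_time = config["krkn_pod_recovery_time"]
    if "namespace_pattern" in config and "label_selector" in config:
        return "by_label", {
            "namespace_pattern": config["namespace_pattern"],
            "label_selector": config["label_selector"],
            "max_timeout": recovery_time,
        }
    elif "namespace_pattern" in config and "name_pattern" in config:
        return "by_name", {
            "namespace_pattern": config["namespace_pattern"],
            "pod_name_pattern": config["name_pattern"],
            "max_timeout": recovery_time,
        }
    raise Exception(
        f"impossible to determine monitor parameters, check {config} configuration"
    )
```

A config with neither `label_selector` nor `name_pattern` fails fast, matching the `raise` branch in the plugin.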


@@ -1,19 +1,17 @@
#!/usr/bin/env python
import sys
import time
import typing
from os import environ
from dataclasses import dataclass, field
import random
from traceback import format_exc
import logging
from krkn.scenario_plugins.native.node_scenarios import (
    kubernetes_functions as kube_helper,
)
from arcaflow_plugin_sdk import validation, plugin
from kubernetes import client, watch
from ibm_vpc import VpcV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_cloud_sdk_core import ApiException
import requests
@@ -26,19 +24,15 @@ class IbmCloud:
apiKey = environ.get("IBMC_APIKEY")
service_url = environ.get("IBMC_URL")
if not apiKey:
    raise Exception("Environmental variable 'IBMC_APIKEY' is not set")
if not service_url:
    raise Exception("Environmental variable 'IBMC_URL' is not set")
try:
    authenticator = IAMAuthenticator(apiKey)
    self.service = VpcV1(authenticator=authenticator)
    self.service.set_service_url(service_url)
except Exception as e:
    logging.error("error authenticating" + str(e))
    sys.exit(1)
@@ -46,15 +40,11 @@ class IbmCloud:
"""
Deletes the Instance whose name is given by 'instance_id'
"""
try:
    self.service.delete_instance(instance_id)
    logging.info("Deleted Instance -- '{}'".format(instance_id))
except Exception as e:
    logging.info("Instance '{}' could not be deleted. ".format(instance_id))
return False
def reboot_instances(self, instance_id):
@@ -65,17 +55,13 @@ class IbmCloud:
try:
    self.service.create_instance_action(
        instance_id,
        type="reboot",
    )
    logging.info("Reset Instance -- '{}'".format(instance_id))
    return True
except Exception as e:
    logging.info("Instance '{}' could not be rebooted".format(instance_id))
return False
def stop_instances(self, instance_id):
@@ -86,15 +72,13 @@ class IbmCloud:
try:
    self.service.create_instance_action(
        instance_id,
        type="stop",
    )
    logging.info("Stopped Instance -- '{}'".format(instance_id))
    return True
except Exception as e:
    logging.info("Instance '{}' could not be stopped".format(instance_id))
logging.info("error" + str(e))
return False
@@ -106,9 +90,9 @@ class IbmCloud:
try:
self.service.create_instance_action(
    instance_id,
    type="start",
)
logging.info("Started Instance -- '{}'".format(instance_id))
return True
except Exception as e:
@@ -120,27 +104,29 @@ class IbmCloud:
Returns a list of Instances present in the datacenter
"""
instance_names = []
try:
    instances_result = self.service.list_instances().get_result()
    instances_list = instances_result["instances"]
    for vpc in instances_list:
        instance_names.append({"vpc_name": vpc["name"], "vpc_id": vpc["id"]})
    starting_count = instances_result["total_count"]
    while instances_result["total_count"] == instances_result["limit"]:
        instances_result = self.service.list_instances(
            start=starting_count
        ).get_result()
        instances_list = instances_result["instances"]
        starting_count += instances_result["total_count"]
        for vpc in instances_list:
            instance_names.append({"vpc_name": vpc.name, "vpc_id": vpc.id})
except Exception as e:
logging.error("Error listing out instances: " + str(e))
sys.exit(1)
return instance_names
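The pagination in `list_instances` above keeps fetching while the reported `total_count` equals the page `limit`, advancing the start cursor each round. A self-contained sketch of that pattern; the in-memory `fetch_page` stands in for the IBM VPC client and is purely illustrative:

```python
# Fake paged backend: 5 instances served in pages of `limit`.
DATA = [{"name": f"vm-{i}"} for i in range(5)]

def fetch_page(start, limit=2):
    page = DATA[start:start + limit]
    return {"instances": page, "total_count": len(page), "limit": limit}

def list_all(limit=2):
    # Same loop shape as IbmCloud.list_instances: fetch the first page,
    # then continue while a full page (total_count == limit) came back.
    names, start = [], 0
    result = fetch_page(start, limit)
    names.extend(item["name"] for item in result["instances"])
    while result["total_count"] == result["limit"]:
        start += result["total_count"]
        result = fetch_page(start, limit)
        names.extend(item["name"] for item in result["instances"])
    return names
```

The loop terminates on the first short page; when the dataset size is an exact multiple of the limit, one extra empty fetch is made before stopping.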
def find_id_in_list(self, name, vpc_list):
    for vpc in vpc_list:
        if vpc["vpc_name"] == name:
            return vpc["vpc_id"]
def get_instance_status(self, instance_id):
"""
@@ -149,7 +135,7 @@ class IbmCloud:
try:
instance = self.service.get_instance(instance_id).get_result()
state = instance["status"]
return state
except Exception as e:
logging.error(
@@ -169,7 +155,8 @@ class IbmCloud:
while vpc is not None:
vpc = self.get_instance_status(instance_id)
logging.info(
    "Instance %s is still being deleted, sleeping for 5 seconds"
    % instance_id
)
time.sleep(5)
time_counter += 5
@@ -196,7 +183,9 @@ class IbmCloud:
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
    "Instance %s is still not ready in allotted time" % instance_id
)
return False
return True
@@ -216,7 +205,9 @@ class IbmCloud:
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
    "Instance %s is still not stopped in allotted time" % instance_id
)
return False
return True
@@ -236,7 +227,9 @@ class IbmCloud:
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
    "Instance %s is still restarting after allotted time" % instance_id
)
return False
self.wait_until_running(instance_id, timeout)
return True
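The `wait_until_*` helpers above all share one shape: poll a status every 5 seconds, give up once the elapsed time reaches the timeout. A generic sketch of that pattern (names and the injectable clock/sleep are illustrative, not part of the plugin API):

```python
import time

def wait_until(predicate, timeout, interval=5,
               clock=time.monotonic, sleep=time.sleep):
    # Poll `predicate` every `interval` seconds; return True once it
    # holds, False if `timeout` seconds elapse first.
    deadline = clock() + timeout
    while not predicate():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting; the krkn helpers instead accumulate a `time_counter` in 5-second steps, which is equivalent.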
@@ -303,9 +296,7 @@ class NodeScenarioConfig:
)
label_selector: typing.Annotated[
    typing.Optional[str], validation.min(1), validation.required_if_not("name")
] = field(
default=None,
metadata={
@@ -374,7 +365,7 @@ def node_start(
logging.info("Starting node_start_scenario injection")
logging.info("Starting the node %s " % (name))
instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
if instance_id:
vm_started = ibmcloud.start_instances(instance_id)
if vm_started:
ibmcloud.wait_until_running(instance_id, cfg.timeout)
@@ -383,12 +374,19 @@ def node_start(
name, cfg.timeout, watch_resource, core_v1
)
nodes_started[int(time.time_ns())] = Node(name=name)
    logging.info(
        "Node with instance ID: %s is in running state" % name
    )
    logging.info(
        "node_start_scenario has been successfully injected!"
    )
else:
    logging.error(
        "Failed to find node that matched instances on ibm cloud in region"
    )
    return "error", NodeScenarioErrorOutput(
        "No matching vpc with node name " + name,
        kube_helper.Actions.START,
)
except Exception as e:
logging.error("Failed to start node instance. Test Failed")
@@ -417,11 +415,11 @@ def node_stop(
ibmcloud = IbmCloud()
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
logging.info("set up done")
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.STOP, core_v1)
logging.info("set node list" + str(node_list))
node_name_id_list = ibmcloud.list_instances()
logging.info("node names" + str(node_name_id_list))
nodes_stopped = {}
for name in node_list:
try:
@@ -438,12 +436,19 @@ def node_stop(
name, cfg.timeout, watch_resource, core_v1
)
nodes_stopped[int(time.time_ns())] = Node(name=name)
    logging.info(
        "Node with instance ID: %s is in stopped state" % name
    )
    logging.info(
        "node_stop_scenario has been successfully injected!"
    )
else:
    logging.error(
        "Failed to find node that matched instances on ibm cloud in region"
    )
    return "error", NodeScenarioErrorOutput(
        "No matching vpc with node name " + name,
        kube_helper.Actions.STOP,
)
except Exception as e:
logging.error("Failed to stop node instance. Test Failed")
@@ -495,11 +500,16 @@ def node_reboot(
logging.info(
"Node with instance ID: %s has rebooted successfully" % name
)
    logging.info(
        "node_reboot_scenario has been successfully injected!"
    )
else:
    logging.error(
        "Failed to find node that matched instances on ibm cloud in region"
    )
    return "error", NodeScenarioErrorOutput(
        "No matching vpc with node name " + name,
        kube_helper.Actions.REBOOT,
)
except Exception as e:
logging.error("Failed to reboot node instance. Test Failed")
@@ -540,16 +550,23 @@ def node_terminate(
)
instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
logging.info("Deleting the node with instance ID: %s " % (name))
if instance_id:
ibmcloud.delete_instance(instance_id)
ibmcloud.wait_until_released(name, cfg.timeout)
nodes_terminated[int(time.time_ns())] = Node(name=name)
    logging.info(
        "Node with instance ID: %s has been released" % name
    )
    logging.info(
        "node_terminate_scenario has been successfully injected!"
    )
else:
    logging.error(
        "Failed to find instances that matched the node specifications on ibm cloud in the set region"
    )
    return "error", NodeScenarioErrorOutput(
        "No matching vpc with node name " + name,
        kube_helper.Actions.TERMINATE,
)
except Exception as e:
logging.error("Failed to terminate node instance. Test Failed")


@@ -9,14 +9,18 @@ from os import environ
from traceback import format_exc
import requests
from arcaflow_plugin_sdk import plugin, validation
from com.vmware.vapi.std.errors_client import (
    AlreadyInDesiredState,
    NotAllowedInCurrentState,
)
from com.vmware.vcenter.vm_client import Power
from com.vmware.vcenter_client import VM, ResourcePool
from kubernetes import client, watch
from vmware.vapi.vsphere.client import create_vsphere_client
from krkn.scenario_plugins.native.node_scenarios import (
    kubernetes_functions as kube_helper,
)
class vSphere:
@@ -104,9 +108,7 @@ class vSphere:
return True
except NotAllowedInCurrentState:
logging.info(
    "VM '{}'-'({})' is not Powered On. Cannot reset it", instance_id, vm
)
return False
@@ -122,9 +124,7 @@ class vSphere:
logging.info(f"Stopped VM -- '{instance_id}-({vm})'")
return True
except AlreadyInDesiredState:
logging.info(f"VM '{instance_id}'-'({vm})' is already Powered Off")
return False
def start_instances(self, instance_id):
@@ -139,9 +139,7 @@ class vSphere:
logging.info(f"Started VM -- '{instance_id}-({vm})'")
return True
except AlreadyInDesiredState:
logging.info(f"VM '{instance_id}'-'({vm})' is already Powered On")
return False
def list_instances(self, datacenter):
@@ -152,18 +150,14 @@ class vSphere:
datacenter_filter = self.client.vcenter.Datacenter.FilterSpec(
names=set([datacenter])
)
datacenter_summaries = self.client.vcenter.Datacenter.list(datacenter_filter)
try:
datacenter_id = datacenter_summaries[0].datacenter
except IndexError:
logging.error("Datacenter '{}' doesn't exist", datacenter)
sys.exit(1)
vm_filter = self.client.vcenter.VM.FilterSpec(datacenters={datacenter_id})
vm_summaries = self.client.vcenter.VM.list(vm_filter)
vm_names = []
for vm in vm_summaries:
@@ -177,10 +171,7 @@ class vSphere:
datacenter_summaries = self.client.vcenter.Datacenter.list()
datacenter_names = [
{
"datacenter_id": datacenter.datacenter,
"datacenter_name": datacenter.name
}
{"datacenter_id": datacenter.datacenter, "datacenter_name": datacenter.name}
for datacenter in datacenter_summaries
]
return datacenter_names
@@ -194,16 +185,11 @@ class vSphere:
datastore_filter = self.client.vcenter.Datastore.FilterSpec(
datacenters={datacenter}
)
datastore_summaries = self.client.vcenter.Datastore.list(datastore_filter)
datastore_names = []
for datastore in datastore_summaries:
datastore_names.append(
    {"datastore_name": datastore.name, "datastore_id": datastore.datastore}
)
return datastore_names
@@ -213,9 +199,7 @@ class vSphere:
IDs belonging to a specific datacenter
"""
folder_filter = self.client.vcenter.Folder.FilterSpec(datacenters={datacenter})
folder_summaries = self.client.vcenter.Folder.list(folder_filter)
folder_names = []
for folder in folder_summaries:
@@ -234,17 +218,12 @@ class vSphere:
filter_spec = ResourcePool.FilterSpec(
datacenters=set([datacenter]), names=names
)
resource_pool_summaries = self.client.vcenter.ResourcePool.list(filter_spec)
if len(resource_pool_summaries) > 0:
resource_pool = resource_pool_summaries[0].resource_pool
return resource_pool
else:
logging.error("ResourcePool not found in Datacenter '{}'", datacenter)
return None
def create_default_vm(self, guest_os="RHEL_7_64", max_attempts=10):
@@ -277,9 +256,7 @@ class vSphere:
# random generator not used for
# security/cryptographic purposes in this loop
datacenter = random.choice(datacenter_list) # nosec
resource_pool = self.get_resource_pool(datacenter["datacenter_id"])
folder = random.choice( # nosec
self.get_folder_list(datacenter["datacenter_id"])
)["folder_id"]
@@ -288,25 +265,18 @@ class vSphere:
)["datastore_id"]
vm_name = "Test-" + str(time.time_ns())
return (
    create_vm(vm_name, resource_pool, folder, datastore, guest_os),
vm_name,
)
except Exception as e:
logging.error(
    "Default VM could not be created, retrying. " "Error was: %s",
    str(e),
)
logging.error(
    "Default VM could not be created in %s attempts. "
    "Check your VMware resources",
    max_attempts,
)
return None, None
@@ -338,15 +308,12 @@ class vSphere:
while vm is not None:
vm = self.get_vm(instance_id)
logging.info(
    f"VM {instance_id} is still being deleted, " f"sleeping for 5 seconds"
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(f"VM {instance_id} is still not deleted in allotted time")
return False
return True
@@ -361,16 +328,12 @@ class vSphere:
while status != Power.State.POWERED_ON:
status = self.get_vm_status(instance_id)
logging.info(
    "VM %s is still not running, " "sleeping for 5 seconds", instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(f"VM {instance_id} is still not ready in allotted time")
return False
return True
@@ -385,15 +348,12 @@ class vSphere:
while status != Power.State.POWERED_OFF:
status = self.get_vm_status(instance_id)
logging.info(
    f"VM {instance_id} is still not running, " f"sleeping for 5 seconds"
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(f"VM {instance_id} is still not ready in allotted time")
return False
return True
@@ -410,16 +370,16 @@ class NodeScenarioSuccessOutput:
metadata={
"name": "Nodes started/stopped/terminated/rebooted",
"description": "Map between timestamps and the pods "
    "started/stopped/terminated/rebooted. "
    "The timestamp is provided in nanoseconds",
}
)
action: kube_helper.Actions = field(
metadata={
"name": "The action performed on the node",
"description": "The action performed or attempted to be "
    "performed on the node. Possible values"
    "are : Start, Stop, Terminate, Reboot",
}
)
@@ -449,7 +409,7 @@ class NodeScenarioConfig:
metadata={
"name": "Name",
"description": "Name(s) for target nodes. "
    "Required if label_selector is not set.",
},
)
@@ -458,20 +418,18 @@ class NodeScenarioConfig:
metadata={
"name": "Number of runs per node",
"description": "Number of times to inject each scenario under "
    "actions (will perform on same node each time)",
},
)
label_selector: typing.Annotated[
    typing.Optional[str], validation.min(1), validation.required_if_not("name")
] = field(
default=None,
metadata={
"name": "Label selector",
"description": "Kubernetes label selector for the target nodes. "
    "Required if name is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
"for details.",
},
@@ -482,19 +440,16 @@ class NodeScenarioConfig:
metadata={
"name": "Timeout",
"description": "Timeout to wait for the target pod(s) "
    "to be removed in seconds.",
},
)
instance_count: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=1,
metadata={
"name": "Instance Count",
"description": "Number of nodes to perform action/select "
    "that match the label selector.",
},
)
@@ -511,7 +466,7 @@ class NodeScenarioConfig:
metadata={
"name": "Verify API Session",
"description": "Verifies the vSphere client session. "
    "It is enabled by default",
},
)
@@ -520,7 +475,7 @@ class NodeScenarioConfig:
metadata={
"name": "Kubeconfig path",
"description": "Path to your Kubeconfig file. "
    "Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
"for details.",
},
@@ -531,11 +486,8 @@ class NodeScenarioConfig:
id="vmware-node-start",
name="Start the node",
description="Start the node(s) by starting the VMware VM "
    "on which the node is configured",
    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_start(
cfg: NodeScenarioConfig,
@@ -546,11 +498,7 @@ def node_start(
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.START, core_v1)
nodes_started = {}
for name in node_list:
try:
@@ -565,17 +513,12 @@ def node_start(
name, cfg.timeout, watch_resource, core_v1
)
nodes_started[int(time.time_ns())] = Node(name=name)
logging.info(f"Node with instance ID: {name} is in running state")
logging.info("node_start_scenario has been successfully injected!")
except Exception as e:
logging.error("Failed to start node instance. Test Failed")
logging.error(
    f"node_start_scenario injection failed! " f"Error was: {str(e)}"
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.START
@@ -590,11 +533,8 @@ def node_start(
id="vmware-node-stop",
name="Stop the node",
description="Stop the node(s) by starting the VMware VM "
    "on which the node is configured",
    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_stop(
cfg: NodeScenarioConfig,
@@ -605,11 +545,7 @@ def node_stop(
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.STOP, core_v1)
nodes_stopped = {}
for name in node_list:
try:
@@ -624,17 +560,12 @@ def node_stop(
name, cfg.timeout, watch_resource, core_v1
)
nodes_stopped[int(time.time_ns())] = Node(name=name)
logging.info(f"Node with instance ID: {name} is in stopped state")
logging.info("node_stop_scenario has been successfully injected!")
except Exception as e:
logging.error("Failed to stop node instance. Test Failed")
logging.error(
    f"node_stop_scenario injection failed! " f"Error was: {str(e)}"
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.STOP
@@ -649,11 +580,8 @@ def node_stop(
id="vmware-node-reboot",
name="Reboot VMware VM",
description="Reboot the node(s) by starting the VMware VM "
    "on which the node is configured",
    outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_reboot(
cfg: NodeScenarioConfig,
@@ -664,11 +592,7 @@ def node_reboot(
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.REBOOT, core_v1)
nodes_rebooted = {}
for name in node_list:
try:
@@ -685,17 +609,13 @@ def node_reboot(
)
nodes_rebooted[int(time.time_ns())] = Node(name=name)
logging.info(
        f"Node with instance ID: {name} has rebooted " "successfully"
    )
    logging.info("node_reboot_scenario has been successfully injected!")
except Exception as e:
logging.error("Failed to reboot node instance. Test Failed")
logging.error(
    f"node_reboot_scenario injection failed! " f"Error was: {str(e)}"
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.REBOOT
@@ -733,24 +653,18 @@ def node_terminate(
)
vsphere.stop_instances(name)
vsphere.wait_until_stopped(name, cfg.timeout)
logging.info(f"Releasing the node with instance ID: {name} ")
vsphere.release_instances(name)
vsphere.wait_until_released(name, cfg.timeout)
nodes_terminated[int(time.time_ns())] = Node(name=name)
logging.info(f"Node with instance ID: {name} has been released")
logging.info(
    "node_terminate_scenario has been " "successfully injected!"
)
except Exception as e:
logging.error("Failed to terminate node instance. Test Failed")
logging.error(
    f"node_terminate_scenario injection failed! " f"Error was: {str(e)}"
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.TERMINATE

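The `NodeScenarioConfig` in the file above uses `validation.required_if_not` so that `name` and `label_selector` are mutually required: at least one must be set. A plain-Python sketch of that constraint, outside the arcaflow SDK (the helper name is illustrative):

```python
def validate_target(name=None, label_selector=None):
    # Mirrors the required_if_not("name") / required_if_not("label_selector")
    # pairing on NodeScenarioConfig: a scenario must target nodes either
    # by explicit name or by Kubernetes label selector.
    if not name and not label_selector:
        raise ValueError("either name or label_selector must be set")
    return {"name": name, "label_selector": label_selector}
```

In the SDK the same rule is declared on the `typing.Annotated` field, so invalid configs are rejected when the step input is unserialized rather than at run time.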

@@ -0,0 +1,176 @@
import dataclasses
import json
import logging
from os.path import abspath
from typing import List, Any, Dict
from krkn.scenario_plugins.native.run_python_plugin import run_python_file
from arcaflow_plugin_kill_pod import kill_pods, wait_for_pods
from krkn.scenario_plugins.native.network.ingress_shaping import network_chaos
from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin import (
pod_outage,
)
from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin import (
pod_egress_shaping,
)
import krkn.scenario_plugins.native.node_scenarios.ibmcloud_plugin as ibmcloud_plugin
from krkn.scenario_plugins.native.pod_network_outage.pod_network_outage_plugin import (
pod_ingress_shaping,
)
from arcaflow_plugin_sdk import schema, serialization, jsonschema
from krkn.scenario_plugins.native.node_scenarios import vmware_plugin
@dataclasses.dataclass
class PluginStep:
schema: schema.StepSchema
error_output_ids: List[str]
def render_output(self, output_id: str, output_data) -> str:
return json.dumps(
{
"output_id": output_id,
"output_data": self.schema.outputs[output_id].serialize(output_data),
},
indent="\t",
)
class Plugins:
"""
Plugins is a class that can run plugins sequentially. The output is rendered to the standard output and the process
is aborted if a step fails.
"""
steps_by_id: Dict[str, PluginStep]
def __init__(self, steps: List[PluginStep]):
self.steps_by_id = dict()
for step in steps:
if step.schema.id in self.steps_by_id:
raise Exception("Duplicate step ID: {}".format(step.schema.id))
self.steps_by_id[step.schema.id] = step
def unserialize_scenario(self, file: str) -> Any:
return serialization.load_from_file(abspath(file))
def run(self, file: str, kubeconfig_path: str, kraken_config: str, run_uuid: str):
"""
Run executes a series of steps
"""
data = self.unserialize_scenario(abspath(file))
if not isinstance(data, list):
raise Exception(
"Invalid scenario configuration file: {} expected list, found {}".format(
file, type(data).__name__
)
)
i = 0
for entry in data:
if not isinstance(entry, dict):
raise Exception(
"Invalid scenario configuration file: {} expected a list of dict's, found {} on step {}".format(
file, type(entry).__name__, i
)
)
if "id" not in entry:
raise Exception(
"Invalid scenario configuration file: {} missing 'id' field on step {}".format(
file,
i,
)
)
if "config" not in entry:
raise Exception(
"Invalid scenario configuration file: {} missing 'config' field on step {}".format(
file,
i,
)
)
if entry["id"] not in self.steps_by_id:
raise Exception(
"Invalid step {} in {} ID: {} expected one of: {}".format(
i, file, entry["id"], ", ".join(self.steps_by_id.keys())
)
)
step = self.steps_by_id[entry["id"]]
unserialized_input = step.schema.input.unserialize(entry["config"])
if "kubeconfig_path" in step.schema.input.properties:
unserialized_input.kubeconfig_path = kubeconfig_path
if "kraken_config" in step.schema.input.properties:
unserialized_input.kraken_config = kraken_config
output_id, output_data = step.schema(
params=unserialized_input, run_id=run_uuid
)
logging.info(step.render_output(output_id, output_data) + "\n")
if output_id in step.error_output_ids:
raise Exception(
"Step {} in {} ({}) failed".format(i, file, step.schema.id)
)
i = i + 1
def json_schema(self):
"""
This function generates a JSON schema document and renders it from the steps passed.
"""
result = {
"$id": "https://github.com/redhat-chaos/krkn/",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Kraken Arcaflow scenarios",
"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
"type": "array",
"minContains": 1,
"items": {"oneOf": []},
}
for step_id in self.steps_by_id.keys():
step = self.steps_by_id[step_id]
step_input = jsonschema.step_input(step.schema)
del step_input["$id"]
del step_input["$schema"]
del step_input["title"]
del step_input["description"]
result["items"]["oneOf"].append(
{
"type": "object",
"properties": {
"id": {
"type": "string",
"const": step_id,
},
"config": step_input,
},
"required": [
"id",
"config",
],
}
)
return json.dumps(result, indent="\t")
PLUGINS = Plugins(
[
PluginStep(
kill_pods,
[
"error",
],
),
PluginStep(wait_for_pods, ["error"]),
PluginStep(run_python_file, ["error"]),
PluginStep(vmware_plugin.node_start, ["error"]),
PluginStep(vmware_plugin.node_stop, ["error"]),
PluginStep(vmware_plugin.node_reboot, ["error"]),
PluginStep(vmware_plugin.node_terminate, ["error"]),
PluginStep(ibmcloud_plugin.node_start, ["error"]),
PluginStep(ibmcloud_plugin.node_stop, ["error"]),
PluginStep(ibmcloud_plugin.node_reboot, ["error"]),
PluginStep(ibmcloud_plugin.node_terminate, ["error"]),
PluginStep(network_chaos, ["error"]),
PluginStep(pod_outage, ["error"]),
PluginStep(pod_egress_shaping, ["error"]),
PluginStep(pod_ingress_shaping, ["error"]),
]
)
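The `Plugins` registry above keys each step by its schema id and rejects duplicates at construction time. A condensed sketch of that behaviour; `FakeStep` stands in for an arcaflow step schema and is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class FakeStep:
    id: str  # stands in for schema.StepSchema.id

class Registry:
    def __init__(self, steps):
        # Same invariant as Plugins.__init__: step ids must be unique,
        # since scenario files dispatch on the "id" field.
        self.steps_by_id = {}
        for step in steps:
            if step.id in self.steps_by_id:
                raise Exception("Duplicate step ID: {}".format(step.id))
            self.steps_by_id[step.id] = step
```

`Plugins.run` then resolves each scenario entry's `id` against this map and fails with the list of known ids when an unknown step is requested.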


@@ -0,0 +1,255 @@
import logging
import os
import random
import time
import yaml
from jinja2 import Environment, FileSystemLoader
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.models.telemetry import ScenarioTelemetry
from krkn_lib.telemetry.ocp import KrknTelemetryOpenshift
from krkn_lib.utils import get_yaml_item_value, log_exception
from krkn import cerberus, utils
from krkn.scenario_plugins.node_actions import common_node_functions
from krkn.scenario_plugins.abstract_scenario_plugin import AbstractScenarioPlugin
class NetworkChaosScenarioPlugin(AbstractScenarioPlugin):
def run(
self,
run_uuid: str,
scenario: str,
krkn_config: dict[str, any],
lib_telemetry: KrknTelemetryOpenshift,
scenario_telemetry: ScenarioTelemetry,
) -> int:
try:
with open(scenario, "r") as file:
param_lst = ["latency", "loss", "bandwidth"]
test_config = yaml.safe_load(file)
test_dict = test_config["network_chaos"]
test_duration = int(get_yaml_item_value(test_dict, "duration", 300))
test_interface = get_yaml_item_value(test_dict, "interfaces", [])
test_node = get_yaml_item_value(test_dict, "node_name", "")
test_node_label = get_yaml_item_value(
test_dict, "label_selector", "node-role.kubernetes.io/master"
)
test_execution = get_yaml_item_value(test_dict, "execution", "serial")
test_instance_count = get_yaml_item_value(
test_dict, "instance_count", 1
)
test_egress = get_yaml_item_value(
test_dict, "egress", {"bandwidth": "100mbit"}
)
if test_node:
node_name_list = test_node.split(",")
else:
node_name_list = [test_node]
nodelst = []
for single_node_name in node_name_list:
nodelst.extend(
common_node_functions.get_node(
single_node_name,
test_node_label,
test_instance_count,
lib_telemetry.get_lib_kubernetes(),
)
)
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__))
)
env = Environment(loader=file_loader, autoescape=True)
pod_template = env.get_template("pod.j2")
test_interface = self.verify_interface(
test_interface,
nodelst,
pod_template,
lib_telemetry.get_lib_kubernetes(),
)
joblst = []
egress_lst = [i for i in param_lst if i in test_egress]
chaos_config = {
"network_chaos": {
"duration": test_duration,
"interfaces": test_interface,
"node_name": ",".join(nodelst),
"execution": test_execution,
"instance_count": test_instance_count,
"egress": test_egress,
}
}
logging.info(
"Executing network chaos with config \n %s"
% yaml.dump(chaos_config)
)
job_template = env.get_template("job.j2")
try:
for i in egress_lst:
for node in nodelst:
exec_cmd = self.get_egress_cmd(
test_execution,
test_interface,
i,
test_dict["egress"],
duration=test_duration,
)
logging.info("Executing %s on node %s" % (exec_cmd, node))
job_body = yaml.safe_load(
job_template.render(
jobname=i + str(hash(node))[:5],
nodename=node,
cmd=exec_cmd,
)
)
joblst.append(job_body["metadata"]["name"])
api_response = (
lib_telemetry.get_lib_kubernetes().create_job(job_body)
)
if api_response is None:
logging.error(
"NetworkChaosScenarioPlugin Error creating job"
)
return 1
if test_execution == "serial":
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
self.wait_for_job(
joblst[:],
lib_telemetry.get_lib_kubernetes(),
test_duration + 300,
)
end_time = int(time.time())
cerberus.publish_kraken_status(
krkn_config,
None,
start_time,
end_time,
)
if test_execution == "parallel":
break
if test_execution == "parallel":
logging.info("Waiting for parallel job to finish")
start_time = int(time.time())
self.wait_for_job(
joblst[:],
lib_telemetry.get_lib_kubernetes(),
test_duration + 300,
)
end_time = int(time.time())
cerberus.publish_kraken_status(
krkn_config, [], start_time, end_time
)
except Exception as e:
logging.error(
"NetworkChaosScenarioPlugin exiting due to Exception %s" % e
)
return 1
finally:
logging.info("Deleting jobs")
self.delete_job(joblst[:], lib_telemetry.get_lib_kubernetes())
except (RuntimeError, Exception):
scenario_telemetry.exit_status = 1
return 1
else:
return 0
def verify_interface(
self, test_interface, nodelst, template, kubecli: KrknKubernetes
):
pod_index = random.randint(0, len(nodelst) - 1)
pod_body = yaml.safe_load(template.render(nodename=nodelst[pod_index]))
logging.info("Creating pod to query interface on node %s" % nodelst[pod_index])
kubecli.create_pod(pod_body, "default", 300)
try:
if test_interface == []:
cmd = "ip r | grep default | awk '/default/ {print $5}'"
output = kubecli.exec_cmd_in_pod(cmd, "fedtools", "default")
test_interface = [output.replace("\n", "")]
else:
cmd = "ip -br addr show|awk -v ORS=',' '{print $1}'"
output = kubecli.exec_cmd_in_pod(cmd, "fedtools", "default")
interface_lst = output[:-1].split(",")
for interface in test_interface:
if interface not in interface_lst:
logging.error(
"NetworkChaosScenarioPlugin Interface %s not found in node %s interface list %s"
% (interface, nodelst[pod_index], interface_lst)
)
raise RuntimeError()
return test_interface
finally:
logging.info("Deleteing pod to query interface on node")
kubecli.delete_pod("fedtools", "default")
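`verify_interface` post-processes the raw strings returned by the two `ip` commands: the default-route query yields a single interface name with a trailing newline, while the `awk -v ORS=','` form joins all interface names with trailing commas. A small sketch of that string handling, with made-up command outputs (the real values come from `exec_cmd_in_pod`):

```python
# Output of: ip r | grep default | awk '/default/ {print $5}'
default_route_output = "eth0\n"
test_interface = [default_route_output.replace("\n", "")]
assert test_interface == ["eth0"]

# Output of: ip -br addr show | awk -v ORS=',' '{print $1}'
# ORS=',' appends a comma after every record, so the last comma is
# stripped before splitting.
addr_show_output = "lo,eth0,br-ex,"
interface_lst = addr_show_output[:-1].split(",")
assert interface_lst == ["lo", "eth0", "br-ex"]
```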
# krkn_lib
def get_job_pods(self, api_response, kubecli: KrknKubernetes):
controllerUid = api_response.metadata.labels["controller-uid"]
pod_label_selector = "controller-uid=" + controllerUid
pods_list = kubecli.list_pods(
label_selector=pod_label_selector, namespace="default"
)
return pods_list[0]
# krkn_lib
def wait_for_job(self, joblst, kubecli: KrknKubernetes, timeout=300):
waittime = time.time() + timeout
count = 0
joblen = len(joblst)
while count != joblen:
for jobname in joblst:
try:
api_response = kubecli.get_job_status(jobname, namespace="default")
if (
api_response.status.succeeded is not None
or api_response.status.failed is not None
):
count += 1
joblst.remove(jobname)
except Exception:
logging.warning("Exception in getting job status")
if time.time() > waittime:
raise Exception("Starting pod failed")
time.sleep(5)
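`wait_for_job` removes finished jobs from `joblst` while iterating over it, which can skip entries within a single pass; the outer `while count != joblen` loop compensates by re-scanning, and callers pass a copy (`joblst[:]`) so the original list survives for `delete_job`. A minimal sketch of the same poll-until-done pattern that instead iterates over a snapshot, so no pass skips entries (`get_status` and the status strings are illustrative, not the krkn_lib API):

```python
import time

def wait_for_all(jobs, get_status, timeout=300, interval=5):
    # Poll each pending job; a job is done once get_status(name) reports
    # a terminal state. Iterating over list(jobs) (a snapshot) makes the
    # in-loop jobs.remove() safe.
    deadline = time.time() + timeout
    while jobs:
        for name in list(jobs):
            if get_status(name) in ("succeeded", "failed"):
                jobs.remove(name)
        if jobs and time.time() > deadline:
            raise TimeoutError(f"jobs still pending: {jobs}")
        if jobs:
            time.sleep(interval)

# Usage with a fake status source:
statuses = {"job-a": "succeeded", "job-b": "failed"}
pending = ["job-a", "job-b"]
wait_for_all(pending, statuses.get, timeout=10, interval=0)
assert pending == []
```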
# krkn_lib
def delete_job(self, joblst, kubecli: KrknKubernetes):
for jobname in joblst:
try:
api_response = kubecli.get_job_status(jobname, namespace="default")
if api_response.status.failed is not None:
pod_name = self.get_job_pods(api_response, kubecli)
pod_stat = kubecli.read_pod(name=pod_name, namespace="default")
logging.error(
f"NetworkChaosScenarioPlugin {pod_stat.status.container_statuses}"
)
pod_log_response = kubecli.get_pod_log(
name=pod_name, namespace="default"
)
pod_log = pod_log_response.data.decode("utf-8")
logging.error(pod_log)
except Exception:
logging.warning("Exception in getting job status")
kubecli.delete_job(name=jobname, namespace="default")
def get_egress_cmd(self, execution, test_interface, mod, vallst, duration=30):
tc_set = tc_unset = tc_ls = ""
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
for i in test_interface:
tc_set = "{0} tc qdisc add dev {1} root netem".format(tc_set, i)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, i)
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, i)
if execution == "parallel":
for val in vallst.keys():
tc_set += " {0} {1} ".format(param_map[val], vallst[val])
tc_set += ";"
else:
tc_set += " {0} {1} ;".format(param_map[mod], vallst[mod])
exec_cmd = "{0} {1} sleep {2};{3} sleep 20;{4}".format(
tc_set, tc_ls, duration, tc_unset, tc_ls
)
return exec_cmd
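For a single interface in serial mode, `get_egress_cmd` chains four pieces: install a netem qdisc, list qdiscs, sleep for the test duration, then delete the qdisc and list again to confirm teardown. A sketch reproducing that assembly with illustrative inputs (`eth0`, `latency: 50ms`, 30s duration):

```python
# Mirrors get_egress_cmd for execution="serial", test_interface=["eth0"],
# mod="latency", vallst={"latency": "50ms"}, duration=30 (all illustrative).
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
tc_set = tc_unset = tc_ls = ""
for i in ["eth0"]:
    tc_set = "{0} tc qdisc add dev {1} root netem".format(tc_set, i)
    tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, i)
    tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, i)
tc_set += " {0} {1} ;".format(param_map["latency"], "50ms")
exec_cmd = "{0} {1} sleep {2};{3} sleep 20;{4}".format(
    tc_set, tc_ls, 30, tc_unset, tc_ls
)
print(exec_cmd)
```

The qdisc is left in place for `duration` seconds before the `tc qdisc del` teardown runs, and the second `tc qdisc ls` verifies the interface is back to normal.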
def get_scenario_types(self) -> list[str]:
return ["network_chaos_scenarios"]

View File

@@ -1,15 +1,18 @@
import sys
import logging
import time
import kraken.invoke.command as runcommand
import kraken.node_actions.common_node_functions as nodeaction
import krkn.invoke.command as runcommand
import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
from krkn_lib.k8s import KrknKubernetes
# krkn_lib
class abstract_node_scenarios:
kubecli: KrknKubernetes
def __init__(self, kubecli: KrknKubernetes):
self.kubecli = kubecli
# Node scenario to start the node
def node_start_scenario(self, instance_kill_count, node, timeout):
pass
@@ -47,16 +50,19 @@ class abstract_node_scenarios:
try:
logging.info("Starting stop_kubelet_scenario injection")
logging.info("Stopping the kubelet of the node %s" % (node))
runcommand.run("oc debug node/" + node + " -- chroot /host systemctl stop kubelet")
runcommand.run(
"oc debug node/" + node + " -- chroot /host systemctl stop kubelet"
)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
logging.info("The kubelet of the node %s has been stopped" % (node))
logging.info("stop_kubelet_scenario has been successfuly injected!")
except Exception as e:
logging.error(
"Failed to stop the kubelet of the node. Encountered following " "exception: %s. Test Failed" % (e)
"Failed to stop the kubelet of the node. Encountered following "
"exception: %s. Test Failed" % (e)
)
logging.error("stop_kubelet_scenario injection failed!")
sys.exit(1)
raise e
# Node scenario to stop and start the kubelet
def stop_start_kubelet_scenario(self, instance_kill_count, node, timeout):
@@ -65,6 +71,29 @@ class abstract_node_scenarios:
self.node_reboot_scenario(instance_kill_count, node, timeout)
logging.info("stop_start_kubelet_scenario has been successfully injected!")
# Node scenario to restart the kubelet
def restart_kubelet_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting restart_kubelet_scenario injection")
logging.info("Restarting the kubelet of the node %s" % (node))
runcommand.run(
"oc debug node/"
+ node
+ " -- chroot /host systemctl restart kubelet &"
)
nodeaction.wait_for_not_ready_status(node, timeout, self.kubecli)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
logging.info("The kubelet of the node %s has been restarted" % (node))
logging.info("restart_kubelet_scenario has been successfuly injected!")
except Exception as e:
logging.error(
"Failed to restart the kubelet of the node. Encountered following "
"exception: %s. Test Failed" % (e)
)
logging.error("restart_kubelet_scenario injection failed!")
raise e
# Node scenario to crash the node
def node_crash_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
@@ -72,13 +101,17 @@ class abstract_node_scenarios:
logging.info("Starting node_crash_scenario injection")
logging.info("Crashing the node %s" % (node))
runcommand.invoke(
"oc debug node/" + node + " -- chroot /host " "dd if=/dev/urandom of=/proc/sysrq-trigger"
"oc debug node/" + node + " -- chroot /host "
"dd if=/dev/urandom of=/proc/sysrq-trigger"
)
logging.info("node_crash_scenario has been successfuly injected!")
except Exception as e:
logging.error("Failed to crash the node. Encountered following exception: %s. " "Test Failed" % (e))
logging.error(
"Failed to crash the node. Encountered following exception: %s. "
"Test Failed" % (e)
)
logging.error("node_crash_scenario injection failed!")
sys.exit(1)
raise e
# Node scenario to check service status on helper node
def node_service_status(self, node, service, ssh_private_key, timeout):

View File

@@ -1,13 +1,22 @@
import sys
import time
import logging
import kraken.node_actions.common_node_functions as nodeaction
import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
import os
import json
from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526 import DescribeInstancesRequest, DeleteInstanceRequest
from aliyunsdkecs.request.v20140526 import StopInstanceRequest, StartInstanceRequest, RebootInstanceRequest
from kraken.node_actions.abstract_node_scenarios import abstract_node_scenarios
from aliyunsdkecs.request.v20140526 import (
DescribeInstancesRequest,
DeleteInstanceRequest,
)
from aliyunsdkecs.request.v20140526 import (
StopInstanceRequest,
StartInstanceRequest,
RebootInstanceRequest,
)
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
abstract_node_scenarios,
)
from krkn_lib.k8s import KrknKubernetes
@@ -46,12 +55,12 @@ class Alibaba:
"variables/credentials are correct"
)
logging.error(response)
sys.exit(1)
raise RuntimeError(response)
return instance_list
return []
except Exception as e:
logging.error("ERROR while trying to get list of instances " + str(e))
sys.exit(1)
raise e
# Get the instance ID of the node
def get_instance_id(self, node_name):
@@ -59,8 +68,16 @@ class Alibaba:
for vm in vm_list:
if node_name == vm["InstanceName"]:
return vm["InstanceId"]
logging.error("Couldn't find vm with name " + str(node_name) + ", you could try another region")
sys.exit(1)
logging.error(
"Couldn't find vm with name "
+ str(node_name)
+ ", you could try another region"
)
raise RuntimeError(
"Couldn't find vm with name "
+ str(node_name)
+ ", you could try another region"
)
# Start the node instance
def start_instances(self, instance_id):
@@ -72,9 +89,10 @@ class Alibaba:
logging.info("ECS instance with id " + str(instance_id) + " started")
except Exception as e:
logging.error(
"Failed to start node instance %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to start node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
sys.exit(1)
raise e
# https://partners-intl.aliyun.com/help/en/doc-detail/93110.html
# Stop the node instance
@@ -86,8 +104,11 @@ class Alibaba:
self._send_request(request)
logging.info("Stop %s command submit successfully.", instance_id)
except Exception as e:
logging.error("Failed to stop node instance %s. Encountered following " "exception: %s." % (instance_id, e))
sys.exit(1)
logging.error(
"Failed to stop node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
raise e
# Terminate the node instance
def release_instance(self, instance_id, force_release=True):
@@ -99,9 +120,10 @@ class Alibaba:
logging.info("ECS Instance " + str(instance_id) + " released")
except Exception as e:
logging.error(
"Failed to terminate node instance %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to terminate node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
sys.exit(1)
raise e
# Reboot the node instance
def reboot_instances(self, instance_id, force_reboot=True):
@@ -113,9 +135,10 @@ class Alibaba:
logging.info("ECS Instance " + str(instance_id) + " rebooted")
except Exception as e:
logging.error(
"Failed to reboot node instance %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to reboot node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
sys.exit(1)
raise e
def get_vm_status(self, instance_id):
@@ -132,7 +155,8 @@ class Alibaba:
return "Unknown"
except Exception as e:
logging.error(
"Failed to get node instance status %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to get node instance status %s. Encountered following "
"exception: %s." % (instance_id, e)
)
return None
@@ -142,7 +166,9 @@ class Alibaba:
status = self.get_vm_status(instance_id)
while status != "Running":
status = self.get_vm_status(instance_id)
logging.info("ECS %s is still not running, sleeping for 5 seconds" % instance_id)
logging.info(
"ECS %s is still not running, sleeping for 5 seconds" % instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
@@ -156,11 +182,15 @@ class Alibaba:
status = self.get_vm_status(instance_id)
while status != "Stopped":
status = self.get_vm_status(instance_id)
logging.info("Vm %s is still stopping, sleeping for 5 seconds" % instance_id)
logging.info(
"Vm %s is still stopping, sleeping for 5 seconds" % instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info("Vm %s is still not stopped in allotted time" % instance_id)
logging.info(
"Vm %s is still not stopped in allotted time" % instance_id
)
return False
return True
@@ -170,7 +200,9 @@ class Alibaba:
time_counter = 0
while statuses and statuses != "Released":
statuses = self.get_vm_status(instance_id)
logging.info("ECS %s is still being released, waiting 10 seconds" % instance_id)
logging.info(
"ECS %s is still being released, waiting 10 seconds" % instance_id
)
time.sleep(10)
time_counter += 10
if time_counter >= timeout:
@@ -180,9 +212,10 @@ class Alibaba:
logging.info("ECS %s is released" % instance_id)
return True
# krkn_lib
class alibaba_node_scenarios(abstract_node_scenarios):
def __init__(self,kubecli: KrknKubernetes):
def __init__(self, kubecli: KrknKubernetes):
self.alibaba = Alibaba()
# Node scenario to start the node
@@ -191,7 +224,9 @@ class alibaba_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_start_scenario injection")
vm_id = self.alibaba.get_instance_id(node)
logging.info("Starting the node %s with instance ID: %s " % (node, vm_id))
logging.info(
"Starting the node %s with instance ID: %s " % (node, vm_id)
)
self.alibaba.start_instances(vm_id)
self.alibaba.wait_until_running(vm_id, timeout)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
@@ -199,10 +234,11 @@ class alibaba_node_scenarios(abstract_node_scenarios):
logging.info("node_start_scenario has been successfully injected!")
except Exception as e:
logging.error(
"Failed to start node instance. Encountered following " "exception: %s. Test Failed" % (e)
"Failed to start node instance. Encountered following "
"exception: %s. Test Failed" % (e)
)
logging.error("node_start_scenario injection failed!")
sys.exit(1)
raise e
# Node scenario to stop the node
def node_stop_scenario(self, instance_kill_count, node, timeout):
@@ -210,36 +246,48 @@ class alibaba_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_stop_scenario injection")
vm_id = self.alibaba.get_instance_id(node)
logging.info("Stopping the node %s with instance ID: %s " % (node, vm_id))
logging.info(
"Stopping the node %s with instance ID: %s " % (node, vm_id)
)
self.alibaba.stop_instances(vm_id)
self.alibaba.wait_until_stopped(vm_id, timeout)
logging.info("Node with instance ID: %s is in stopped state" % vm_id)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
except Exception as e:
logging.error("Failed to stop node instance. Encountered following exception: %s. " "Test Failed" % e)
logging.error(
"Failed to stop node instance. Encountered following exception: %s. "
"Test Failed" % e
)
logging.error("node_stop_scenario injection failed!")
sys.exit(1)
raise e
# Might need to stop and then release the instance
# Node scenario to terminate the node
def node_termination_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_termination_scenario injection by first stopping instance")
logging.info(
"Starting node_termination_scenario injection by first stopping instance"
)
vm_id = self.alibaba.get_instance_id(node)
self.alibaba.stop_instances(vm_id)
self.alibaba.wait_until_stopped(vm_id, timeout)
logging.info("Releasing the node %s with instance ID: %s " % (node, vm_id))
logging.info(
"Releasing the node %s with instance ID: %s " % (node, vm_id)
)
self.alibaba.release_instance(vm_id)
self.alibaba.wait_until_released(vm_id, timeout)
logging.info("Node with instance ID: %s has been released" % node)
logging.info("node_termination_scenario has been successfully injected!")
logging.info(
"node_termination_scenario has been successfully injected!"
)
except Exception as e:
logging.error(
"Failed to release node instance. Encountered following exception:" " %s. Test Failed" % (e)
"Failed to release node instance. Encountered following exception:"
" %s. Test Failed" % (e)
)
logging.error("node_termination_scenario injection failed!")
sys.exit(1)
raise e
# Node scenario to reboot the node
def node_reboot_scenario(self, instance_kill_count, node, timeout):
@@ -251,11 +299,14 @@ class alibaba_node_scenarios(abstract_node_scenarios):
self.alibaba.reboot_instances(instance_id)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
logging.info("Node with instance ID: %s has been rebooted" % (instance_id))
logging.info(
"Node with instance ID: %s has been rebooted" % (instance_id)
)
logging.info("node_reboot_scenario has been successfully injected!")
except Exception as e:
logging.error(
"Failed to reboot node instance. Encountered following exception:" " %s. Test Failed" % (e)
"Failed to reboot node instance. Encountered following exception:"
" %s. Test Failed" % (e)
)
logging.error("node_reboot_scenario injection failed!")
sys.exit(1)
raise e

View File

@@ -2,10 +2,13 @@ import sys
import time
import boto3
import logging
import kraken.node_actions.common_node_functions as nodeaction
from kraken.node_actions.abstract_node_scenarios import abstract_node_scenarios
import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
abstract_node_scenarios,
)
from krkn_lib.k8s import KrknKubernetes
class AWS:
def __init__(self):
self.boto_client = boto3.client("ec2")
@@ -13,7 +16,11 @@ class AWS:
# Get the instance ID of the node
def get_instance_id(self, node):
return self.boto_client.describe_instances(Filters=[{"Name": "private-dns-name", "Values": [node]}])[
instance = self.boto_client.describe_instances(Filters=[{"Name": "private-dns-name", "Values": [node]}])
if len(instance['Reservations']) == 0:
node = node[3:].replace('-','.')
instance = self.boto_client.describe_instances(Filters=[{"Name": "private-ip-address", "Values": [node]}])
return instance[
"Reservations"
][0]["Instances"][0]["InstanceId"]
@@ -24,10 +31,9 @@ class AWS:
logging.info("EC2 instance: " + str(instance_id) + " started")
except Exception as e:
logging.error(
"Failed to start node instance %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to start node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Stop the node instance
@@ -36,9 +42,10 @@ class AWS:
self.boto_client.stop_instances(InstanceIds=[instance_id])
logging.info("EC2 instance: " + str(instance_id) + " stopped")
except Exception as e:
logging.error("Failed to stop node instance %s. Encountered following " "exception: %s." % (instance_id, e))
# removed_exit
# sys.exit(1)
logging.error(
"Failed to stop node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
raise RuntimeError()
# Terminate the node instance
@@ -48,10 +55,9 @@ class AWS:
logging.info("EC2 instance: " + str(instance_id) + " terminated")
except Exception as e:
logging.error(
"Failed to terminate node instance %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to terminate node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Reboot the node instance
@@ -61,10 +67,9 @@ class AWS:
logging.info("EC2 instance " + str(instance_id) + " rebooted")
except Exception as e:
logging.error(
"Failed to reboot node instance %s. Encountered following " "exception: %s." % (instance_id, e)
"Failed to reboot node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Below functions poll EC2.Client.describe_instances() every 15 seconds
@@ -76,7 +81,10 @@ class AWS:
self.boto_instance.wait_until_running(InstanceIds=[instance_id])
return True
except Exception as e:
logging.error("Failed to get status waiting for %s to be running %s" % (instance_id, e))
logging.error(
"Failed to get status waiting for %s to be running %s"
% (instance_id, e)
)
return False
# Wait until the node instance is stopped
@@ -85,7 +93,10 @@ class AWS:
self.boto_instance.wait_until_stopped(InstanceIds=[instance_id])
return True
except Exception as e:
logging.error("Failed to get status waiting for %s to be stopped %s" % (instance_id, e))
logging.error(
"Failed to get status waiting for %s to be stopped %s"
% (instance_id, e)
)
return False
# Wait until the node instance is terminated
@@ -94,7 +105,10 @@ class AWS:
self.boto_instance.wait_until_terminated(InstanceIds=[instance_id])
return True
except Exception as e:
logging.error("Failed to get status waiting for %s to be terminated %s" % (instance_id, e))
logging.error(
"Failed to get status waiting for %s to be terminated %s"
% (instance_id, e)
)
return False
# Creates a deny network acl and returns the id
@@ -107,10 +121,10 @@ class AWS:
except Exception as e:
logging.error(
"Failed to create the default network_acl: %s"
"Make sure you have aws cli configured on the host and set for the region of your vpc/subnet" % (e)
"Make sure you have aws cli configured on the host and set for the region of your vpc/subnet"
% (e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
return acl_id
@@ -118,13 +132,14 @@ class AWS:
def replace_network_acl_association(self, association_id, acl_id):
try:
logging.info("Replacing the network acl associated with the subnet")
status = self.boto_client.replace_network_acl_association(AssociationId=association_id, NetworkAclId=acl_id)
status = self.boto_client.replace_network_acl_association(
AssociationId=association_id, NetworkAclId=acl_id
)
logging.info(status)
new_association_id = status["NewAssociationId"]
except Exception as e:
logging.error("Failed to replace network acl association: %s" % (e))
# removed_exit
# sys.exit(1)
raise RuntimeError()
return new_association_id
@@ -140,10 +155,10 @@ class AWS:
except Exception as e:
logging.error(
"Failed to describe network acl: %s."
"Make sure you have aws cli configured on the host and set for the region of your vpc/subnet" % (e)
"Make sure you have aws cli configured on the host and set for the region of your vpc/subnet"
% (e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
associations = response["NetworkAcls"][0]["Associations"]
# grab the current network_acl in use
@@ -161,10 +176,10 @@ class AWS:
"Make sure you have aws cli configured on the host and set for the region of your vpc/subnet"
% (acl_id, e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
# krkn_lib
class aws_node_scenarios(abstract_node_scenarios):
def __init__(self, kubecli: KrknKubernetes):
@@ -177,19 +192,23 @@ class aws_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_start_scenario injection")
instance_id = self.aws.get_instance_id(node)
logging.info("Starting the node %s with instance ID: %s " % (node, instance_id))
logging.info(
"Starting the node %s with instance ID: %s " % (node, instance_id)
)
self.aws.start_instances(instance_id)
self.aws.wait_until_running(instance_id)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
logging.info("Node with instance ID: %s is in running state" % (instance_id))
logging.info(
"Node with instance ID: %s is in running state" % (instance_id)
)
logging.info("node_start_scenario has been successfully injected!")
except Exception as e:
logging.error(
"Failed to start node instance. Encountered following " "exception: %s. Test Failed" % (e)
"Failed to start node instance. Encountered following "
"exception: %s. Test Failed" % (e)
)
logging.error("node_start_scenario injection failed!")
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Node scenario to stop the node
@@ -198,16 +217,22 @@ class aws_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_stop_scenario injection")
instance_id = self.aws.get_instance_id(node)
logging.info("Stopping the node %s with instance ID: %s " % (node, instance_id))
logging.info(
"Stopping the node %s with instance ID: %s " % (node, instance_id)
)
self.aws.stop_instances(instance_id)
self.aws.wait_until_stopped(instance_id)
logging.info("Node with instance ID: %s is in stopped state" % (instance_id))
logging.info(
"Node with instance ID: %s is in stopped state" % (instance_id)
)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
except Exception as e:
logging.error("Failed to stop node instance. Encountered following exception: %s. " "Test Failed" % (e))
logging.error(
"Failed to stop node instance. Encountered following exception: %s. "
"Test Failed" % (e)
)
logging.error("node_stop_scenario injection failed!")
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Node scenario to terminate the node
@@ -216,7 +241,10 @@ class aws_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_termination_scenario injection")
instance_id = self.aws.get_instance_id(node)
logging.info("Terminating the node %s with instance ID: %s " % (node, instance_id))
logging.info(
"Terminating the node %s with instance ID: %s "
% (node, instance_id)
)
self.aws.terminate_instances(instance_id)
self.aws.wait_until_terminated(instance_id)
for _ in range(timeout):
@@ -225,15 +253,17 @@ class aws_node_scenarios(abstract_node_scenarios):
time.sleep(1)
if node in self.kubecli.list_nodes():
raise Exception("Node could not be terminated")
logging.info("Node with instance ID: %s has been terminated" % (instance_id))
logging.info(
"Node with instance ID: %s has been terminated" % (instance_id)
)
logging.info("node_termination_scenario has been successfuly injected!")
except Exception as e:
logging.error(
"Failed to terminate node instance. Encountered following exception:" " %s. Test Failed" % (e)
"Failed to terminate node instance. Encountered following exception:"
" %s. Test Failed" % (e)
)
logging.error("node_termination_scenario injection failed!")
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Node scenario to reboot the node
@@ -242,17 +272,21 @@ class aws_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_reboot_scenario injection" + str(node))
instance_id = self.aws.get_instance_id(node)
logging.info("Rebooting the node %s with instance ID: %s " % (node, instance_id))
logging.info(
"Rebooting the node %s with instance ID: %s " % (node, instance_id)
)
self.aws.reboot_instances(instance_id)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
logging.info("Node with instance ID: %s has been rebooted" % (instance_id))
logging.info(
"Node with instance ID: %s has been rebooted" % (instance_id)
)
logging.info("node_reboot_scenario has been successfuly injected!")
except Exception as e:
logging.error(
"Failed to reboot node instance. Encountered following exception:" " %s. Test Failed" % (e)
"Failed to reboot node instance. Encountered following exception:"
" %s. Test Failed" % (e)
)
logging.error("node_reboot_scenario injection failed!")
# removed_exit
# sys.exit(1)
raise RuntimeError()

View File

@@ -1,16 +1,15 @@
import time
import os
import kraken.invoke.command as runcommand
import logging
import kraken.node_actions.common_node_functions as nodeaction
from kraken.node_actions.abstract_node_scenarios import abstract_node_scenarios
import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
abstract_node_scenarios,
)
from azure.mgmt.compute import ComputeManagementClient
from azure.identity import DefaultAzureCredential
from krkn_lib.k8s import KrknKubernetes
class Azure:
def __init__(self):
logging.info("azure " + str(self))
@@ -39,9 +38,10 @@ class Azure:
self.compute_client.virtual_machines.begin_start(group_name, vm_name)
logging.info("vm name " + str(vm_name) + " started")
except Exception as e:
logging.error("Failed to start node instance %s. Encountered following " "exception: %s." % (vm_name, e))
# removed_exit
# sys.exit(1)
logging.error(
"Failed to start node instance %s. Encountered following "
"exception: %s." % (vm_name, e)
)
raise RuntimeError()
# Stop the node instance
@@ -50,9 +50,10 @@ class Azure:
self.compute_client.virtual_machines.begin_power_off(group_name, vm_name)
logging.info("vm name " + str(vm_name) + " stopped")
except Exception as e:
logging.error("Failed to stop node instance %s. Encountered following " "exception: %s." % (vm_name, e))
# removed_exit
# sys.exit(1)
logging.error(
"Failed to stop node instance %s. Encountered following "
"exception: %s." % (vm_name, e)
)
raise RuntimeError()
# Terminate the node instance
@@ -62,10 +63,10 @@ class Azure:
logging.info("vm name " + str(vm_name) + " terminated")
except Exception as e:
logging.error(
"Failed to terminate node instance %s. Encountered following " "exception: %s." % (vm_name, e)
"Failed to terminate node instance %s. Encountered following "
"exception: %s." % (vm_name, e)
)
# removed_exit
# sys.exit(1)
raise RuntimeError()
# Reboot the node instance
@@ -74,13 +75,17 @@ class Azure:
self.compute_client.virtual_machines.begin_restart(group_name, vm_name)
logging.info("vm name " + str(vm_name) + " rebooted")
except Exception as e:
logging.error("Failed to reboot node instance %s. Encountered following " "exception: %s." % (vm_name, e))
# removed_exit
# sys.exit(1)
logging.error(
"Failed to reboot node instance %s. Encountered following "
"exception: %s." % (vm_name, e)
)
raise RuntimeError()
def get_vm_status(self, resource_group, vm_name):
-statuses = self.compute_client.virtual_machines.instance_view(resource_group, vm_name).statuses
+statuses = self.compute_client.virtual_machines.instance_view(
+    resource_group, vm_name
+).statuses
status = len(statuses) >= 2 and statuses[1]
return status
@@ -114,12 +119,16 @@ class Azure:
# Wait until the node instance is terminated
def wait_until_terminated(self, resource_group, vm_name, timeout):
-statuses = self.compute_client.virtual_machines.instance_view(resource_group, vm_name).statuses[0]
+statuses = self.compute_client.virtual_machines.instance_view(
+    resource_group, vm_name
+).statuses[0]
logging.info("vm status " + str(statuses))
time_counter = 0
while statuses.code == "ProvisioningState/deleting":
try:
-statuses = self.compute_client.virtual_machines.instance_view(resource_group, vm_name).statuses[0]
+statuses = self.compute_client.virtual_machines.instance_view(
+    resource_group, vm_name
+).statuses[0]
logging.info("Vm %s is still deleting, waiting 10 seconds" % vm_name)
time.sleep(10)
time_counter += 10
@@ -130,6 +139,7 @@ class Azure:
logging.info("Vm %s is terminated" % vm_name)
return True
# krkn_lib
class azure_node_scenarios(abstract_node_scenarios):
def __init__(self, kubecli: KrknKubernetes):
@@ -143,19 +153,22 @@ class azure_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_start_scenario injection")
vm_name, resource_group = self.azure.get_instance_id(node)
-logging.info("Starting the node %s with instance ID: %s " % (vm_name, resource_group))
+logging.info(
+    "Starting the node %s with instance ID: %s "
+    % (vm_name, resource_group)
+)
self.azure.start_instances(resource_group, vm_name)
self.azure.wait_until_running(resource_group, vm_name, timeout)
-nodeaction.wait_for_ready_status(vm_name, timeout,self.kubecli)
+nodeaction.wait_for_ready_status(vm_name, timeout, self.kubecli)
logging.info("Node with instance ID: %s is in running state" % node)
logging.info("node_start_scenario has been successfully injected!")
except Exception as e:
logging.error(
-"Failed to start node instance. Encountered following " "exception: %s. Test Failed" % (e)
+"Failed to start node instance. Encountered following "
+"exception: %s. Test Failed" % (e)
)
logging.error("node_start_scenario injection failed!")
-# removed_exit
-# sys.exit(1)
+raise RuntimeError()
# Node scenario to stop the node
@@ -164,16 +177,21 @@ class azure_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_stop_scenario injection")
vm_name, resource_group = self.azure.get_instance_id(node)
-logging.info("Stopping the node %s with instance ID: %s " % (vm_name, resource_group))
+logging.info(
+    "Stopping the node %s with instance ID: %s "
+    % (vm_name, resource_group)
+)
self.azure.stop_instances(resource_group, vm_name)
self.azure.wait_until_stopped(resource_group, vm_name, timeout)
logging.info("Node with instance ID: %s is in stopped state" % vm_name)
nodeaction.wait_for_unknown_status(vm_name, timeout, self.kubecli)
except Exception as e:
-logging.error("Failed to stop node instance. Encountered following exception: %s. " "Test Failed" % e)
+logging.error(
+    "Failed to stop node instance. Encountered following exception: %s. "
+    "Test Failed" % e
+)
logging.error("node_stop_scenario injection failed!")
-# removed_exit
-# sys.exit(1)
+raise RuntimeError()
# Node scenario to terminate the node
@@ -182,7 +200,10 @@ class azure_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_termination_scenario injection")
vm_name, resource_group = self.azure.get_instance_id(node)
-logging.info("Terminating the node %s with instance ID: %s " % (vm_name, resource_group))
+logging.info(
+    "Terminating the node %s with instance ID: %s "
+    % (vm_name, resource_group)
+)
self.azure.terminate_instances(resource_group, vm_name)
self.azure.wait_until_terminated(resource_group, vm_name, timeout)
for _ in range(timeout):
@@ -192,14 +213,16 @@ class azure_node_scenarios(abstract_node_scenarios):
if vm_name in self.kubecli.list_nodes():
raise Exception("Node could not be terminated")
logging.info("Node with instance ID: %s has been terminated" % node)
-logging.info("node_termination_scenario has been successfully injected!")
+logging.info(
+    "node_termination_scenario has been successfully injected!"
+)
except Exception as e:
logging.error(
-"Failed to terminate node instance. Encountered following exception:" " %s. Test Failed" % (e)
+"Failed to terminate node instance. Encountered following exception:"
+" %s. Test Failed" % (e)
)
logging.error("node_termination_scenario injection failed!")
-# removed_exit
-# sys.exit(1)
+raise RuntimeError()
# Node scenario to reboot the node
@@ -208,7 +231,10 @@ class azure_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_reboot_scenario injection")
vm_name, resource_group = self.azure.get_instance_id(node)
-logging.info("Rebooting the node %s with instance ID: %s " % (vm_name, resource_group))
+logging.info(
+    "Rebooting the node %s with instance ID: %s "
+    % (vm_name, resource_group)
+)
self.azure.reboot_instances(resource_group, vm_name)
nodeaction.wait_for_unknown_status(vm_name, timeout, self.kubecli)
nodeaction.wait_for_ready_status(vm_name, timeout, self.kubecli)
@@ -216,9 +242,9 @@ class azure_node_scenarios(abstract_node_scenarios):
logging.info("node_reboot_scenario has been successfully injected!")
except Exception as e:
logging.error(
-"Failed to reboot node instance. Encountered following exception:" " %s. Test Failed" % (e)
+"Failed to reboot node instance. Encountered following exception:"
+" %s. Test Failed" % (e)
)
logging.error("node_reboot_scenario injection failed!")
-# removed_exit
-# sys.exit(1)
+raise RuntimeError()
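The recurring change in this file is replacing `sys.exit(1)` with `raise RuntimeError()`, so a failed cloud operation surfaces as an exception the scenario runner can handle rather than terminating the whole process. A minimal runnable sketch of that pattern, with hypothetical stand-ins (`start_vm`, `run_scenario`) in place of the real krkn and Azure SDK APIs:

```python
import logging

# Stand-in for compute_client.virtual_machines.begin_start(...);
# it always fails so the error path below is exercised.
def start_vm(group_name: str, vm_name: str) -> None:
    raise ConnectionError("simulated cloud API failure")


def start_instances(group_name: str, vm_name: str) -> None:
    try:
        start_vm(group_name, vm_name)
        logging.info("vm name %s started", vm_name)
    except Exception as e:
        logging.error(
            "Failed to start node instance %s. Encountered following "
            "exception: %s." % (vm_name, e)
        )
        # Raise instead of sys.exit(1): the caller decides what a
        # failure means for the overall run.
        raise RuntimeError() from e


def run_scenario() -> int:
    # A hypothetical runner turns the exception into a failed-scenario
    # result instead of the whole process exiting.
    try:
        start_instances("demo-rg", "demo-vm")
        return 0
    except RuntimeError:
        return 1


print(run_scenario())
```

This keeps one chaos scenario's failure from aborting the remaining scenarios in a run, which matches the plugin-oriented refactor described in the commit message.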

View File

@@ -1,14 +1,16 @@
-import kraken.node_actions.common_node_functions as nodeaction
-from kraken.node_actions.abstract_node_scenarios import abstract_node_scenarios
+import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
+from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
+    abstract_node_scenarios,
+)
import logging
import openshift as oc
import pyipmi
import pyipmi.interfaces
import sys
import time
import traceback
from krkn_lib.k8s import KrknKubernetes
class BM:
def __init__(self, bm_info, user, passwd):
self.user = user
@@ -22,7 +24,11 @@ class BM:
# Get the ipmi or other BMC address of the baremetal node
def get_bmc_addr(self, node_name):
# Addresses in the config get higher priority.
-if self.bm_info is not None and node_name in self.bm_info and "bmc_addr" in self.bm_info[node_name]:
+if (
+    self.bm_info is not None
+    and node_name in self.bm_info
+    and "bmc_addr" in self.bm_info[node_name]
+):
return self.bm_info[node_name]["bmc_addr"]
# Get the bmc addr from the BareMetalHost object.
@@ -40,7 +46,10 @@ class BM:
'BMC addr empty for node "%s". Either fix the BMH object,'
" or specify the address in the scenario config" % node_name
)
-sys.exit(1)
+raise RuntimeError(
+    'BMC addr empty for node "%s". Either fix the BMH object,'
+    " or specify the address in the scenario config" % node_name
+)
return bmh_object.model.spec.bmc.address
def get_ipmi_connection(self, bmc_addr, node_name):
@@ -69,10 +78,15 @@ class BM:
"Missing IPMI BMI user and/or password for baremetal cloud. "
"Please specify either a global or per-machine user and pass"
)
-sys.exit(1)
+raise RuntimeError(
+    "Missing IPMI BMI user and/or password for baremetal cloud. "
+    "Please specify either a global or per-machine user and pass"
+)
# Establish connection
-interface = pyipmi.interfaces.create_interface("ipmitool", interface_type="lanplus")
+interface = pyipmi.interfaces.create_interface(
+    "ipmitool", interface_type="lanplus"
+)
connection = pyipmi.create_connection(interface)
@@ -96,14 +110,21 @@ class BM:
# Wait until the node instance is running
def wait_until_running(self, bmc_addr, node_name):
-while not self.get_ipmi_connection(bmc_addr, node_name).get_chassis_status().power_on:
+while (
+    not self.get_ipmi_connection(bmc_addr, node_name)
+    .get_chassis_status()
+    .power_on
+):
time.sleep(1)
# Wait until the node instance is stopped
def wait_until_stopped(self, bmc_addr, node_name):
-while self.get_ipmi_connection(bmc_addr, node_name).get_chassis_status().power_on:
+while (
+    self.get_ipmi_connection(bmc_addr, node_name).get_chassis_status().power_on
+):
time.sleep(1)
# krkn_lib
class bm_node_scenarios(abstract_node_scenarios):
def __init__(self, bm_info, user, passwd, kubecli: KrknKubernetes):
@@ -116,11 +137,15 @@ class bm_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_start_scenario injection")
bmc_addr = self.bm.get_bmc_addr(node)
-logging.info("Starting the node %s with bmc address: %s " % (node, bmc_addr))
+logging.info(
+    "Starting the node %s with bmc address: %s " % (node, bmc_addr)
+)
self.bm.start_instances(bmc_addr, node)
self.bm.wait_until_running(bmc_addr, node)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
-logging.info("Node with bmc address: %s is in running state" % (bmc_addr))
+logging.info(
+    "Node with bmc address: %s is in running state" % (bmc_addr)
+)
logging.info("node_start_scenario has been successfully injected!")
except Exception as e:
logging.error(
@@ -129,7 +154,7 @@ class bm_node_scenarios(abstract_node_scenarios):
"an incorrect ipmi address or login" % (e)
)
logging.error("node_start_scenario injection failed!")
-sys.exit(1)
+raise e
# Node scenario to stop the node
def node_stop_scenario(self, instance_kill_count, node, timeout):
@@ -137,10 +162,14 @@ class bm_node_scenarios(abstract_node_scenarios):
try:
logging.info("Starting node_stop_scenario injection")
bmc_addr = self.bm.get_bmc_addr(node)
-logging.info("Stopping the node %s with bmc address: %s " % (node, bmc_addr))
+logging.info(
+    "Stopping the node %s with bmc address: %s " % (node, bmc_addr)
+)
self.bm.stop_instances(bmc_addr, node)
self.bm.wait_until_stopped(bmc_addr, node)
-logging.info("Node with bmc address: %s is in stopped state" % (bmc_addr))
+logging.info(
+    "Node with bmc address: %s is in stopped state" % (bmc_addr)
+)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
except Exception as e:
logging.error(
@@ -149,7 +178,7 @@ class bm_node_scenarios(abstract_node_scenarios):
"an incorrect ipmi address or login" % (e)
)
logging.error("node_stop_scenario injection failed!")
-sys.exit(1)
+raise e
# Node scenario to terminate the node
def node_termination_scenario(self, instance_kill_count, node, timeout):
@@ -162,7 +191,9 @@ class bm_node_scenarios(abstract_node_scenarios):
logging.info("Starting node_reboot_scenario injection")
bmc_addr = self.bm.get_bmc_addr(node)
logging.info("BMC Addr: %s" % (bmc_addr))
-logging.info("Rebooting the node %s with bmc address: %s " % (node, bmc_addr))
+logging.info(
+    "Rebooting the node %s with bmc address: %s " % (node, bmc_addr)
+)
self.bm.reboot_instances(bmc_addr, node)
nodeaction.wait_for_unknown_status(node, timeout, self.kubecli)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
@@ -176,4 +207,4 @@ class bm_node_scenarios(abstract_node_scenarios):
)
traceback.print_exc()
logging.error("node_reboot_scenario injection failed!")
-sys.exit(1)
+raise e
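The `wait_until_running` and `wait_until_stopped` methods in this file poll the chassis power state over IPMI until it reaches the desired value. A runnable sketch of that polling loop, with the pyipmi connection replaced by a hypothetical `FakeChassis` so the control flow can be exercised without hardware:

```python
import time


class FakeChassis:
    """Hypothetical stand-in for an IPMI connection: replays a fixed
    sequence of observed power states, holding the last one forever."""

    def __init__(self, transitions):
        self._transitions = list(transitions)

    def get_chassis_status(self):
        power_on = (
            self._transitions.pop(0)
            if len(self._transitions) > 1
            else self._transitions[0]
        )
        # Mimic pyipmi's status object, which exposes a power_on flag.
        return type("Status", (), {"power_on": power_on})()


def wait_until_running(chassis, poll_interval=0.0):
    # Mirrors BM.wait_until_running: spin until power_on is True.
    polls = 0
    while not chassis.get_chassis_status().power_on:
        time.sleep(poll_interval)
        polls += 1
    return polls


chassis = FakeChassis([False, False, True])
print(wait_until_running(chassis))  # polls twice before power_on
```

Note the real methods re-establish the IPMI connection on every iteration (`get_ipmi_connection(...)` inside the loop), which tolerates a BMC that drops sessions during power transitions.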

View File

@@ -2,8 +2,9 @@ import time
import random
import logging
import paramiko
-import kraken.invoke.command as runcommand
+import krkn.invoke.command as runcommand
from krkn_lib.k8s import KrknKubernetes
node_general = False
@@ -12,7 +13,10 @@ def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubern
if node_name in kubecli.list_killable_nodes():
return [node_name]
elif node_name:
-logging.info("Node with provided node_name does not exist or the node might " "be in NotReady state.")
+logging.info(
+    "Node with provided node_name does not exist or the node might "
+    "be in NotReady state."
+)
nodes = kubecli.list_killable_nodes(label_selector)
if not nodes:
raise Exception("Ready nodes with the provided label selector do not exist")
@@ -34,12 +38,14 @@ def wait_for_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "True", timeout, resource_version)
# krkn_lib
# Wait until the node status becomes Not Ready
def wait_for_not_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "False", timeout, resource_version)
# krkn_lib
# Wait until the node status becomes Unknown
def wait_for_unknown_status(node, timeout, kubecli: KrknKubernetes):
@@ -50,7 +56,8 @@ def wait_for_unknown_status(node, timeout, kubecli: KrknKubernetes):
# Get the ip of the cluster node
def get_node_ip(node):
return runcommand.invoke(
-"kubectl get node %s -o " "jsonpath='{.status.addresses[?(@.type==\"InternalIP\")].address}'" % (node)
+"kubectl get node %s -o "
+"jsonpath='{.status.addresses[?(@.type==\"InternalIP\")].address}'" % (node)
)
@@ -74,15 +81,23 @@ def check_service_status(node, service, ssh_private_key, timeout):
if connection is None:
break
except Exception as e:
-logging.error("Failed to ssh to instance: %s within the timeout duration of %s: %s" % (node, timeout, e))
+logging.error(
+    "Failed to ssh to instance: %s within the timeout duration of %s: %s"
+    % (node, timeout, e)
+)
for service_name in service:
logging.info("Checking status of Service: %s" % (service_name))
stdin, stdout, stderr = ssh.exec_command(
-"systemctl status %s | grep '^ Active' " "| awk '{print $2}'" % (service_name)
+"systemctl status %s | grep '^ Active' "
+"| awk '{print $2}'" % (service_name)
)
service_status = stdout.readlines()[0]
-logging.info("Status of service %s is %s \n" % (service_name, service_status.strip()))
+logging.info(
+    "Status of service %s is %s \n" % (service_name, service_status.strip())
+)
if service_status.strip() != "active":
-logging.error("Service %s is in %s state" % (service_name, service_status.strip()))
+logging.error(
+    "Service %s is in %s state" % (service_name, service_status.strip())
+)
ssh.close()
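The `get_node` helper at the top of this file resolves a node name or label selector to the list of nodes a scenario will target: an exact Ready match wins, otherwise random Ready nodes are sampled. A simplified, runnable sketch of that selection logic, with the `KrknKubernetes` client replaced by a plain list of killable node names (an assumption for illustration):

```python
import logging
import random


def get_node(node_name, killable_nodes, instance_kill_count):
    # Exact match against currently killable (Ready) nodes wins.
    if node_name in killable_nodes:
        return [node_name]
    if node_name:
        logging.info(
            "Node with provided node_name does not exist or the node might "
            "be in NotReady state."
        )
    if not killable_nodes:
        raise Exception("Ready nodes with the provided label selector do not exist")
    # Otherwise pick instance_kill_count distinct nodes at random.
    return random.sample(killable_nodes, instance_kill_count)


print(get_node("worker-1", ["worker-1", "worker-2"], 1))
```

The real helper queries the cluster via `kubecli.list_killable_nodes(label_selector)` on each call, so the candidate set always reflects current node readiness rather than a cached list.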

Some files were not shown because too many files have changed in this diff.