Compare commits


45 Commits

Author SHA1 Message Date
Paige Rubendall
9de6c7350e adding stringio for security reasons 2022-09-12 11:14:08 -04:00
Naga Ravi Chaitanya Elluri
9f23699cfa Document node scenario actions for VMware
This commit also updates the IDs for the VMware scenarios to align them
with other cloud providers.
2022-09-07 11:34:14 -04:00
Sandro Bonazzola
fcc7145b98 post_action_regex: fix log message for list_namespace
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-07 16:48:58 +02:00
Sandro Bonazzola
bce5be9667 make post_action_regex importable
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-07 16:48:58 +02:00
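
Making a standalone script importable usually comes down to guarding its entry point; a minimal sketch of that pattern, assuming nothing about the actual contents of post_action_regex.py:

```python
# Hypothetical sketch only -- the real post_action_regex.py differs.
def run(expected_pods: int = 1) -> bool:
    """Post-chaos check; the real verification logic would live here."""
    return expected_pods >= 1


if __name__ == "__main__":
    # Executed only when run as a script, so `import post_action_regex`
    # (e.g. from a test) has no side effects.
    raise SystemExit(0 if run() else 1)
```
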
Sandro Bonazzola
0031912000 post_action_regex: avoid redefining variables from outer scope
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-07 16:48:58 +02:00
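
This addresses pylint's `redefined-outer-name` style of warning; an illustrative before/after, with names invented for the example rather than taken from the module:

```python
# Illustrative only; the variable names are not from the real module.
namespaces = ["default", "openshift-etcd"]


def count_matches(pattern: str) -> int:
    # Before: reusing `namespaces` as the loop variable would shadow the
    # module-level list. After: a distinct name keeps both scopes clear.
    return sum(1 for namespace in namespaces if pattern in namespace)
```
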
Sandro Bonazzola
1a1a9c9bfe pycodestyle fixes: scenarios/openshift/post_action_regex.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-07 16:48:58 +02:00
Sandro Bonazzola
ec807e3b3a pycodestyle fixes: vmware_plugin.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-05 14:15:38 +02:00
Sandro Bonazzola
b444854cb2 pycodestyle fixes: kraken/pvc/pvc_scenario.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-05 13:36:16 +02:00
Sandro Bonazzola
1dc58d8721 pycodestyle fixes: ingress_shaping.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-05 13:20:23 +02:00
Sandro Bonazzola
6112ba63c3 plugins/run_python_plugin.py: remove unused import
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-05 13:20:23 +02:00
Sandro Bonazzola
155269fd9d pycodestyle fixes: run_kraken.py
Besides plain style changes, this introduces the constants
`KUBE_BURNER_URL` and `KUBE_BURNER_VERSION`,
which solve the problem of an overly long string and, at the same time,
make it easier to bump the Kube Burner requirement.

Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-05 10:25:59 +02:00
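
A hedged sketch of what such constants typically look like; the version and URL below are placeholders, not values taken from the commit:

```python
# Placeholder values -- check run_kraken.py for the real ones.
KUBE_BURNER_VERSION = "0.9.1"
KUBE_BURNER_URL = (
    "https://github.com/cloud-bulldozer/kube-burner/releases/download/"
    f"v{KUBE_BURNER_VERSION}/kube-burner-{KUBE_BURNER_VERSION}-Linux-x86_64.tar.gz"
)
```
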
Sandro Bonazzola
79b92fc395 pycodestyle fixes: tests/test_ingress_network_plugin.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-05 08:47:55 +02:00
Sandro Bonazzola
ed1c486c85 pycodestyle fixes: tests/test_vmware_plugin.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 12:56:47 -04:00
Sandro Bonazzola
6ba1e1ad8b waive bandit report on insecure random usage
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 15:57:39 +02:00
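
Bandit's B311 check flags the standard `random` module; when the randomness is not security-sensitive (as in picking chaos targets), the report is commonly waived with a `# nosec` marker. Illustrative line, not the actual krkn code:

```python
import random

# Chaos target selection does not need cryptographic randomness.
target_node = random.choice(["worker-0", "worker-1", "worker-2"])  # nosec
```
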
Sandro Bonazzola
3b476b68f2 pycodestyle fixes: kraken/time_actions/common_time_functions.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 15:57:39 +02:00
Sandro Bonazzola
e17ebd0e7b pycodestyle fixes: kraken/shut_down/common_shut_down_func.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 15:44:42 +02:00
Sandro Bonazzola
d0d289fb7c update references to github organization
Updated references from chaos-kubox to redhat-chaos.

Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 14:38:25 +02:00
Sandro Bonazzola
66f88f5a78 pyflakes: fix imports for allowing analysis
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 14:23:11 +02:00
Sandro Bonazzola
abc635c699 server.py: change comment to pydoc
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 13:44:17 +02:00
Sandro Bonazzola
90b45538f2 pycodestyle fixes: kraken/cerberus/setup.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 06:32:46 -04:00
Sandro Bonazzola
c6469ef6cd pycodestyle: fix server.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 09:47:41 +02:00
Sandro Bonazzola
c94c2b22a9 pycodestyle fixes: kraken/zone_outage/actions.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-02 09:15:59 +02:00
Shreyas Anantha Ramaprasad
9421a0c2c2 Added support for ingress traffic shaping (#299)
* Added plugin for ingress network traffic shaping

* Documentation changes

* Minor changes

* Documentation and formatting fixes

* Added trap to sleep infinity command running in containers

* Removed shell injection threat for modprobe commands

* Added docstrings to cerberus functions

* Added checks to prevent shell injection

* Bug fix
2022-09-02 07:54:11 +02:00
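
The shell-injection hardening described in this commit usually means validating inputs and passing argument lists instead of interpolated shell strings; a minimal sketch under that assumption (not the code from the PR):

```python
import re
import subprocess


def load_kernel_module(module: str) -> None:
    """Illustrative, not krkn's code: run modprobe without a shell."""
    # Reject anything that is not a plain module name before it is executed.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", module):
        raise ValueError(f"unexpected module name: {module!r}")
    # An argument list with shell=False means the value is never parsed by a shell.
    subprocess.run(["modprobe", module], check=True)
```
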
Sandro Bonazzola
8a68e1cc9b pycodestyle fixes: kraken/kubernetes/client.py
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-01 12:32:52 -04:00
Shreyas Anantha Ramaprasad
d5615ac470 Fixing parts of issue #185 for PVC scenario (#290)
* Created new file for dataclasses and replaced kubectl pvc cli calls

* Added checks for existence of pod/pvc

* Modified command to get pvc capacity

Removed redundant function call
2022-09-01 15:44:37 +02:00
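
A hedged sketch of the general shape of such a change, replacing `kubectl get pvc` invocations with the Kubernetes Python client plus a dataclass; the field names here are assumptions, not the ones in the PR:

```python
from dataclasses import dataclass

from kubernetes import client, config


@dataclass
class PVCInfo:
    """Illustrative holder for PVC details returned by the API."""
    name: str
    namespace: str
    capacity: str
    volume_name: str


def list_pvcs(namespace: str) -> list:
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    pvcs = core_v1.list_namespaced_persistent_volume_claim(namespace)
    return [
        PVCInfo(
            name=p.metadata.name,
            namespace=p.metadata.namespace,
            capacity=(p.status.capacity or {}).get("storage", ""),
            volume_name=p.spec.volume_name or "",
        )
        for p in pvcs.items
    ]
```
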
Naga Ravi Chaitanya Elluri
5ab16baafa Bump release version 2022-08-25 16:45:47 -04:00
Naga Ravi Chaitanya Elluri
412d718985 Fix code alignment 2022-08-25 11:32:19 -04:00
Naga Ravi Chaitanya Elluri
11f469cb8e Update install sources to use the latest release 2022-08-24 15:34:42 -04:00
Naga Ravi Chaitanya Elluri
6c75d3dddb Add option to skip litmus installation
This commit adds an option for the user to choose whether or not to install
Litmus, depending on their use case. One use case is disconnected
environments where Litmus is pre-installed instead of being pulled from the
internet.
2022-08-23 14:09:10 -04:00
Paige Rubendall
f7e27a215e Move plugin tests (#289)
* moving pytests

* adding tests folder not under CI
2022-08-19 09:23:37 -04:00
Naga Ravi Chaitanya Elluri
e680592762 Create prometheus token to use for OCP versions >=4.11
This commit adopts the code from https://github.com/redhat-chaos/cerberus/pull/176
to support using an existing token or creating a new one for querying Prometheus,
depending on the OpenShift version in use.

Co-authored-by: Paige Rubendall <prubenda@redhat.com>
2022-08-16 08:07:28 -04:00
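
A hedged sketch of the version-dependent token handling the commit describes, using the `oc` CLI; the service account and namespace are the usual OpenShift monitoring defaults, and the exact mechanism in the PR may differ:

```python
import subprocess


def prometheus_token(ocp_minor: int) -> str:
    """Return a token for querying Prometheus (illustrative sketch only)."""
    if ocp_minor >= 11:
        # 4.11+ no longer auto-creates long-lived SA token secrets.
        cmd = ["oc", "create", "token", "prometheus-k8s", "-n", "openshift-monitoring"]
    else:
        cmd = ["oc", "sa", "get-token", "prometheus-k8s", "-n", "openshift-monitoring"]
    return subprocess.run(cmd, capture_output=True, check=True, text=True).stdout.strip()
```
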
Shreyas Anantha Ramaprasad
08deae63dd Added VMware Node Scenarios (#285)
* Added VMware node scenarios

* Made vmware plugin independent of Krkn

* Revert changes made to node status watch

* Fixed minor documentation changes
2022-08-15 23:35:16 +02:00
Sam Doran
f4bc30d2a1 Update README (#284)
* Update link to documentation

* Update container status badge and link

Use the correct link to the status badge on Quay.
2022-08-07 02:20:32 -04:00
Robert O'Brien
bbde837360 Refactor node status function 2022-08-03 16:51:49 +02:00
Robert O'Brien
5d789e7d30 Refactor client watch 2022-08-03 16:51:49 +02:00
Robert O'Brien
69fc8e8d1b Add resource version to list node call 2022-08-03 16:51:49 +02:00
Robert O'Brien
77f53b3a23 Rework node status to use watches 2022-08-03 16:51:49 +02:00
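
The watch-based approach replaces repeated `list_node` polling with a single streaming call; a minimal sketch using the Kubernetes Python client's `watch` helper, under the assumption that readiness is judged from the node's `Ready` condition:

```python
from kubernetes import client, config, watch


def wait_for_node_ready(node_name: str, timeout: int = 300) -> bool:
    """Illustrative sketch: block until the node reports Ready or the watch times out."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    nodes = v1.list_node(field_selector=f"metadata.name={node_name}")
    resource_version = nodes.metadata.resource_version  # start the watch from "now"
    w = watch.Watch()
    for event in w.stream(
        v1.list_node,
        field_selector=f"metadata.name={node_name}",
        resource_version=resource_version,
        timeout_seconds=timeout,
    ):
        for cond in event["object"].status.conditions or []:
            if cond.type == "Ready" and cond.status == "True":
                w.stop()
                return True
    return False
```
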
Janos Bonic
ccd902565e Fixes #265: Replace Powerfulseal and introduce Wolkenwalze SDK for plugin system 2022-08-02 16:25:03 +01:00
Naga Ravi Chaitanya Elluri
da117ad9d9 Switch to python3.9 2022-07-22 16:56:47 -04:00
Janos Bonic
ca7bc3f67b Removing cryptography pinning
Signed-off-by: Janos Bonic <86970079+janosdebugs@users.noreply.github.com>
2022-07-20 13:31:56 -04:00
Shreyas Anantha Ramaprasad
b01d9895fb Continue fixing small parts of issue #185 (#277)
* Added dataclasses to store info retrieved from k8 client calls

* Replaced few invoke commands in common_litmus

* Minor Documentation Changes

* Removed unused import and redundant variable

Signed-off-by: Shreyas Anantha Ramaprasad <ars.shreyas@gmail.com>
2022-07-19 14:57:17 +02:00
Naga Ravi Chaitanya Elluri
bbb66aa322 Fix source to install azure-cli
This commit updates the Krkn source Dockerfile to copy the Azure client binary
from the official azure-cli image instead of using the package manager, in
order to avoid dependency issues.
2022-07-18 16:21:29 -04:00
harshil-redhat
97d4f51f74 Fix installation docs with updated git repo (#270)
Signed-off-by: harshil-redhat <72143431+harshil-redhat@users.noreply.github.com>
2022-06-23 19:29:36 -04:00
Alejandro Gullón
4522ab77b1 Updating commands to get used PVC capacity and allocate file 2022-06-19 18:43:01 -04:00
STARTX
f4bfc08186 debug error message when network interface not found (#268)
A debug error occurred when a bad network interface list was given:

Traceback (most recent call last):
  File "/root/kraken/run_kraken.py", line 318, in <module>
    main(options.cfg)
  File "/root/kraken/run_kraken.py", line 239, in main
    network_chaos.run(scenarios_list, config, wait_duration)
  File "/root/kraken/kraken/network_chaos/actions.py", line 39, in run
    test_interface = verify_interface(test_interface, nodelst, pod_template)
  File "/root/kraken/kraken/network_chaos/actions.py", line 111, in verify_interface
    "Interface %s not found in node %s interface list %s" % (interface, nodelst[pod_index]),
TypeError: not enough arguments for format string

Signed-off-by: STARTX <clarue@startx.fr>
2022-06-14 18:33:59 -04:00
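
The fix for a `TypeError: not enough arguments for format string` is to supply a value for every `%s` placeholder (or hand interpolation off to `logging`); a minimal reproduction and fix:

```python
import logging

interface, node, interfaces = "ens5", "worker-0", ["eth0", "lo"]

# Before (raises TypeError: not enough arguments for format string --
# three %s placeholders, only two values):
#     "Interface %s not found in node %s interface list %s" % (interface, node)

# After: supply a value for every placeholder, or let logging interpolate lazily.
logging.error(
    "Interface %s not found in node %s interface list %s", interface, node, interfaces
)
```
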
88 changed files with 5049 additions and 1329 deletions

View File

@@ -12,14 +12,19 @@ jobs:
- name: Check out code
uses: actions/checkout@v3
- name: Create multi-node KinD cluster
uses: chaos-kubox/actions/kind@main
uses: redhat-chaos/actions/kind@main
- name: Install Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
architecture: 'x64'
- name: Install environment
run: |
sudo apt-get install build-essential python3-dev
pip install -r requirements.txt
- name: Run unit tests
run: python -m unittest discover
- name: Run e2e tests
run: python -m unittest discover -s tests
- name: Run CI
run: ./CI/run.sh
- name: Build the Docker images
run: docker build --no-cache -t quay.io/chaos-kubox/krkn containers/
@@ -34,7 +39,7 @@ jobs:
run: docker push quay.io/chaos-kubox/krkn
- name: Rebuild krkn-hub
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
uses: chaos-kubox/actions/krkn-hub@main
uses: redhat-chaos/actions/krkn-hub@main
with:
QUAY_USER: ${{ secrets.QUAY_USER_1 }}
QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}

View File

@@ -1,31 +1,6 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete hello pods"
steps:
- podAction:
matches:
- labels:
namespace: "default"
selector: "hello-openshift"
filters:
- randomSample:
size: 1
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "default"
selector: "hello-openshift"
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 1
# yaml-language-server: $schema=../../scenarios/plugin.schema.json
- id: kill-pods
config:
label_selector: name=hello-openshift
namespace_pattern: ^default$
kill: 1

View File

@@ -1,5 +1,5 @@
# Krkn aka Kraken
[![Docker Repository on Quay](https://quay.io/repository/chaos-kubox/krkn?tab=tags&tag=latest "Docker Repository on Quay")](https://quay.io/chaos-kubox/krkn)
[![Docker Repository on Quay](https://quay.io/repository/chaos-kubox/krkn/status "Docker Repository on Quay")](https://quay.io/repository/chaos-kubox/krkn?tab=tags&tag=latest)
![Krkn logo](media/logo.png)
@@ -23,7 +23,7 @@ Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check i
- Test environment recommendations as to how and where to run chaos tests.
- Chaos testing in practice.
The guide is hosted at [https://chaos-kubox.github.io/krkn/](https://chaos-kubox.github.io/krkn/).
The guide is hosted at https://redhat-chaos.github.io/krkn.
### How to Get Started
@@ -35,7 +35,7 @@ After installation, refer back to the below sections for supported scenarios and
#### Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to [Kraken-hub](https://github.com/chaos-kubox/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
For cases where you want to run Kraken with minimal configuration changes, refer to [Kraken-hub](https://github.com/redhat-chaos/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
### Setting up infrastructure dependencies
Kraken indexes the metrics specified in the profile into Elasticsearch in addition to leveraging Cerberus for understanding the health of the Kubernetes/OpenShift cluster under test. More information on the features is documented below. The infrastructure pieces can be easily installed and uninstalled by running:
@@ -74,7 +74,7 @@ Scenario type | Kubernetes | OpenShift
### Kraken scenario pass/fail criteria and report
It is important to make sure to check if the targeted component recovered from the chaos injection and also if the Kubernetes/OpenShift cluster is healthy as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/chaos-kubox/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/chaos-kubox/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/chaos-kubox/krkn/blob/main/config/cerberus.yaml).
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
- Leveraging [kube-burner](docs/alerts.md) alerting feature to fail the runs in case of critical alerts.
### Signaling
@@ -105,11 +105,10 @@ In addition to checking the recovery and health of the cluster and components un
### Roadmap
Following is a list of enhancements that we are planning to work on adding support in Kraken. Of course any help/contributions are greatly appreciated.
- [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/chaos-kubox/krkn/issues/124)
- Ability to shape the ingress network similar to how Kraken supports [egress traffic shaping](https://github.com/chaos-kubox/krkn/blob/main/docs/network_chaos.md) today.
- [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/redhat-chaos/krkn/issues/124)
- Continue to improve [Chaos Testing Guide](https://cloud-bulldozer.github.io/kraken/) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- Support for running Kraken on Kubernetes distribution - see https://github.com/chaos-kubox/krkn/issues/185, https://github.com/chaos-kubox/krkn/issues/186
- Sweet logo for Kraken - see https://github.com/chaos-kubox/krkn/issues/195
- Support for running Kraken on Kubernetes distribution - see https://github.com/redhat-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- Sweet logo for Kraken - see https://github.com/redhat-chaos/krkn/issues/195
### Contributions

View File

@@ -5,21 +5,23 @@ kraken:
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/container_etcd.yml
- pod_scenarios:
- - scenarios/openshift/etcd.yml
- - scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/post_action_regex.py
- plugin_scenarios:
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/vmware_node_scenarios.yml
- scenarios/openshift/network_chaos_ingress.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- pod_scenarios:
- - scenarios/openshift/openshift-apiserver.yml
- - scenarios/openshift/openshift-kube-apiserver.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- litmus_scenarios: # List of litmus scenarios to load

View File

@@ -5,14 +5,15 @@ kraken:
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/kube/container_dns.yml
- pod_scenarios:
- - scenarios/kube/scheduler.yml
- plugin_scenarios:
- scenarios/kube/scheduler.yml
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed

View File

@@ -9,15 +9,14 @@ kraken:
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- pod_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/etcd.yml
- - scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/post_action_regex.py
- plugin_scenarios: # List of chaos pod scenarios to load
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- pod_scenarios:
- - scenarios/openshift/openshift-apiserver.yml
- - scenarios/openshift/openshift-kube-apiserver.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- litmus_scenarios: # List of litmus scenarios to load

View File

@@ -2,6 +2,8 @@
FROM quay.io/openshift/origin-tests:latest as origintests
FROM mcr.microsoft.com/azure-cli:latest as azure-cli
FROM quay.io/centos/centos:stream9
LABEL org.opencontainers.image.authors="Red Hat OpenShift Chaos Engineering"
@@ -12,17 +14,18 @@ ENV KUBECONFIG /root/.kube/config
COPY --from=origintests /usr/bin/oc /usr/bin/oc
COPY --from=origintests /usr/bin/kubectl /usr/bin/kubectl
# Copy azure client binary from azure-cli image
COPY --from=azure-cli /usr/local/bin/az /usr/bin/az
# Install dependencies
RUN yum install epel-release -y && \
yum install -y git python python3-pip jq gettext && \
python3 -m pip install -U pip && \
rpm --import https://packages.microsoft.com/keys/microsoft.asc && \
echo -e "[azure-cli]\nname=Azure CLI\nbaseurl=https://packages.microsoft.com/yumrepos/azure-cli\nenabled=1\ngpgcheck=1\ngpgkey=https://packages.microsoft.com/keys/microsoft.asc" > /etc/yum.repos.d/azure-cli.repo && yum install -y azure-cli && \
git clone https://github.com/openshift-scale/kraken /root/kraken && \
yum install -y git python39 python3-pip jq gettext && \
python3.9 -m pip install -U pip && \
git clone https://github.com/redhat-chaos/krkn.git --branch v1.0.1 /root/kraken && \
mkdir -p /root/.kube && cd /root/kraken && \
pip3 install -r requirements.txt
pip3.9 install -r requirements.txt
WORKDIR /root/kraken
ENTRYPOINT ["python3", "run_kraken.py"]
ENTRYPOINT ["python3.9", "run_kraken.py"]
CMD ["--config=config/config.yaml"]

View File

@@ -15,7 +15,7 @@ RUN curl -L -o openshift-client-linux.tar.gz https://mirror.openshift.com/pub/op
# Install dependencies
RUN yum install epel-release -y && \
yum install -y git python36 python3-pip gcc libffi-devel python36-devel openssl-devel gcc-c++ make jq gettext && \
git clone https://github.com/cloud-bulldozer/kraken /root/kraken && \
git clone https://github.com/redhat-chaos/krkn.git --branch v1.0.1 /root/kraken && \
mkdir -p /root/.kube && cd /root/kraken && \
pip3 install cryptography==3.3.2 && \
pip3 install -r requirements.txt setuptools==40.3.0 urllib3==1.25.4

View File

@@ -3,17 +3,17 @@
Container image gets automatically built by quay.io at [Kraken image](https://quay.io/chaos-kubox/krkn).
### Run containerized version
Refer [instructions](https://github.com/chaos-kubox/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.
Refer [instructions](https://github.com/redhat-chaos/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.
### Run Custom Kraken Image
Refer to [instructions](https://github.com/chaos-kubox/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.
Refer to [instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.
### Kraken as a KubeApp
To run containerized Kraken as a Kubernetes/OpenShift Deployment, follow these steps:
1. Configure the [config.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) file according to your requirements.
1. Configure the [config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) file according to your requirements.
2. Create a namespace under which you want to run the kraken pod using `kubectl create ns <namespace>`.
3. Switch to `<namespace>` namespace:
- In Kubernetes, use `kubectl config set-context --current --namespace=<namespace>`

View File

@@ -18,7 +18,7 @@ spec:
privileged: true
image: quay.io/chaos-kubox/krkn
command: ["/bin/sh", "-c"]
args: ["python3 run_kraken.py -c config/config.yaml"]
args: ["python3.9 run_kraken.py -c config/config.yaml"]
volumeMounts:
- mountPath: "/root/.kube"
name: config

View File

@@ -1,6 +1,6 @@
## Alerts
Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports alerting based on the queries defined by the user and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) by setting the following:
Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports alerting based on the queries defined by the user and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:
```
performance_monitoring:
@@ -12,7 +12,7 @@ performance_monitoring:
```
### Alert profile
A couple of [alert profiles](https://github.com/chaos-kubox/krkn/tree/main/config) [alerts](https://github.com/chaos-kubox/krkn/blob/main/config/alerts) are shipped by default and can be tweaked to add more queries to alert on. The following are a few alerts examples:
A couple of [alert profiles](https://github.com/redhat-chaos/krkn/tree/main/config) [alerts](https://github.com/redhat-chaos/krkn/blob/main/config/alerts) are shipped by default and can be tweaked to add more queries to alert on. The following are a few alerts examples:
```
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01

View File

@@ -5,6 +5,7 @@ Supported Cloud Providers:
* [Openstack](#openstack)
* [Azure](#azure)
* [Alibaba](#alibaba)
* [VMware](#vmware)
## AWS
@@ -53,3 +54,15 @@ See the [Installation guide](https://www.alibabacloud.com/help/en/alibaba-cloud-
Refer to [region and zone page](https://www.alibabacloud.com/help/en/elastic-compute-service/latest/regions-and-zones#concept-2459516) to get the region id for the region you are running on.
Set cloud_type to either alibaba or alicloud in your node scenario yaml file.
## VMware
Set the following environment variables
1. ```export VSPHERE_IP=<vSphere_client_IP_address>```
2. ```export VSPHERE_USERNAME=<vSphere_client_username>```
3. ```export VSPHERE_PASSWORD=<vSphere_client_password>```
These are the credentials that you would normally use to access the vSphere client.
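
For reference, the scenario code can read these variables with `os.environ`; a minimal sketch (how the values are then passed to the vSphere client is not shown here):

```python
import os

vsphere_ip = os.environ["VSPHERE_IP"]          # vSphere client IP address
vsphere_username = os.environ["VSPHERE_USERNAME"]
vsphere_password = os.environ["VSPHERE_PASSWORD"]
```
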

View File

@@ -1,5 +1,5 @@
#### Kubernetes/OpenShift cluster shut down scenario
Scenario to shut down all the nodes including the masters and restart them after specified duration. Cluster shut down scenario can be injected by placing the shut_down config file under cluster_shut_down_scenario option in the kraken config. Refer to [cluster_shut_down_scenario](https://github.com/chaos-kubox/krkn/blob/main/scenarios/cluster_shut_down_scenario.yml) config file.
Scenario to shut down all the nodes including the masters and restart them after specified duration. Cluster shut down scenario can be injected by placing the shut_down config file under cluster_shut_down_scenario option in the kraken config. Refer to [cluster_shut_down_scenario](https://github.com/redhat-chaos/krkn/blob/main/scenarios/cluster_shut_down_scenario.yml) config file.
Refer to [cloud setup](cloud_setup.md) to configure your cli properly for the cloud provider of the cluster you want to shut down.

View File

@@ -1,4 +1,4 @@
### Config
Set the scenarios to inject and the tunings like duration to wait between each scenario in the config file located at [config/config.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml).
Set the scenarios to inject and the tunings like duration to wait between each scenario in the config file located at [config/config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml).
**NOTE**: [config](https://github.com/chaos-kubox/krkn/blob/main/config/config_performance.yaml) can be used if leveraging the [automated way](https://github.com/chaos-kubox/krkn#setting-up-infrastructure-dependencies) to install the infrastructure pieces.
**NOTE**: [config](https://github.com/redhat-chaos/krkn/blob/main/config/config_performance.yaml) can be used if leveraging the [automated way](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies) to install the infrastructure pieces.

View File

@@ -23,7 +23,7 @@ In all scenarios we do a post chaos check to wait and verify the specific compon
Here there are two options:
1. Pass a custom script in the main config scenario list that will run before the chaos and verify the output matches post chaos scenario.
See [scenarios/post_action_etcd_container.py](https://github.com/chaos-kubox/krkn/blob/main/scenarios/post_action_etcd_container.py) for an example.
See [scenarios/post_action_etcd_container.py](https://github.com/redhat-chaos/krkn/blob/main/scenarios/post_action_etcd_container.py) for an example.
```
- container_scenarios: # List of chaos pod scenarios to load.
- - scenarios/container_etcd.yml

View File

@@ -1,52 +1,26 @@
## Getting Started Running Chaos Scenarios
#### Adding New Scenarios
Adding a new scenario is as simple as adding a new config file under [scenarios directory](https://github.com/chaos-kubox/krkn/tree/main/scenarios) and defining it in the main kraken [config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml#L8).
Adding a new scenario is as simple as adding a new config file under [scenarios directory](https://github.com/redhat-chaos/krkn/tree/main/scenarios) and defining it in the main kraken [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml#L8).
You can either copy an existing yaml file and make it your own, or fill in one of the templates below to suit your needs.
### Templates
#### Pod Scenario Yaml Template
For example, for adding a pod level scenario for a new application, refer to the sample scenario below to know what fields are necessary and what to add in each location:
```
config:
runStrategy:
runs: <number of times to execute the scenario>
#This will choose a random number to wait between min and max
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete pods example"
steps:
- podAction:
matches:
- labels:
namespace: "<namespace>"
selector: "<pod label>" # This can be left blank.
filters:
- randomSample:
size: <number of pods to kill>
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "<namespace>"
selector: "<pod label>" # This can be left blank.
retries:
retriesTimeout:
# Amount of time to wait with retrying, before failing if pod count does not match expected
# timeout: 180.
actions:
- checkPodCount:
count: <expected number of pods that match namespace and label"
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^<namespace>$
label_selector: <pod label>
kill: <number of pods to kill>
- id: wait-for-pods
config:
namespace_pattern: ^<namespace>$
label_selector: <pod label>
count: <expected number of pods that match namespace and label>
```
More information on specific items that you can add to the pod killing scenarios can be found in the [powerfulseal policies](https://powerfulseal.github.io/powerfulseal/policies) documentation
#### Node Scenario Yaml Template
```

View File

@@ -90,18 +90,18 @@ We want to look at this in terms of CPU, Memory, Disk, Throughput, Network etc.
### Tooling
Now that we looked at the best practices, In this section, we will go through how [Kraken](https://github.com/chaos-kubox/krkn) - a chaos testing framework can help test the resilience of OpenShift and make sure the applications and services are following the best practices.
Now that we looked at the best practices, In this section, we will go through how [Kraken](https://github.com/redhat-chaos/krkn) - a chaos testing framework can help test the resilience of OpenShift and make sure the applications and services are following the best practices.
#### Workflow
Let us start by understanding the workflow of kraken: the user will start by running kraken by pointing to a specific OpenShift cluster using kubeconfig to be able to talk to the platform on top of which the OpenShift cluster is hosted. This can be done by either the oc/kubectl API or the cloud API. Based on the configuration of kraken, it will inject specific chaos scenarios as shown below, talk to [Cerberus](https://github.com/chaos-kubox/cerberus) to get the go/no-go signal representing the overall health of the cluster ( optional - can be turned off ), scrapes metrics from in-cluster prometheus given a metrics profile with the promql queries and stores them long term in Elasticsearch configured ( optional - can be turned off ), evaluates the promql expressions specified in the alerts profile ( optional - can be turned off ) and aggregated everything to set the pass/fail i.e. exits 0 or 1. More about the metrics collection, cerberus and metrics evaluation can be found in the next section.
Let us start by understanding the workflow of kraken: the user will start by running kraken by pointing to a specific OpenShift cluster using kubeconfig to be able to talk to the platform on top of which the OpenShift cluster is hosted. This can be done by either the oc/kubectl API or the cloud API. Based on the configuration of kraken, it will inject specific chaos scenarios as shown below, talk to [Cerberus](https://github.com/redhat-chaos/cerberus) to get the go/no-go signal representing the overall health of the cluster ( optional - can be turned off ), scrapes metrics from in-cluster prometheus given a metrics profile with the promql queries and stores them long term in Elasticsearch configured ( optional - can be turned off ), evaluates the promql expressions specified in the alerts profile ( optional - can be turned off ) and aggregated everything to set the pass/fail i.e. exits 0 or 1. More about the metrics collection, cerberus and metrics evaluation can be found in the next section.
![Kraken workflow](../media/kraken-workflow.png)
#### Cluster recovery checks, metrics evaluation and pass/fail criteria
- Most of the scenarios have built in checks to verify if the targeted component recovered from the failure after the specified duration of time but there might be cases where other components might have an impact because of a certain failure and its extremely important to make sure that the system/application is healthy as a whole post chaos. This is exactly where [Cerberus](https://github.com/chaos-kubox/cerberus) comes to the rescue.
- Most of the scenarios have built in checks to verify if the targeted component recovered from the failure after the specified duration of time but there might be cases where other components might have an impact because of a certain failure and its extremely important to make sure that the system/application is healthy as a whole post chaos. This is exactly where [Cerberus](https://github.com/redhat-chaos/cerberus) comes to the rescue.
If the monitoring tool, cerberus is enabled it will consume the signal and continue running chaos or not based on that signal.
- Apart from checking the recovery and cluster health status, its equally important to evaluate the performance metrics like latency, resource usage spikes, throughput, etcd health like disk fsync, leader elections etc. To help with this, Kraken has a way to evaluate promql expressions from the incluster prometheus and set the exit status to 0 or 1 based on the severity set for each of the query. Details on how to use this feature can be found [here](https://github.com/chaos-kubox/krkn#alerts).
- Apart from checking the recovery and cluster health status, its equally important to evaluate the performance metrics like latency, resource usage spikes, throughput, etcd health like disk fsync, leader elections etc. To help with this, Kraken has a way to evaluate promql expressions from the incluster prometheus and set the exit status to 0 or 1 based on the severity set for each of the query. Details on how to use this feature can be found [here](https://github.com/redhat-chaos/krkn#alerts).
- The overall pass or fail of kraken is based on the recovery of the specific component (within a certain amount of time), the cerberus health signal which tracks the health of the entire cluster and metrics evaluation from incluster prometheus.
@@ -112,17 +112,17 @@ If the monitoring tool, cerberus is enabled it will consume the signal and conti
Let us take a look at how to run the chaos scenarios on your OpenShift clusters using Kraken-hub - a lightweight wrapper around Kraken to ease the runs by providing the ability to run them by just running container images using podman with parameters set as environment variables. This eliminates the need to carry around and edit configuration files and makes it easy for any CI framework integration. Here are the scenarios supported:
- Pod Scenarios ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/pod-scenarios.md))
- Pod Scenarios ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/pod-scenarios.md))
- Disrupts OpenShift/Kubernetes and applications deployed as pods:
- Helps understand the availability of the application, the initialization timing and recovery status.
- [Demo](https://asciinema.org/a/452351?speed=3&theme=solarized-dark)
- Container Scenarios ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/container-scenarios.md))
- Container Scenarios ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/container-scenarios.md))
- Disrupts OpenShift/Kubernetes and applications deployed as containers running as part of a pod(s) using a specified kill signal to mimic failures:
- Helps understand the impact and recovery timing when the program/process running in the containers are disrupted - hangs, paused, killed etc., using various kill signals, i.e. SIGHUP, SIGTERM, SIGKILL etc.
- [Demo](https://asciinema.org/a/BXqs9JSGDSEKcydTIJ5LpPZBM?speed=3&theme=solarized-dark)
- Node Scenarios ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-scenarios.md))
- Node Scenarios ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-scenarios.md))
- Disrupts nodes as part of the cluster infrastructure by talking to the cloud API. AWS, Azure, GCP, OpenStack and Baremetal are the supported platforms as of now. Possible disruptions include:
- Terminate nodes
- Fork bomb inside the node
@@ -131,18 +131,18 @@ Let us take a look at how to run the chaos scenarios on your OpenShift clusters
- etc.
- [Demo](https://asciinema.org/a/ANZY7HhPdWTNaWt4xMFanF6Q5)
- Zone Outages ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/zone-outages.md))
- Zone Outages ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/zone-outages.md))
- Creates outage of availability zone(s) in a targeted region in the public cloud where the OpenShift cluster is running by tweaking the network acl of the zone to simulate the failure, and that in turn will stop both ingress and egress traffic from all nodes in a particular zone for the specified duration and reverts it back to the previous state.
- Helps understand the impact on both Kubernetes/OpenShift control plane as well as applications and services running on the worker nodes in that zone.
- Currently, only set up for AWS cloud platform: 1 VPC and multiples subnets within the VPC can be specified.
- [Demo](https://asciinema.org/a/452672?speed=3&theme=solarized-dark)
- Application Outages ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/application-outages.md))
- Application Outages ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/application-outages.md))
- Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during the downtime.
- Helps understand how the dependent services react to the unavailability.
- [Demo](https://asciinema.org/a/452403?speed=3&theme=solarized-dark)
- Power Outages ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/power-outages.md))
- Power Outages ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md))
- This scenario imitates a power outage by shutting down of the entire cluster for a specified duration of time, then restarts all the nodes after the specified time and checks the health of the cluster.
- There are various use cases in the customer environments. For example, when some of the clusters are shutdown in cases where the applications are not needed to run in a particular time/season in order to save costs.
- The nodes are stopped in parallel to mimic a power outage i.e., pulling off the plug
@@ -151,24 +151,24 @@ Let us take a look at how to run the chaos scenarios on your OpenShift clusters
- Resource Hog
- Hogs CPU, Memory and IO on the targeted nodes
- Helps understand if the application/system components have reserved resources to not get disrupted because of rogue applications, or get performance throttled.
- CPU Hog ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-cpu-hog.md), [Demo](https://asciinema.org/a/452762))
- Memory Hog ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-memory-hog.md), [Demo](https://asciinema.org/a/452742?speed=3&theme=solarized-dark))
- IO Hog ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-io-hog.md))
- CPU Hog ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-cpu-hog.md), [Demo](https://asciinema.org/a/452762))
- Memory Hog ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-memory-hog.md), [Demo](https://asciinema.org/a/452742?speed=3&theme=solarized-dark))
- IO Hog ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-io-hog.md))
- Time Skewing ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/time-scenarios.md))
- Time Skewing ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/time-scenarios.md))
- Manipulate the system time and/or date of specific pods/nodes.
- Verify scheduling of objects so they continue to work.
- Verify time gets reset properly.
- Namespace Failures ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/namespace-scenarios.md))
- Namespace Failures ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/namespace-scenarios.md))
- Delete namespaces for the specified duration.
- Helps understand the impact on other components and tests/improves recovery time of the components in the targeted namespace.
- Persistent Volume Fill ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/pvc-scenarios.md))
- Persistent Volume Fill ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/pvc-scenarios.md))
- Fills up the persistent volumes, up to a given percentage, used by the pod for the specified duration.
- Helps understand how an application deals when it is no longer able to write data to the disk. For example, kafkas behavior when it is not able to commit data to the disk.
- Network Chaos ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/network-chaos.md))
- Network Chaos ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/network-chaos.md))
- Scenarios supported includes:
- Network latency
- Packet loss

View File

@@ -9,28 +9,29 @@ The following ways are supported to run Kraken:
**NOTE**: It is recommended to run Kraken external to the cluster ( Standalone or Containerized ) hitting the Kubernetes/OpenShift API as running it internal to the cluster might be disruptive to itself and also might not report back the results if the chaos leads to cluster's API server instability.
**NOTE**: To run Kraken on Power (ppc64le) architecture, build and run a containerized version by following the
instructions given [here](https://github.com/chaos-kubox/krkn/blob/main/containers/build_own_image-README.md).
instructions given [here](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md).
### Git
#### Clone the repository
Pick the latest stable release to install [here](https://github.com/redhat-chaos/krkn/releases).
```
$ git clone https://github.com/openshift-scale/krkn.git
$ git clone https://github.com/redhat-chaos/krkn.git --branch <release version>
$ cd kraken
```
#### Install the dependencies
```
$ python3 -m venv chaos
$ python3.9 -m venv chaos
$ source chaos/bin/activate
$ pip3 install -r requirements.txt
$ pip3.9 install -r requirements.txt
```
**NOTE**: Make sure python3-devel and latest pip versions are installed on the system. The dependencies install has been tested with pip >= 21.1.3 versions.
#### Run
```
$ python3 run_kraken.py --config <config_file_location>
$ python3.9 run_kraken.py --config <config_file_location>
```
### Run containerized version
@@ -50,8 +51,8 @@ $ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config
$ podman logs -f kraken
```
If you want to build your own kraken image see [here](https://github.com/chaos-kubox/krkn/blob/main/containers/build_own_image-README.md)
If you want to build your own kraken image see [here](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md)
### Run Kraken as a Kubernetes deployment
Refer [Instructions](https://github.com/chaos-kubox/krkn/blob/main/containers/README.md) on how to deploy and run Kraken as a Kubernetes/OpenShift deployment.
Refer [Instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/README.md) on how to deploy and run Kraken as a Kubernetes/OpenShift deployment.

View File

@@ -36,6 +36,6 @@ The following are the start of scenarios for which a chaos scenario config exist
Scenario | Description | Working
------------------------ |-----------------------------------------------------------------------------------------| ------------------------- |
[Node CPU Hog](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_cpu_hog_engine.yaml) | Chaos scenario that hogs up the CPU on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node Memory Hog](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_mem_engine.yaml) | Chaos scenario that hogs up the memory on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node IO Hog](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_io_engine.yaml) | Chaos scenario that hogs up the IO on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node CPU Hog](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_cpu_hog_engine.yaml) | Chaos scenario that hogs up the CPU on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node Memory Hog](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_mem_engine.yaml) | Chaos scenario that hogs up the memory on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node IO Hog](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_io_engine.yaml) | Chaos scenario that hogs up the IO on a defined node for a specific amount of time. | :heavy_check_mark: |

View File

@@ -2,7 +2,7 @@
There are cases where the state of the cluster and metrics on the cluster during the chaos test run need to be stored long term to review after the cluster is terminated, for example CI and automation test runs. To help with this, Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes them into Elasticsearch. The indexed metrics can be visualized with the help of Grafana.
It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus ( installed by default in OpenShift ) with the start and end timestamp of the run. Each run has a unique identifier ( uuid ) and all the metrics/documents in Elasticsearch will be associated with it. The uuid is generated automatically if not set in the config. This feature can be enabled in the [config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) by setting the following:
It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus ( installed by default in OpenShift ) with the start and end timestamp of the run. Each run has a unique identifier ( uuid ) and all the metrics/documents in Elasticsearch will be associated with it. The uuid is generated automatically if not set in the config. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:
```
performance_monitoring:
@@ -16,7 +16,7 @@ performance_monitoring:
```
### Metrics profile
A couple of [metric profiles](https://github.com/chaos-kubox/krkn/tree/main/config), [metrics.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/metrics.yaml), and [metrics-aggregated.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/metrics-aggregated.yaml) are shipped by default and can be tweaked to add more metrics to capture during the run. The following are the API server metrics for example:
A couple of [metric profiles](https://github.com/redhat-chaos/krkn/tree/main/config), [metrics.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/metrics.yaml), and [metrics-aggregated.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/metrics-aggregated.yaml) are shipped by default and can be tweaked to add more metrics to capture during the run. The following are the API server metrics for example:
```
metrics:

View File

@@ -16,7 +16,7 @@ Set to '^.*$' and label_selector to "" to randomly select any namespace in your
**sleep:** Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set
Refer to [namespace_scenarios_example](https://github.com/chaos-kubox/krkn/blob/main/scenarios/regex_namespace.yaml) config file.
Refer to [namespace_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/regex_namespace.yaml) config file.
```
scenarios:

View File

@@ -1,7 +1,7 @@
### Network chaos
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node's host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
##### Sample scenario config
##### Sample scenario config for egress traffic shaping
```
network_chaos: # Scenario to create an outage by simulating random variations in the network.
duration: 300 # In seconds - duration network chaos will be applied.
@@ -17,6 +17,29 @@ network_chaos: # Scenario to create an outage
bandwidth: 100mbit
```
##### Sample scenario config for ingress traffic shaping (using a plugin)
```
- id: network_chaos
config:
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
ip-10-0-128-153.us-west-2.compute.internal:
- ens5
- genev_sys_6081
label_selector: node-role.kubernetes.io/master # When node_interface_name is not specified, nodes with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
kubeconfig_path: /root/.kube/config # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
execution_type: parallel # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
network_params:
latency: 50ms
loss: '0.02'
bandwidth: 100mbit
wait_duration: 120
test_duration: 60
```
Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
##### Steps
- Pick the nodes to introduce the network anomaly either from node_name or label_selector.
- Verify interface list in one of the nodes or use the interface with a default route, as test interface, if no interface is specified by the user.

View File

@@ -4,7 +4,7 @@ The following node chaos scenarios are supported:
1. **node_start_scenario**: Scenario to stop the node instance.
2. **node_stop_scenario**: Scenario to stop the node instance.
3. **node_stop_start_scenario**: Scenario to stop and then start the node instance.
3. **node_stop_start_scenario**: Scenario to stop and then start the node instance. Not supported on VMware.
4. **node_termination_scenario**: Scenario to terminate the node instance.
5. **node_reboot_scenario**: Scenario to reboot the node instance.
6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
@@ -12,13 +12,14 @@ The following node chaos scenarios are supported:
8. **node_crash_scenario**: Scenario to crash the node instance.
9. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
**NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
**NOTE**: node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario
, node_reboot_scenario and stop_start_kubelet_scenario are supported only on AWS, Azure, OpenStack, BareMetal, GCP
, and Alibaba as of now.
, VMware and Alibaba as of now.
**NOTE**: Node scenarios are supported only when running the standalone version of Kraken until https://github.com/chaos-kubox/krkn/issues/106 gets fixed.
**NOTE**: Node scenarios are supported only when running the standalone version of Kraken until https://github.com/redhat-chaos/krkn/issues/106 gets fixed.
#### AWS
@@ -64,13 +65,17 @@ How to set up Alibaba cli to run node scenarios is defined [here](cloud_setup.md
. Releasing a node is 2 steps, stopping the node and then releasing it.
#### VMware
How to set up VMware vSphere to run node scenarios is defined [here](cloud_setup.md#vmware)
#### General
**NOTE**: The `node_crash_scenario` and `stop_kubelet_scenario` scenario is supported independent of the cloud platform.
Use 'generic' or do not add the 'cloud_type' key to your scenario if your cluster is not set up using one of the current supported cloud types.
Node scenarios can be injected by placing the node scenarios config files under node_scenarios option in the kraken config. Refer to [node_scenarios_example](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_scenarios_example.yml) config file.
Node scenarios can be injected by placing the node scenarios config files under node_scenarios option in the kraken config. Refer to [node_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_scenarios_example.yml) config file.
```

View File

@@ -1,14 +1,40 @@
### Pod Scenarios
Kraken consumes [Powerfulseal](https://github.com/powerfulseal/powerfulseal) under the hood to run the pod scenarios.
These scenarios are in a simple yaml format that you can manipulate to run your specific tests or use the pre-existing scenarios to see how it works.
Krkn recently replaced PowerfulSeal with its own internal pod scenarios using a plugin system. You can run pod scenarios by adding the following config to Krkn:
```yaml
kraken:
chaos_scenarios:
- plugin_scenarios:
- path/to/scenario.yaml
```
You can then create the scenario file with the following contents:
```yaml
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
- id: wait-for-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
count: 3
```
Please adjust the schema reference to point to the [schema file](../scenarios/plugin.schema.json). This file will give you code completion and documentation for the available options in your IDE.
#### Pod Chaos Scenarios
The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.
Component | Description | Working
------------------------ |----------------------------------------------------------------------------------------------| ------------------------- |
[Etcd](https://github.com/chaos-kubox/krkn/blob/main/scenarios/etcd.yml) | Kills a single/multiple etcd replicas for the specified number of times in a loop. | :heavy_check_mark: |
[Kube ApiServer](https://github.com/chaos-kubox/krkn/blob/main/scenarios/openshift-kube-apiserver.yml) | Kills a single/multiple kube-apiserver replicas for the specified number of times in a loop. | :heavy_check_mark: |
[ApiServer](https://github.com/chaos-kubox/krkn/blob/main/scenarios/openshift-apiserver.yml) | Kills a single/multiple apiserver replicas for the specified number of times in a loop. | :heavy_check_mark: |
[Prometheus](https://github.com/chaos-kubox/krkn/blob/main/scenarios/prometheus.yml) | Kills a single/multiple prometheus replicas for the specified number of times in a loop. | :heavy_check_mark: |
[OpenShift System Pods](https://github.com/chaos-kubox/krkn/blob/main/scenarios/regex_openshift_pod_kill.yml) | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |
| Component | Description | Working |
| ------------------------ |-------------| -------- |
| [Basic pod scenario](../scenarios/kube/pod.yml) | Kill a pod. | :heavy_check_mark: |
| [Etcd](../scenarios/openshift/etcd.yml) | Kills a single/multiple etcd replicas. | :heavy_check_mark: |
| [Kube ApiServer](../scenarios/openshift/openshift-kube-apiserver.yml)| Kills a single/multiple kube-apiserver replicas. | :heavy_check_mark: |
| [ApiServer](../scenarios/openshift/openshift-apiserver.yml) | Kills a single/multiple apiserver replicas. | :heavy_check_mark: |
| [Prometheus](../scenarios/openshift/prometheus.yml) | Kills a single/multiple prometheus replicas. | :heavy_check_mark: |
| [OpenShift System Pods](../scenarios/openshift/regex_openshift_pod_kill.yml) | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |


@@ -16,7 +16,7 @@ Configuration Options:
**object_name:** List of the names of pods or nodes you want to skew.
Refer to [time_scenarios_example](https://github.com/chaos-kubox/krkn/blob/main/scenarios/time_scenarios_example.yml) config file.
Refer to [time_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/time_scenarios_example.yml) config file.
```
time_scenarios:
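  # A minimal sketch of time scenario entries; only object_name is documented
  # above, while action, object_type and label_selector are assumed from the
  # referenced time_scenarios_example file and may differ.
  - action: skew_time
    object_type: pod
    label_selector: k8s-app=kube-scheduler
  - action: skew_date
    object_type: node
    object_name:
      - <node-name>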


@@ -1,5 +1,5 @@
### Zone outage scenario
Scenario to create outage in a targeted zone in the public cloud to understand the impact on both Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone. It tweaks the network acl of the zone to simulate the failure and that in turn will stop both ingress and egress traffic from all the nodes in a particular zone for the specified duration and reverts it back to the previous state. Zone outage can be injected by placing the zone_outage config file under zone_outages option in the [kraken config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml). Refer to [zone_outage_scenario](https://github.com/chaos-kubox/krkn/blob/main/scenarios/zone_outage.yaml) config file for the parameters that need to be defined.
Scenario to create an outage in a targeted zone in the public cloud to understand the impact on both the Kubernetes/OpenShift control plane and the applications running on the worker nodes in that zone. It tweaks the network ACL of the zone to simulate the failure, which stops both ingress and egress traffic from all the nodes in that zone for the specified duration, and then reverts the ACL to its previous state. Zone outage can be injected by placing the zone_outage config file under the zone_outages option in the [kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml). Refer to the [zone_outage_scenario](https://github.com/redhat-chaos/krkn/blob/main/scenarios/zone_outage.yaml) config file for the parameters that need to be defined.
Refer to [cloud setup](cloud_setup.md) to configure your cli properly for the cloud provider of the cluster you want to shut down.
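For illustration, a zone outage config might look like the sketch below; the keys (`cloud_type`, `duration`, `vpc_id`, `subnet_id`) are assumed from the referenced zone_outage.yaml example, which currently targets AWS, so check that file for the authoritative parameter list:
```yaml
zone_outage:                      # top-level key assumed from the example file
  cloud_type: aws                 # only AWS is illustrated here
  duration: 600                   # seconds for which the zone stays cut off
  vpc_id: <vpc-id>                # VPC containing the target zone
  subnet_id: [<subnet-id>]        # subnet(s) whose network ACL gets swapped out
```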

kraken/cerberus/setup.py Normal file (136 lines)

@@ -0,0 +1,136 @@
import logging
import requests
import sys
import json
def get_status(config, start_time, end_time):
"""
Get cerberus status
"""
cerberus_status = True
check_application_routes = False
application_routes_status = True
if config["cerberus"]["cerberus_enabled"]:
cerberus_url = config["cerberus"]["cerberus_url"]
check_application_routes = \
config["cerberus"]["check_applicaton_routes"]
if not cerberus_url:
logging.error(
"url where Cerberus publishes True/False signal "
"is not provided."
)
sys.exit(1)
cerberus_status = requests.get(cerberus_url, timeout=60).content
cerberus_status = True if cerberus_status == b"True" else False
# Fail if the application routes monitored by cerberus
# experience downtime during the chaos
if check_application_routes:
application_routes_status, unavailable_routes = application_status(
cerberus_url,
start_time,
end_time
)
if not application_routes_status:
logging.error(
"Application routes: %s monitored by cerberus "
"encountered downtime during the run, failing"
% unavailable_routes
)
else:
logging.info(
"Application routes being monitored "
"didn't encounter any downtime during the run!"
)
if not cerberus_status:
logging.error(
"Received a no-go signal from Cerberus, looks like "
"the cluster is unhealthy. Please check the Cerberus "
"report for more details. Test failed."
)
if not application_routes_status or not cerberus_status:
sys.exit(1)
else:
logging.info(
"Received a go signal from Ceberus, the cluster is healthy. "
"Test passed."
)
return cerberus_status
def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
"""
Publish kraken status to cerberus
"""
cerberus_status = get_status(config, start_time, end_time)
if not cerberus_status:
if failed_post_scenarios:
if config["kraken"]["exit_on_failure"]:
logging.info(
"Cerberus status is not healthy and post action scenarios "
"are still failing, exiting kraken run"
)
sys.exit(1)
else:
logging.info(
"Cerberus status is not healthy and post action scenarios "
"are still failing"
)
else:
if failed_post_scenarios:
if config["kraken"]["exit_on_failure"]:
logging.info(
"Cerberus status is healthy but post action scenarios "
"are still failing, exiting kraken run"
)
sys.exit(1)
else:
logging.info(
"Cerberus status is healthy but post action scenarios "
"are still failing"
)
def application_status(cerberus_url, start_time, end_time):
"""
Check application availability
"""
if not cerberus_url:
logging.error(
"url where Cerberus publishes True/False signal is not provided."
)
sys.exit(1)
else:
duration = (end_time - start_time) / 60
url = "{baseurl}/history?loopback={duration}".format(
baseurl=cerberus_url,
duration=str(duration)
)
logging.info(
"Scraping the metrics for the test "
"duration from cerberus url: %s" % url
)
try:
failed_routes = []
status = True
metrics = requests.get(url, timeout=60).content
metrics_json = json.loads(metrics)
for entry in metrics_json["history"]["failures"]:
if entry["component"] == "route":
name = entry["name"]
failed_routes.append(name)
status = False
else:
continue
except Exception as e:
logging.error(
"Failed to scrape metrics from cerberus API at %s: %s" % (
url,
e
)
)
sys.exit(1)
return status, set(failed_routes)
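For reference, the `cerberus` section of the Kraken config that this module reads might look like the sketch below; the key names are taken directly from the lookups above (including the `check_applicaton_routes` spelling), while the values are placeholders:
```yaml
cerberus:
    cerberus_enabled: True                 # poll the Cerberus signal during the run
    cerberus_url: http://0.0.0.0:8080      # placeholder URL where Cerberus publishes True/False
    check_applicaton_routes: False         # fail the run if monitored routes saw downtime
```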


@@ -1,6 +0,0 @@
from dataclasses import dataclass
@dataclass
class CerberusConfig:
cerberus_url: str


@@ -1,13 +0,0 @@
import requests as requests
from kraken.health.cerberus.config import CerberusConfig
from kraken.health.health import HealthChecker, HealthCheckDecision
class CerberusHealthChecker(HealthChecker):
def __init__(self, config: CerberusConfig):
self._config = config
def check(self) -> HealthCheckDecision:
cerberus_status = requests.get(self._config.cerberus_url, timeout=60).content
return HealthCheckDecision.GO if cerberus_status == b"True" else HealthCheckDecision.STOP


@@ -1,14 +0,0 @@
from abc import ABC, abstractmethod
from enum import Enum
class HealthCheckDecision(Enum):
GO = "GO"
PAUSE = "PAUSE"
STOP = "STOP"
class HealthChecker(ABC):
@abstractmethod
def check(self) -> HealthCheckDecision:
pass


@@ -1,12 +1,17 @@
from kubernetes import client, config
from kubernetes.stream import stream
from kubernetes.client.rest import ApiException
import logging
import kraken.invoke.command as runcommand
import sys
import re
import sys
import time
from kubernetes import client, config, utils, watch
from kubernetes.client.rest import ApiException
from kubernetes.dynamic.client import DynamicClient
from kubernetes.stream import stream
from ..kubernetes.resources import (PVC, ChaosEngine, ChaosResult, Container,
LitmusChaosObject, Pod, Volume,
VolumeMount)
kraken_node_name = ""
@@ -14,10 +19,19 @@ kraken_node_name = ""
def initialize_clients(kubeconfig_path):
global cli
global batch_cli
global watch_resource
global api_client
global dyn_client
global custom_object_client
try:
config.load_kube_config(kubeconfig_path)
cli = client.CoreV1Api()
batch_cli = client.BatchV1Api()
watch_resource = watch.Watch()
api_client = client.ApiClient()
custom_object_client = client.CustomObjectsApi()
k8s_client = config.new_client_from_config()
dyn_client = DynamicClient(k8s_client)
except ApiException as e:
logging.error("Failed to initialize kubernetes client: %s\n" % e)
sys.exit(1)
@@ -29,10 +43,12 @@ def get_host() -> str:
def get_clusterversion_string() -> str:
"""Returns clusterversion status text on OpenShift, empty string on other distributions"""
"""
Returns clusterversion status text on OpenShift, empty string
on other distributions
"""
try:
custom_objects_api = client.CustomObjectsApi()
cvs = custom_objects_api.list_cluster_custom_object(
cvs = custom_object_client.list_cluster_custom_object(
"config.openshift.io",
"v1",
"clusterversions",
@@ -54,11 +70,16 @@ def list_namespaces(label_selector=None):
namespaces = []
try:
if label_selector:
ret = cli.list_namespace(pretty=True, label_selector=label_selector)
ret = cli.list_namespace(
pretty=True,
label_selector=label_selector
)
else:
ret = cli.list_namespace(pretty=True)
except ApiException as e:
logging.error("Exception when calling CoreV1Api->list_namespaced_pod: %s\n" % e)
logging.error(
"Exception when calling CoreV1Api->list_namespaced_pod: %s\n" % e
)
raise e
for namespace in ret.items:
namespaces.append(namespace.metadata.name)
@@ -71,7 +92,9 @@ def get_namespace_status(namespace_name):
try:
ret = cli.read_namespace_status(namespace_name)
except ApiException as e:
logging.error("Exception when calling CoreV1Api->read_namespace_status: %s\n" % e)
logging.error(
"Exception when calling CoreV1Api->read_namespace_status: %s\n" % e
)
return ret.status.phase
@@ -79,7 +102,9 @@ def delete_namespace(namespace):
"""Deletes a given namespace using kubernetes python client"""
try:
api_response = cli.delete_namespace(namespace)
logging.debug("Namespace deleted. status='%s'" % str(api_response.status))
logging.debug(
"Namespace deleted. status='%s'" % str(api_response.status)
)
return api_response
except Exception as e:
logging.error(
@@ -105,7 +130,10 @@ def check_namespaces(namespaces, label_selectors=None):
break
invalid_namespaces = regex_namespaces - valid_regex
if invalid_namespaces:
raise Exception("There exists no namespaces matching: %s" % (invalid_namespaces))
raise Exception(
"There exists no namespaces matching: %s" %
(invalid_namespaces)
)
return list(final_namespaces)
except Exception as e:
logging.info("%s" % (e))
@@ -152,7 +180,11 @@ def list_pods(namespace, label_selector=None):
pods = []
try:
if label_selector:
ret = cli.list_namespaced_pod(namespace, pretty=True, label_selector=label_selector)
ret = cli.list_namespaced_pod(
namespace,
pretty=True,
label_selector=label_selector
)
else:
ret = cli.list_namespaced_pod(namespace, pretty=True)
except ApiException as e:
@@ -170,7 +202,10 @@ def list_pods(namespace, label_selector=None):
def get_all_pods(label_selector=None):
pods = []
if label_selector:
ret = cli.list_pod_for_all_namespaces(pretty=True, label_selector=label_selector)
ret = cli.list_pod_for_all_namespaces(
pretty=True,
label_selector=label_selector
)
else:
ret = cli.list_pod_for_all_namespaces(pretty=True)
for pod in ret.items:
@@ -179,7 +214,13 @@ def get_all_pods(label_selector=None):
# Execute command in pod
def exec_cmd_in_pod(command, pod_name, namespace, container=None, base_command="bash"):
def exec_cmd_in_pod(
command,
pod_name,
namespace,
container=None,
base_command="bash"
):
exec_command = [base_command, "-c", command]
try:
@@ -230,7 +271,10 @@ def create_pod(body, namespace, timeout=120):
pod_stat = cli.create_namespaced_pod(body=body, namespace=namespace)
end_time = time.time() + timeout
while True:
pod_stat = cli.read_namespaced_pod(name=body["metadata"]["name"], namespace=namespace)
pod_stat = cli.read_namespaced_pod(
name=body["metadata"]["name"],
namespace=namespace
)
if pod_stat.status.phase == "Running":
break
if time.time() > end_time:
@@ -250,7 +294,10 @@ def read_pod(name, namespace="default"):
def get_pod_log(name, namespace="default"):
return cli.read_namespaced_pod_log(
name=name, namespace=namespace, _return_http_data_only=True, _preload_content=False
name=name,
namespace=namespace,
_return_http_data_only=True,
_preload_content=False
)
@@ -268,7 +315,10 @@ def delete_job(name, namespace="default"):
api_response = batch_cli.delete_namespaced_job(
name=name,
namespace=namespace,
body=client.V1DeleteOptions(propagation_policy="Foreground", grace_period_seconds=0),
body=client.V1DeleteOptions(
propagation_policy="Foreground",
grace_period_seconds=0
),
)
logging.debug("Job deleted. status='%s'" % str(api_response.status))
return api_response
@@ -290,7 +340,10 @@ def delete_job(name, namespace="default"):
def create_job(body, namespace="default"):
try:
api_response = batch_cli.create_namespaced_job(body=body, namespace=namespace)
api_response = batch_cli.create_namespaced_job(
body=body,
namespace=namespace
)
return api_response
except ApiException as api:
logging.warn(
@@ -311,7 +364,10 @@ def create_job(body, namespace="default"):
def get_job_status(name, namespace="default"):
try:
return batch_cli.read_namespaced_job_status(name=name, namespace=namespace)
return batch_cli.read_namespaced_job_status(
name=name,
namespace=namespace
)
except Exception as e:
logging.error(
"Exception when calling \
@@ -321,22 +377,6 @@ def get_job_status(name, namespace="default"):
raise
# Obtain node status
def get_node_status(node, timeout=60):
try:
node_info = cli.read_node_status(node, pretty=True, _request_timeout=timeout)
except ApiException as e:
logging.error(
"Exception when calling \
CoreV1Api->read_node_status: %s\n"
% e
)
return None
for condition in node_info.status.conditions:
if condition.type == "Ready":
return condition.status
# Monitor the status of the cluster nodes and set the status to true or false
def monitor_nodes():
nodes = list_nodes()
@@ -375,7 +415,11 @@ def monitor_namespace(namespace):
notready_pods = []
for pod in pods:
try:
pod_info = cli.read_namespaced_pod_status(pod, namespace, pretty=True)
pod_info = cli.read_namespaced_pod_status(
pod,
namespace,
pretty=True
)
except ApiException as e:
logging.error(
"Exception when calling \
@@ -384,7 +428,11 @@ def monitor_namespace(namespace):
)
raise e
pod_status = pod_info.status.phase
if pod_status != "Running" and pod_status != "Completed" and pod_status != "Succeeded":
if (
pod_status != "Running" and
pod_status != "Completed" and
pod_status != "Succeeded"
):
notready_pods.append(pod)
if len(notready_pods) != 0:
status = False
@@ -395,11 +443,328 @@ def monitor_namespace(namespace):
# Monitor component namespace
def monitor_component(iteration, component_namespace):
watch_component_status, failed_component_pods = monitor_namespace(component_namespace)
logging.info("Iteration %s: %s: %s" % (iteration, component_namespace, watch_component_status))
watch_component_status, failed_component_pods = \
monitor_namespace(component_namespace)
logging.info(
"Iteration %s: %s: %s" % (
iteration,
component_namespace,
watch_component_status
)
)
return watch_component_status, failed_component_pods
def apply_yaml(path, namespace='default'):
"""
Apply yaml config to create Kubernetes resources
Args:
path (string)
- Path to the YAML file
namespace (string)
- Namespace to create the resource
Returns:
The object created
"""
return utils.create_from_yaml(
api_client,
yaml_file=path,
namespace=namespace
)
def get_pod_info(name: str, namespace: str = 'default') -> Pod:
"""
Function to retrieve information about a specific pod
in a given namespace. The kubectl command is given by:
kubectl get pods <name> -n <namespace>
Args:
name (string)
- Name of the pod
namespace (string)
- Namespace to look for the pod
Returns:
- Data class object of type Pod with the output of the above
kubectl command in the given format if the pod exists
- Returns None if the pod doesn't exist
"""
pod_exists = check_if_pod_exists(name=name, namespace=namespace)
if pod_exists:
response = cli.read_namespaced_pod(
name=name,
namespace=namespace,
pretty='true'
)
container_list = []
# Create a list of containers present in the pod
for container in response.spec.containers:
volume_mount_list = []
for volume_mount in container.volume_mounts:
volume_mount_list.append(
VolumeMount(
name=volume_mount.name,
mountPath=volume_mount.mount_path
)
)
container_list.append(
Container(
name=container.name,
image=container.image,
volumeMounts=volume_mount_list
)
)
for i, container in enumerate(response.status.container_statuses):
container_list[i].ready = container.ready
# Create a list of volumes associated with the pod
volume_list = []
for volume in response.spec.volumes:
volume_name = volume.name
pvc_name = (
volume.persistent_volume_claim.claim_name
if volume.persistent_volume_claim is not None
else None
)
volume_list.append(Volume(name=volume_name, pvcName=pvc_name))
# Create the Pod data class object
pod_info = Pod(
name=response.metadata.name,
podIP=response.status.pod_ip,
namespace=response.metadata.namespace,
containers=container_list,
nodeName=response.spec.node_name,
volumes=volume_list
)
return pod_info
else:
logging.error(
"Pod '%s' doesn't exist in namespace '%s'" % (
str(name),
str(namespace)
)
)
return None
def get_litmus_chaos_object(
kind: str,
name: str,
namespace: str
) -> LitmusChaosObject:
"""
Function that returns an object of a custom resource type of
the litmus project. Currently, only ChaosEngine and ChaosResult
objects are supported.
Args:
kind (string)
- The custom resource type
namespace (string)
- Namespace where the custom object is present
Returns:
Data class object of a subclass of LitmusChaosObject
"""
group = 'litmuschaos.io'
version = 'v1alpha1'
if kind.lower() == 'chaosengine':
plural = 'chaosengines'
response = custom_object_client.get_namespaced_custom_object(
group=group,
plural=plural,
version=version,
namespace=namespace,
name=name
)
try:
engine_status = response['status']['engineStatus']
exp_status = response['status']['experiments'][0]['status']
except Exception:
engine_status = 'Not Initialized'
exp_status = 'Not Initialized'
custom_object = ChaosEngine(
kind='ChaosEngine',
group=group,
namespace=namespace,
name=name,
plural=plural,
version=version,
engineStatus=engine_status,
expStatus=exp_status
)
elif kind.lower() == 'chaosresult':
plural = 'chaosresults'
response = custom_object_client.get_namespaced_custom_object(
group=group,
plural=plural,
version=version,
namespace=namespace,
name=name
)
try:
verdict = response['status']['experimentStatus']['verdict']
fail_step = response['status']['experimentStatus']['failStep']
except Exception:
verdict = 'N/A'
fail_step = 'N/A'
custom_object = ChaosResult(
kind='ChaosResult',
group=group,
namespace=namespace,
name=name,
plural=plural,
version=version,
verdict=verdict,
failStep=fail_step
)
else:
logging.error("Invalid litmus chaos custom resource name")
custom_object = None
return custom_object
def check_if_namespace_exists(name: str) -> bool:
"""
Function that checks if a namespace exists by parsing through
the list of projects.
Args:
name (string)
- Namespace name
Returns:
Boolean value indicating whether the namespace exists or not
"""
v1_projects = dyn_client.resources.get(
api_version='project.openshift.io/v1',
kind='Project'
)
project_list = v1_projects.get()
return True if name in str(project_list) else False
def check_if_pod_exists(name: str, namespace: str) -> bool:
"""
Function that checks if a pod exists in the given namespace
Args:
name (string)
- Pod name
namespace (string)
- Namespace name
Returns:
Boolean value indicating whether the pod exists or not
"""
namespace_exists = check_if_namespace_exists(namespace)
if namespace_exists:
pod_list = list_pods(namespace=namespace)
if name in pod_list:
return True
else:
logging.error("Namespace '%s' doesn't exist" % str(namespace))
return False
def check_if_pvc_exists(name: str, namespace: str) -> bool:
"""
Function that checks if a Persistent Volume Claim exists
in the given namespace.
Args:
name (string)
- PVC name
namespace (string)
- Namespace name
Returns:
Boolean value indicating whether the Persistent Volume Claim
exists or not.
"""
namespace_exists = check_if_namespace_exists(namespace)
if namespace_exists:
response = cli.list_namespaced_persistent_volume_claim(
namespace=namespace
)
pvc_list = [pvc.metadata.name for pvc in response.items]
if name in pvc_list:
return True
else:
logging.error("Namespace '%s' doesn't exist" % str(namespace))
return False
def get_pvc_info(name: str, namespace: str) -> PVC:
"""
Function to retrieve information about a Persistent Volume Claim in a
given namespace
Args:
name (string)
- Name of the persistent volume claim
namespace (string)
- Namespace where the persistent volume claim is present
Returns:
- A PVC data class containing the name, capacity, volume name,
namespace and associated pod names of the PVC if the PVC exists
- Returns None if the PVC doesn't exist
"""
pvc_exists = check_if_pvc_exists(name=name, namespace=namespace)
if pvc_exists:
pvc_info_response = cli.read_namespaced_persistent_volume_claim(
name=name,
namespace=namespace,
pretty=True
)
pod_list_response = cli.list_namespaced_pod(namespace=namespace)
capacity = pvc_info_response.status.capacity['storage']
volume_name = pvc_info_response.spec.volume_name
# Loop through all pods in the namespace to find associated PVCs
pvc_pod_list = []
for pod in pod_list_response.items:
for volume in pod.spec.volumes:
if (
volume.persistent_volume_claim is not None
and volume.persistent_volume_claim.claim_name == name
):
pvc_pod_list.append(pod.metadata.name)
pvc_info = PVC(
name=name,
capacity=capacity,
volumeName=volume_name,
podNames=pvc_pod_list,
namespace=namespace
)
return pvc_info
else:
logging.error(
"PVC '%s' doesn't exist in namespace '%s'" % (
str(name),
str(namespace)
)
)
return None
# Find the node kraken is deployed on
# Set global kraken node to not delete
def find_kraken_node():
@@ -415,16 +780,40 @@ def find_kraken_node():
if kraken_pod_name:
# get kraken-deployment pod, find node name
try:
node_name = runcommand.invoke(
"kubectl get pods/"
+ str(kraken_pod_name)
+ ' -o jsonpath="{.spec.nodeName}"'
+ " -n"
+ str(kraken_project)
)
node_name = get_pod_info(kraken_pod_name, kraken_project).nodeName
global kraken_node_name
kraken_node_name = node_name
except Exception as e:
logging.info("%s" % (e))
sys.exit(1)
# Watch for a specific node status
def watch_node_status(node, status, timeout, resource_version):
count = timeout
for event in watch_resource.stream(
cli.list_node,
field_selector=f"metadata.name={node}",
timeout_seconds=timeout,
resource_version=f"{resource_version}"
):
conditions = [
status
for status in event["object"].status.conditions
if status.type == "Ready"
]
if conditions[0].status == status:
watch_resource.stop()
break
else:
count -= 1
logging.info(
"Status of node " + node + ": " + str(conditions[0].status)
)
if not count:
watch_resource.stop()
# Get the resource version for the specified node
def get_node_resource_version(node):
return cli.read_node(name=node).metadata.resource_version


@@ -1,125 +0,0 @@
import unittest
from dataclasses import dataclass
from typing import Dict, List
from kubernetes import config, client
from kubernetes.client.models import V1Pod, V1PodSpec, V1ObjectMeta, V1Container
from kubernetes.client.exceptions import ApiException
@dataclass
class Pod:
"""
A pod is a simplified representation of a Kubernetes pod. We only extract the data we need in krkn.
"""
name: str
namespace: str
labels: Dict[str, str]
class Client:
"""
This is the implementation of all Kubernetes API calls used in Krkn.
"""
def __init__(self, kubeconfig_path: str = None):
# Note: this function replicates much of the functionality already represented in the Kubernetes Python client,
# but in an object-oriented manner. This allows for creating multiple clients and accessing multiple clusters
# with minimal effort if needed, which the procedural implementation doesn't allow.
if kubeconfig_path is None:
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
if kubeconfig.config is None:
raise config.ConfigException(
'Invalid kube-config file: %s. '
'No configuration found.' % kubeconfig_path)
loader = config.kube_config.KubeConfigLoader(
config_dict=kubeconfig.config,
)
client_config = client.Configuration()
loader.load_and_set(client_config)
self.client = client.ApiClient(configuration=client_config)
self.core_v1 = client.CoreV1Api(self.client)
@staticmethod
def _convert_pod(pod: V1Pod) -> Pod:
return Pod(
name=pod.metadata.name,
namespace=pod.metadata.namespace,
labels=pod.metadata.labels
)
def create_test_pod(self) -> Pod:
"""
create_test_pod creates a test pod in the default namespace that can be safely killed.
"""
return self._convert_pod(self.core_v1.create_namespaced_pod(
"default",
V1Pod(
metadata=V1ObjectMeta(
generate_name="test-",
),
spec=V1PodSpec(
containers=[
V1Container(
name="test",
image="alpine",
tty=True,
)
]
),
)
))
def list_all_pods(self, label_selector: str = None) -> List[Pod]:
"""
list_all_pods lists all pods in all namespaces, possibly with a label selector applied.
"""
try:
pod_response = self.core_v1.list_pod_for_all_namespaces(watch=False, label_selector=label_selector)
pod_list: List[client.models.V1Pod] = pod_response.items
result: List[Pod] = []
for pod in pod_list:
result.append(self._convert_pod(pod))
return result
except ApiException as e:
if e.status == 404:
raise NotFoundException(e)
raise
def get_pod(self, name: str, namespace: str = "default") -> Pod:
"""
get_pod returns a pod based on the name and a namespace.
"""
try:
return self._convert_pod(self.core_v1.read_namespaced_pod(name, namespace))
except ApiException as e:
if e.status == 404:
raise NotFoundException(e)
raise
def remove_pod(self, name: str, namespace: str = "default"):
"""
remove_pod removes a pod based on the name and namespace. A NotFoundException is raised if the pod doesn't
exist.
"""
try:
self.core_v1.delete_namespaced_pod(name, namespace)
except ApiException as e:
if e.status == 404:
raise NotFoundException(e)
raise
class NotFoundException(Exception):
"""
NotFoundException is an exception specific to the scenario Kubernetes abstraction and is thrown when a specific
resource (e.g. a pod) cannot be found.
"""
def __init__(self, cause: Exception):
self.__cause__ = cause
if __name__ == '__main__':
unittest.main()


@@ -0,0 +1,74 @@
from dataclasses import dataclass
from typing import List
@dataclass(frozen=True, order=False)
class Volume:
"""Data class to hold information regarding volumes in a pod"""
name: str
pvcName: str
@dataclass(order=False)
class VolumeMount:
"""Data class to hold information regarding volume mounts"""
name: str
mountPath: str
@dataclass(frozen=True, order=False)
class PVC:
"""Data class to hold information regarding persistent volume claims"""
name: str
capacity: str
volumeName: str
podNames: List[str]
namespace: str
@dataclass(order=False)
class Container:
"""Data class to hold information regarding containers in a pod"""
image: str
name: str
volumeMounts: List[VolumeMount]
ready: bool = False
@dataclass(frozen=True, order=False)
class Pod:
"""Data class to hold information regarding a pod"""
name: str
podIP: str
namespace: str
containers: List[Container]
nodeName: str
volumes: List[Volume]
@dataclass(frozen=True, order=False)
class LitmusChaosObject:
"""Data class to hold information regarding a custom object of litmus project"""
kind: str
group: str
namespace: str
name: str
plural: str
version: str
@dataclass(frozen=True, order=False)
class ChaosEngine(LitmusChaosObject):
"""Data class to hold information regarding a ChaosEngine object"""
engineStatus: str
expStatus: str
@dataclass(frozen=True, order=False)
class ChaosResult(LitmusChaosObject):
"""Data class to hold information regarding a ChaosResult object"""
verdict: str
failStep: str


@@ -1,42 +0,0 @@
import unittest
from kraken.scenarios import kube
class TestClient(unittest.TestCase):
def test_list_all_pods(self):
c = kube.Client()
pod = c.create_test_pod()
self.addCleanup(lambda: self._remove_pod(c, pod.name, pod.namespace))
pods = c.list_all_pods()
for pod in pods:
if pod.name == pod.name and pod.namespace == pod.namespace:
return
self.fail("The created pod %s was not in the pod list." % pod.name)
def test_get_pod(self):
c = kube.Client()
pod = c.create_test_pod()
self.addCleanup(lambda: c.remove_pod(pod.name, pod.namespace))
pod2 = c.get_pod(pod.name, pod.namespace)
assert pod2.name == pod.name
assert pod2.namespace == pod.namespace
def test_get_pod_notfound(self):
c = kube.Client()
try:
c.get_pod("non-existent-pod")
self.fail("Fetching a non-existent pod did not result in a NotFoundException.")
except kube.NotFoundException:
pass
@staticmethod
def _remove_pod(c: kube.Client, pod_name: str, pod_namespace: str):
try:
c.remove_pod(pod_name, pod_namespace)
except kube.NotFoundException:
pass
if __name__ == '__main__':
unittest.main()


@@ -1,4 +1,5 @@
import kraken.invoke.command as runcommand
import kraken.kubernetes.client as kubecli
import logging
import time
import sys
@@ -86,18 +87,17 @@ def deploy_all_experiments(version_string, namespace):
def wait_for_initialized(engine_name, experiment_name, namespace):
chaos_engine = runcommand.invoke(
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.engineStatus}'" % (engine_name, namespace)
)
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
namespace=namespace).engineStatus
engine_status = chaos_engine.strip()
max_tries = 30
engine_counter = 0
while engine_status.lower() != "initialized":
time.sleep(10)
logging.info("Waiting for " + experiment_name + " to be initialized")
chaos_engine = runcommand.invoke(
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.engineStatus}'" % (engine_name, namespace)
)
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
namespace=namespace).engineStatus
engine_status = chaos_engine.strip()
if engine_counter >= max_tries:
logging.error("Chaos engine " + experiment_name + " took longer than 5 minutes to be initialized")
@@ -117,18 +117,16 @@ def wait_for_status(engine_name, expected_status, experiment_name, namespace):
if not response:
logging.info("Chaos engine never initialized, exiting")
return False
chaos_engine = runcommand.invoke(
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.experiments[0].status}'" % (engine_name, namespace)
)
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
namespace=namespace).expStatus
engine_status = chaos_engine.strip()
max_tries = 30
engine_counter = 0
while engine_status.lower() != expected_status:
time.sleep(10)
logging.info("Waiting for " + experiment_name + " to be " + expected_status)
chaos_engine = runcommand.invoke(
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.experiments[0].status}'" % (engine_name, namespace)
)
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
namespace=namespace).expStatus
engine_status = chaos_engine.strip()
if engine_counter >= max_tries:
logging.error("Chaos engine " + experiment_name + " took longer than 5 minutes to be " + expected_status)
@@ -151,20 +149,14 @@ def check_experiment(engine_name, experiment_name, namespace):
else:
sys.exit(1)
chaos_result = runcommand.invoke(
"kubectl get chaosresult %s"
"-%s -n %s -o "
"jsonpath='{.status.experimentStatus.verdict}'" % (engine_name, experiment_name, namespace)
)
chaos_result = kubecli.get_litmus_chaos_object(kind='chaosresult', name=engine_name+'-'+experiment_name,
namespace=namespace).verdict
if chaos_result == "Pass":
logging.info("Engine " + str(engine_name) + " finished with status " + str(chaos_result))
return True
else:
chaos_result = runcommand.invoke(
"kubectl get chaosresult %s"
"-%s -n %s -o jsonpath="
"'{.status.experimentStatus.failStep}'" % (engine_name, experiment_name, namespace)
)
chaos_result = kubecli.get_litmus_chaos_object(kind='chaosresult', name=engine_name+'-'+experiment_name,
namespace=namespace).failStep
logging.info("Chaos scenario:" + engine_name + " failed with error: " + str(chaos_result))
logging.info(
"See 'kubectl get chaosresult %s"
@@ -176,8 +168,7 @@ def check_experiment(engine_name, experiment_name, namespace):
# Delete all chaos engines in a given namespace
def delete_chaos_experiments(namespace):
namespace_exists = runcommand.invoke("oc get project -o name | grep -c " + namespace + " | xargs")
if namespace_exists.strip() != "0":
if kubecli.check_if_namespace_exists(namespace):
chaos_exp_exists = runcommand.invoke_no_exit("kubectl get chaosexperiment")
if "returned non-zero exit status 1" not in chaos_exp_exists:
logging.info("Deleting all litmus experiments")
@@ -187,8 +178,7 @@ def delete_chaos_experiments(namespace):
# Delete all chaos engines in a given namespace
def delete_chaos(namespace):
namespace_exists = runcommand.invoke("oc get project -o name | grep -c " + namespace + " | xargs")
if namespace_exists.strip() != "0":
if kubecli.check_if_namespace_exists(namespace):
logging.info("Deleting all litmus run objects")
chaos_engine_exists = runcommand.invoke_no_exit("kubectl get chaosengine")
if "returned non-zero exit status 1" not in chaos_engine_exists:
@@ -201,8 +191,8 @@ def delete_chaos(namespace):
def uninstall_litmus(version, litmus_namespace):
namespace_exists = runcommand.invoke("oc get project -o name | grep -c " + litmus_namespace + " | xargs")
if namespace_exists.strip() != "0":
if kubecli.check_if_namespace_exists(litmus_namespace):
logging.info("Uninstalling Litmus operator")
runcommand.invoke_no_exit(
"kubectl delete -n %s -f "


@@ -107,10 +107,7 @@ def verify_interface(test_interface, nodelst, template):
interface_lst = output[:-1].split(",")
for interface in test_interface:
if interface not in interface_lst:
logging.error(
"Interface %s not found in node %s interface list %s" % (interface, nodelst[pod_index]),
interface_lst,
)
logging.error("Interface %s not found in node %s interface list %s" % (interface, nodelst[pod_index], interface_lst))
sys.exit(1)
return test_interface
finally:


@@ -5,7 +5,6 @@ import paramiko
import kraken.kubernetes.client as kubecli
import kraken.invoke.command as runcommand
node_general = False
@@ -30,30 +29,22 @@ def get_node(node_name, label_selector, instance_kill_count):
return nodes_to_return
# Wait till node status becomes Ready
# Wait until the node status becomes Ready
def wait_for_ready_status(node, timeout):
for _ in range(timeout):
if kubecli.get_node_status(node) == "Ready":
break
time.sleep(3)
if kubecli.get_node_status(node) != "Ready":
raise Exception("Node condition status isn't Ready")
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "True", timeout, resource_version)
# Wait till node status becomes NotReady
# Wait until the node status becomes Not Ready
def wait_for_not_ready_status(node, timeout):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "False", timeout, resource_version)
# Wait until the node status becomes Unknown
def wait_for_unknown_status(node, timeout):
for _ in range(timeout):
try:
node_status = kubecli.get_node_status(node, timeout)
if node_status is None or node_status == "Unknown":
break
except Exception:
logging.error("Encountered error while getting node status, waiting 3 seconds and retrying")
time.sleep(3)
node_status = kubecli.get_node_status(node, timeout)
logging.info("node status " + str(node_status))
if node_status is not None and node_status != "Unknown":
raise Exception("Node condition status isn't Unknown after %s seconds" % str(timeout))
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "Unknown", timeout, resource_version)
# Get the ip of the cluster node
@@ -74,7 +65,11 @@ def check_service_status(node, service, ssh_private_key, timeout):
i += sleeper
logging.info("Trying to ssh to instance: %s" % (node))
connection = ssh.connect(
node, username="root", key_filename=ssh_private_key, timeout=800, banner_timeout=400
node,
username="root",
key_filename=ssh_private_key,
timeout=800,
banner_timeout=400,
)
if connection is None:
break

kraken/plugins/__init__.py Normal file (200 lines)

@@ -0,0 +1,200 @@
import dataclasses
import json
import logging
from os.path import abspath
from typing import List, Dict
from arcaflow_plugin_sdk import schema, serialization, jsonschema
import kraken.plugins.vmware.vmware_plugin as vmware_plugin
from kraken.plugins.pod_plugin import kill_pods, wait_for_pods
from kraken.plugins.run_python_plugin import run_python_file
from kraken.plugins.network.ingress_shaping import network_chaos
@dataclasses.dataclass
class PluginStep:
schema: schema.StepSchema
error_output_ids: List[str]
def render_output(self, output_id: str, output_data) -> str:
return json.dumps({
"output_id": output_id,
"output_data": self.schema.outputs[output_id].serialize(output_data),
}, indent='\t')
class Plugins:
"""
Plugins is a class that can run plugins sequentially. The output is rendered to the standard output and the process
is aborted if a step fails.
"""
steps_by_id: Dict[str, PluginStep]
def __init__(self, steps: List[PluginStep]):
self.steps_by_id = dict()
for step in steps:
if step.schema.id in self.steps_by_id:
raise Exception(
"Duplicate step ID: {}".format(step.schema.id)
)
self.steps_by_id[step.schema.id] = step
def run(self, file: str, kubeconfig_path: str):
"""
Run executes a series of steps
"""
data = serialization.load_from_file(abspath(file))
if not isinstance(data, list):
raise Exception(
"Invalid scenario configuration file: {} expected list, found {}".format(file, type(data).__name__)
)
i = 0
for entry in data:
if not isinstance(entry, dict):
raise Exception(
"Invalid scenario configuration file: {} expected a list of dict's, found {} on step {}".format(
file,
type(entry).__name__,
i
)
)
if "id" not in entry:
raise Exception(
"Invalid scenario configuration file: {} missing 'id' field on step {}".format(
file,
i,
)
)
if "config" not in entry:
raise Exception(
"Invalid scenario configuration file: {} missing 'config' field on step {}".format(
file,
i,
)
)
if entry["id"] not in self.steps_by_id:
raise Exception(
"Invalid step {} in {} ID: {} expected one of: {}".format(
i,
file,
entry["id"],
', '.join(self.steps_by_id.keys())
)
)
step = self.steps_by_id[entry["id"]]
unserialized_input = step.schema.input.unserialize(entry["config"])
if "kubeconfig_path" in step.schema.input.properties:
unserialized_input.kubeconfig_path = kubeconfig_path
output_id, output_data = step.schema(unserialized_input)
logging.info(step.render_output(output_id, output_data) + "\n")
if output_id in step.error_output_ids:
raise Exception(
"Step {} in {} ({}) failed".format(i, file, step.schema.id)
)
i = i + 1
def json_schema(self):
"""
This function generates a JSON schema document and renders it from the steps passed.
"""
result = {
"$id": "https://github.com/redhat-chaos/krkn/",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Kraken Arcaflow scenarios",
"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
"type": "array",
"minContains": 1,
"items": {
"oneOf": [
]
}
}
for step_id in self.steps_by_id.keys():
step = self.steps_by_id[step_id]
step_input = jsonschema.step_input(step.schema)
del step_input["$id"]
del step_input["$schema"]
del step_input["title"]
del step_input["description"]
result["items"]["oneOf"].append({
"type": "object",
"properties": {
"id": {
"type": "string",
"const": step_id,
},
"config": step_input,
},
"required": [
"id",
"config",
]
})
return json.dumps(result, indent="\t")
PLUGINS = Plugins(
[
PluginStep(
kill_pods,
[
"error",
]
),
PluginStep(
wait_for_pods,
[
"error"
]
),
PluginStep(
run_python_file,
[
"error"
]
),
PluginStep(
vmware_plugin.node_start,
[
"error"
]
),
PluginStep(
vmware_plugin.node_stop,
[
"error"
]
),
PluginStep(
vmware_plugin.node_reboot,
[
"error"
]
),
PluginStep(
vmware_plugin.node_terminate,
[
"error"
]
),
PluginStep(
network_chaos,
[
"error"
]
)
]
)
def run(scenarios: List[str], kubeconfig_path: str, failed_post_scenarios: List[str]) -> List[str]:
for scenario in scenarios:
try:
PLUGINS.run(scenario, kubeconfig_path)
except Exception as e:
failed_post_scenarios.append(scenario)
logging.error("Error while running {}: {}".format(scenario, e))
return failed_post_scenarios
return failed_post_scenarios


@@ -0,0 +1,4 @@
from kraken.plugins import PLUGINS
if __name__ == "__main__":
print(PLUGINS.json_schema())


@@ -4,8 +4,24 @@ import sys
import json
# Get cerberus status
def get_status(config, start_time, end_time):
"""
Function to get Cerberus status
Args:
config
- Kraken config dictionary
start_time
- The time when chaos is injected
end_time
- The time when chaos is removed
Returns:
Cerberus status
"""
cerberus_status = True
check_application_routes = False
application_routes_status = True
@@ -43,8 +59,24 @@ def get_status(config, start_time, end_time):
return cerberus_status
# Function to publish kraken status to cerberus
def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
"""
Function to publish Kraken status to Cerberus
Args:
config
- Kraken config dictionary
failed_post_scenarios
- String containing the failed post scenarios
start_time
- The time when chaos is injected
end_time
- The time when chaos is removed
"""
cerberus_status = get_status(config, start_time, end_time)
if not cerberus_status:
if failed_post_scenarios:
@@ -66,8 +98,24 @@ def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
logging.info("Cerberus status is healthy but post action scenarios " "are still failing")
# Check application availability
def application_status(cerberus_url, start_time, end_time):
"""
Function to check application availability
Args:
cerberus_url
- url where Cerberus publishes True/False signal
start_time
- The time when chaos is injected
end_time
- The time when chaos is removed
Returns:
Application status and failed routes
"""
if not cerberus_url:
logging.error("url where Cerberus publishes True/False signal is not provided.")
sys.exit(1)


@@ -0,0 +1,937 @@
from dataclasses import dataclass, field
import yaml
import logging
import time
import sys
import os
import re
from traceback import format_exc
from jinja2 import Environment, FileSystemLoader
from . import kubernetes_functions as kube_helper
from . import cerberus
import typing
from arcaflow_plugin_sdk import validation, plugin
from kubernetes.client.api.core_v1_api import CoreV1Api as CoreV1Api
from kubernetes.client.api.batch_v1_api import BatchV1Api as BatchV1Api
@dataclass
class NetworkScenarioConfig:
node_interface_name: typing.Dict[
str, typing.List[str]
] = field(
default=None,
metadata={
"name": "Node Interface Name",
"description":
"Dictionary with node names as key and values as a list of "
"their test interfaces. "
"Required if label_selector is not set.",
}
)
label_selector: typing.Annotated[
typing.Optional[str], validation.required_if_not("node_interface_name")
] = field(
default=None,
metadata={
"name": "Label selector",
"description":
"Kubernetes label selector for the target nodes. "
"Required if node_interface_name is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
"for details.",
}
)
test_duration: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=120,
metadata={
"name": "Test duration",
"description":
"Duration for which each step of the ingress chaos testing "
"is to be performed.",
},
)
wait_duration: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=300,
metadata={
"name": "Wait Duration",
"description":
"Wait duration for finishing a test and its cleanup."
"Ensure that it is significantly greater than wait_duration"
}
)
instance_count: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=1,
metadata={
"name": "Instance Count",
"description":
"Number of nodes to perform action/select that match "
"the label selector.",
}
)
kubeconfig_path: typing.Optional[str] = field(
default=None,
metadata={
"name": "Kubeconfig path",
"description":
"Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
"for details.",
}
)
execution_type: typing.Optional[str] = field(
default='parallel',
metadata={
"name": "Execution Type",
"description":
"The order in which the ingress filters are applied. "
"Execution type can be 'serial' or 'parallel'"
}
)
network_params: typing.Dict[str, str] = field(
default=None,
metadata={
"name": "Network Parameters",
"description":
"The network filters that are applied on the interface. "
"The currently supported filters are latency, "
"loss and bandwidth"
}
)
kraken_config: typing.Optional[str] = field(
default='',
metadata={
"name": "Kraken Config",
"description":
"Path to the config file of Kraken. "
"Set this field if you wish to publish status onto Cerberus"
}
)
@dataclass
class NetworkScenarioSuccessOutput:
filter_direction: str = field(
metadata={
"name": "Filter Direction",
"description":
"Direction in which the traffic control filters are applied "
"on the test interfaces"
}
)
test_interfaces: typing.Dict[str, typing.List[str]] = field(
metadata={
"name": "Test Interfaces",
"description":
"Dictionary of nodes and their interfaces on which "
"the chaos experiment was performed"
}
)
network_parameters: typing.Dict[str, str] = field(
metadata={
"name": "Network Parameters",
"description":
"The network filters that are applied on the interfaces"
}
)
execution_type: str = field(
metadata={
"name": "Execution Type",
"description": "The order in which the filters are applied"
}
)
@dataclass
class NetworkScenarioErrorOutput:
error: str = field(
metadata={
"name": "Error",
"description":
"Error message when there is a run-time error during "
"the execution of the scenario"
}
)
def get_default_interface(
node: str,
pod_template,
cli: CoreV1Api
) -> str:
"""
Function that returns the default route interface of a node
Args:
node (string)
- Node from which the interface is to be returned
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interface
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns:
Default interface (string) belonging to the node
"""
pod_body = yaml.safe_load(pod_template.render(nodename=node))
logging.info("Creating pod to query interface on node %s" % node)
kube_helper.create_pod(cli, pod_body, "default", 300)
try:
cmd = ["ip", "r"]
output = kube_helper.exec_cmd_in_pod(cli, cmd, "fedtools", "default")
if not output:
logging.error("Exception occurred while executing command in pod")
sys.exit(1)
routes = output.split('\n')
for route in routes:
if 'default' in route:
default_route = route
break
interfaces = [default_route.split()[4]]
finally:
logging.info("Deleting pod to query interface on node")
kube_helper.delete_pod(cli, "fedtools", "default")
return interfaces
def verify_interface(
input_interface_list: typing.List[str],
node: str,
pod_template,
cli: CoreV1Api
) -> typing.List[str]:
"""
Function that verifies whether a list of interfaces is present in the node.
If the list is empty, it fetches the interface of the default route
Args:
input_interface_list (List of strings)
- The interfaces to be checked on the node
node (string):
- Node on which input_interface_list is to be verified
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interfaces
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns:
The interface list for the node
"""
pod_body = yaml.safe_load(pod_template.render(nodename=node))
logging.info("Creating pod to query interface on node %s" % node)
kube_helper.create_pod(cli, pod_body, "default", 300)
try:
if input_interface_list == []:
cmd = ["ip", "r"]
output = kube_helper.exec_cmd_in_pod(
cli,
cmd,
"fedtools",
"default"
)
if not output:
logging.error(
"Exception occurred while executing command in pod"
)
sys.exit(1)
routes = output.split('\n')
for route in routes:
if 'default' in route:
default_route = route
break
input_interface_list = [default_route.split()[4]]
else:
cmd = ["ip", "-br", "addr", "show"]
output = kube_helper.exec_cmd_in_pod(
cli,
cmd,
"fedtools",
"default"
)
if not output:
logging.error(
"Exception occurred while executing command in pod"
)
sys.exit(1)
interface_ip = output.split('\n')
node_interface_list = [
interface.split()[0] for interface in interface_ip[:-1]
]
for interface in input_interface_list:
if interface not in node_interface_list:
logging.error(
"Interface %s not found in node %s interface list %s" %
(interface, node, node_interface_list)
)
raise Exception(
"Interface %s not found in node %s interface list %s" %
(interface, node, node_interface_list)
)
finally:
logging.info("Deleteing pod to query interface on node")
kube_helper.delete_pod(cli, "fedtools", "default")
return input_interface_list
def get_node_interfaces(
node_interface_dict: typing.Dict[str, typing.List[str]],
label_selector: str,
instance_count: int,
pod_template,
cli: CoreV1Api
) -> typing.Dict[str, typing.List[str]]:
"""
Function that processes the input dictionary of nodes and
their test interfaces.
If the dictionary is empty, the label selector is used to select the nodes,
and then a random interface on each node is chosen as a test interface.
If the dictionary is not empty, it is filtered to include the nodes which
are active and then their interfaces are verified to be present
Args:
node_interface_dict (Dictionary with keys as node name and value as
a list of interface names)
- Nodes and their interfaces for the scenario
label_selector (string):
- Label selector to get nodes if node_interface_dict is empty
instance_count (int):
- Number of nodes to fetch in case node_interface_dict is empty
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interfaces
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns:
Filtered dictionary containing the test nodes and their test interfaces
"""
if not node_interface_dict:
if not label_selector:
raise Exception(
"If node names and interfaces aren't provided, "
"then the label selector must be provided"
)
nodes = kube_helper.get_node(None, label_selector, instance_count, cli)
node_interface_dict = {}
for node in nodes:
node_interface_dict[node] = get_default_interface(
node,
pod_template,
cli
)
else:
node_name_list = node_interface_dict.keys()
filtered_node_list = []
for node in node_name_list:
filtered_node_list.extend(
kube_helper.get_node(node, label_selector, instance_count, cli)
)
for node in filtered_node_list:
node_interface_dict[node] = verify_interface(
node_interface_dict[node], node, pod_template, cli
)
return node_interface_dict
def apply_ingress_filter(
cfg: NetworkScenarioConfig,
interface_list: typing.List[str],
node: str,
pod_template,
job_template,
batch_cli: BatchV1Api,
cli: CoreV1Api,
create_interfaces: bool = True,
param_selector: str = 'all'
) -> str:
"""
Function that applies the filters to shape incoming traffic to
the provided node's interfaces.
This is done by adding a virtual interface before each physical interface
and then performing egress traffic control on the virtual interface
Args:
cfg (NetworkScenarioConfig)
- Configurations used in this scenario
interface_list (List of strings)
- The interfaces on the node on which the filter is applied
node (string):
- Node on which the interfaces in interface_list are present
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to create
virtual interfaces on the node
job_template (jinja2.environment.Template)
- The YAML template used to instantiate a job to apply and remove
the filters on the interfaces
batch_cli
- Object to interact with Kubernetes Python client's BatchV1 API
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
param_selector (string)
- Used to specify what kind of filter to apply. Useful during
serial execution mode. Default value is 'all'
Returns:
The name of the job created that executes the commands on a node
for ingress chaos scenario
"""
network_params = cfg.network_params
if param_selector != 'all':
network_params = {param_selector: cfg.network_params[param_selector]}
if create_interfaces:
create_virtual_interfaces(cli, interface_list, node, pod_template)
exec_cmd = get_ingress_cmd(
interface_list, network_params, duration=cfg.test_duration
)
logging.info("Executing %s on node %s" % (exec_cmd, node))
job_body = yaml.safe_load(
job_template.render(
jobname=str(hash(node))[:5],
nodename=node,
cmd=exec_cmd
)
)
api_response = kube_helper.create_job(batch_cli, job_body)
if api_response is None:
raise Exception("Error creating job")
return job_body["metadata"]["name"]
def create_virtual_interfaces(
cli: CoreV1Api,
interface_list: typing.List[str],
node: str,
pod_template
) -> None:
"""
Function that creates a privileged pod and uses it to create
virtual interfaces on the node
Args:
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
interface_list (List of strings)
- The list of interfaces on the node for which virtual interfaces
are to be created
node (string)
- The node on which the virtual interfaces are created
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to create
virtual interfaces on the node
"""
pod_body = yaml.safe_load(
pod_template.render(nodename=node)
)
kube_helper.create_pod(cli, pod_body, "default", 300)
logging.info(
"Creating {0} virtual interfaces on node {1} using a pod".format(
len(interface_list),
node
)
)
create_ifb(cli, len(interface_list), 'modtools')
logging.info("Deleting pod used to create virtual interfaces")
kube_helper.delete_pod(cli, "modtools", "default")
def delete_virtual_interfaces(
cli: CoreV1Api,
node_list: typing.List[str],
pod_template
):
"""
Function that creates a privileged pod and uses it to delete all
virtual interfaces on the specified nodes
Args:
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
node_list (List of strings)
- The list of nodes on which the virtual interfaces are
to be deleted
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to delete
virtual interfaces on the node
"""
for node in node_list:
pod_body = yaml.safe_load(
pod_template.render(nodename=node)
)
kube_helper.create_pod(cli, pod_body, "default", 300)
logging.info(
"Deleting all virtual interfaces on node {0}".format(node)
)
delete_ifb(cli, 'modtools')
kube_helper.delete_pod(cli, "modtools", "default")
def create_ifb(cli: CoreV1Api, number: int, pod_name: str):
"""
Function that creates virtual interfaces in a pod.
Makes use of modprobe commands
"""
exec_command = [
'chroot', '/host',
'modprobe', 'ifb', 'numifbs=' + str(number)
]
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
for i in range(0, number):
exec_command = ['chroot', '/host', 'ip', 'link', 'set', 'dev']
exec_command += ['ifb' + str(i), 'up']
kube_helper.exec_cmd_in_pod(
cli,
exec_command,
pod_name,
'default'
)
def delete_ifb(cli: CoreV1Api, pod_name: str):
"""
Function that deletes all virtual interfaces in a pod.
Makes use of modprobe command
"""
exec_command = ['chroot', '/host', 'modprobe', '-r', 'ifb']
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
def get_job_pods(cli: CoreV1Api, api_response):
"""
Function that gets the pod corresponding to the job
Args:
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
api_response
- The API response for the job status
Returns
Pod corresponding to the job
"""
controllerUid = api_response.metadata.labels["controller-uid"]
pod_label_selector = "controller-uid=" + controllerUid
pods_list = kube_helper.list_pods(
cli,
label_selector=pod_label_selector,
namespace="default"
)
return pods_list[0]
def wait_for_job(
batch_cli: BatchV1Api,
job_list: typing.List[str],
timeout: int = 300
) -> None:
"""
Function that waits for a list of jobs to finish within a time period
Args:
batch_cli (BatchV1Api)
- Object to interact with Kubernetes Python client's BatchV1 API
job_list (List of strings)
- The list of jobs to check for completion
timeout (int)
- Max duration to wait for checking whether the jobs are completed
"""
wait_time = time.time() + timeout
count = 0
job_len = len(job_list)
while count != job_len:
# iterate over a copy because completed jobs are removed from job_list
for job_name in job_list[:]:
try:
api_response = kube_helper.get_job_status(
batch_cli,
job_name,
namespace="default"
)
if (
api_response.status.succeeded is not None or
api_response.status.failed is not None
):
count += 1
job_list.remove(job_name)
except Exception:
logging.warn("Exception in getting job status")
if time.time() > wait_time:
raise Exception(
"Jobs did not complete within "
"the {0}s timeout period".format(timeout)
)
time.sleep(5)
def delete_jobs(
cli: CoreV1Api,
batch_cli: BatchV1Api,
job_list: typing.List[str]
):
"""
Function that deletes jobs
Args:
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
batch_cli (BatchV1Api)
- Object to interact with Kubernetes Python client's BatchV1 API
job_list (List of strings)
- The list of jobs to delete
"""
for job_name in job_list:
try:
api_response = kube_helper.get_job_status(
batch_cli,
job_name,
namespace="default"
)
if api_response.status.failed is not None:
pod_name = get_job_pods(cli, api_response)
pod_stat = kube_helper.read_pod(
cli,
name=pod_name,
namespace="default"
)
logging.error(pod_stat.status.container_statuses)
pod_log_response = kube_helper.get_pod_log(
cli,
name=pod_name,
namespace="default"
)
pod_log = pod_log_response.data.decode("utf-8")
logging.error(pod_log)
except Exception as e:
logging.warn("Exception in getting job status: %s" % str(e))
api_response = kube_helper.delete_job(
batch_cli,
name=job_name,
namespace="default"
)
def get_ingress_cmd(
interface_list: typing.List[str],
network_parameters: typing.Dict[str, str],
duration: int = 300
):
"""
Function that returns the commands to the ingress traffic shaping on
the node.
First, the virtual interfaces created are linked to the test interfaces
such that there is a one-to-one mapping between a virtual interface and
a test interface.
Then, incoming traffic to each test interface is forced to first pass
through the corresponding virtual interface.
Linux's tc commands are then used to perform egress traffic control
on the virtual interface. Since the outbound traffic from
the virtual interface passes through the test interface, this is
effectively ingress traffic control.
After a certain time interval, the traffic is restored to normal
Args:
interface_list (List of strings)
- Test interface list
network_parameters (Dictionary with key and value as string)
- Loss/Delay/Bandwidth and their corresponding values
duration (int)
- Duration for which the traffic control is to be done
Returns:
The traffic shaping commands as a string
"""
tc_set = tc_unset = tc_ls = ""
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
interface_pattern = re.compile(r"^[a-z0-9\-\@\_]+$")
ifb_pattern = re.compile(r"^ifb[0-9]+$")
for i, interface in enumerate(interface_list):
if not interface_pattern.match(interface):
logging.error(
"Interface name can only consist of alphanumeric characters"
)
raise Exception(
"Interface '{0}' does not match the required regex pattern :"
r" ^[a-z0-9\-\@\_]+$".format(interface)
)
ifb_name = "ifb{0}".format(i)
if not ifb_pattern.match(ifb_name):
logging.error("Invalid IFB name")
raise Exception(
"Interface '{0}' is an invalid IFB name. IFB name should "
"follow the regex pattern ^ifb[0-9]+$".format(ifb_name)
)
tc_set += "tc qdisc add dev {0} handle ffff: ingress;".format(
interface
)
tc_set += "tc filter add dev {0} parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev {1};".format( # noqa
interface,
ifb_name
)
tc_set = "{0} tc qdisc add dev {1} root netem".format(tc_set, ifb_name)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, ifb_name)
tc_unset += "tc qdisc del dev {0} handle ffff: ingress;".format(
interface
)
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, ifb_name)
for parameter in network_parameters.keys():
tc_set += " {0} {1} ".format(
param_map[parameter],
network_parameters[parameter]
)
tc_set += ";"
exec_cmd = "{0} {1} sleep {2};{3} sleep 20;{4}".format(
tc_set,
tc_ls,
duration,
tc_unset,
tc_ls
)
return exec_cmd
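# Illustrative output (hypothetical values): for interface_list=["ens3"],
# network_parameters={"latency": "50ms"} and duration=60, the returned string
# is roughly:
#   tc qdisc add dev ens3 handle ffff: ingress;
#   tc filter add dev ens3 parent ffff: protocol ip u32 match u32 0 0 \
#       action mirred egress redirect dev ifb0;
#   tc qdisc add dev ifb0 root netem delay 50ms ;
#   tc qdisc ls dev ifb0 ; sleep 60;
#   tc qdisc del dev ifb0 root ; tc qdisc del dev ens3 handle ffff: ingress;
#   sleep 20; tc qdisc ls dev ifb0 ;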
@plugin.step(
id="network_chaos",
name="Network Ingress",
description="Applies filters to ihe ingress side of node(s) interfaces",
outputs={
"success": NetworkScenarioSuccessOutput,
"error": NetworkScenarioErrorOutput
},
)
def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
str,
typing.Union[
NetworkScenarioSuccessOutput,
NetworkScenarioErrorOutput
]
]:
"""
Function that performs the ingress network chaos scenario based
on the provided configuration
Args:
cfg (NetworkScenarioConfig)
- The object containing the configuration for the scenario
Returns
A 'success' or 'error' message along with their details
"""
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader)
job_template = env.get_template("job.j2")
pod_interface_template = env.get_template("pod_interface.j2")
pod_module_template = env.get_template("pod_module.j2")
cli, batch_cli = kube_helper.setup_kubernetes(cfg.kubeconfig_path)
try:
node_interface_dict = get_node_interfaces(
cfg.node_interface_name,
cfg.label_selector,
cfg.instance_count,
pod_interface_template,
cli
)
except Exception:
return "error", NetworkScenarioErrorOutput(
format_exc()
)
job_list = []
publish = False
if cfg.kraken_config:
failed_post_scenarios = ""
try:
with open(cfg.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error(
"Error reading Kraken config from %s" % cfg.kraken_config
)
return "error", NetworkScenarioErrorOutput(
format_exc()
)
publish = True
try:
if cfg.execution_type == 'parallel':
for node in node_interface_dict:
job_list.append(
apply_ingress_filter(
cfg,
node_interface_dict[node],
node,
pod_module_template,
job_template,
batch_cli,
cli
)
)
logging.info("Waiting for parallel job to finish")
start_time = int(time.time())
wait_for_job(batch_cli, job_list[:], cfg.wait_duration)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
elif cfg.execution_type == 'serial':
create_interfaces = True
for param in cfg.network_params:
for node in node_interface_dict:
job_list.append(
apply_ingress_filter(
cfg,
node_interface_dict[node],
node,
pod_module_template,
job_template,
batch_cli,
cli,
create_interfaces=create_interfaces,
param_selector=param
)
)
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
wait_for_job(batch_cli, job_list[:], cfg.wait_duration)
logging.info("Deleting jobs")
delete_jobs(cli, batch_cli, job_list[:])
job_list = []
logging.info(
"Waiting for wait_duration : %ss" % cfg.wait_duration
)
time.sleep(cfg.wait_duration)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
create_interfaces = False
else:
return "error", NetworkScenarioErrorOutput(
"Invalid execution type - serial and parallel are "
"the only accepted types"
)
return "success", NetworkScenarioSuccessOutput(
filter_direction="ingress",
test_interfaces=node_interface_dict,
network_parameters=cfg.network_params,
execution_type=cfg.execution_type
)
except Exception as e:
logging.error("Network Chaos exiting due to Exception - %s" % e)
return "error", NetworkScenarioErrorOutput(
format_exc()
)
finally:
delete_virtual_interfaces(
cli,
node_interface_dict.keys(),
pod_module_template
)
logging.info("Deleting jobs(if any)")
delete_jobs(cli, batch_cli, job_list[:])

View File

@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
name: chaos-{{jobname}}
spec:
template:
spec:
nodeName: {{nodename}}
hostNetwork: true
containers:
- name: networkchaos
image: docker.io/fedora/tools
command: ["/bin/sh", "-c", "{{cmd}}"]
securityContext:
privileged: true
volumeMounts:
- mountPath: /lib/modules
name: lib-modules
readOnly: true
volumes:
- name: lib-modules
hostPath:
path: /lib/modules
restartPolicy: Never
backoffLimit: 0

View File

@@ -0,0 +1,284 @@
from kubernetes import config, client
from kubernetes.client.rest import ApiException
from kubernetes.stream import stream
import sys
import time
import logging
import random
def setup_kubernetes(kubeconfig_path):
"""
Sets up the Kubernetes client
"""
if kubeconfig_path is None:
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
config.load_kube_config(kubeconfig_path)
cli = client.CoreV1Api()
batch_cli = client.BatchV1Api()
return cli, batch_cli
def create_job(batch_cli, body, namespace="default"):
"""
Function used to create a job from a YAML config
"""
try:
api_response = batch_cli.create_namespaced_job(body=body, namespace=namespace)
return api_response
except ApiException as api:
logging.warn(
"Exception when calling \
BatchV1Api->create_job: %s"
% api
)
if api.status == 409:
logging.warn("Job already present")
except Exception as e:
logging.error(
"Exception when calling \
BatchV1Api->create_namespaced_job: %s"
% e
)
raise
def delete_pod(cli, name, namespace):
"""
Function that deletes a pod and waits until deletion is complete
"""
try:
cli.delete_namespaced_pod(name=name, namespace=namespace)
while cli.read_namespaced_pod(name=name, namespace=namespace):
time.sleep(1)
except ApiException as e:
if e.status == 404:
logging.info("Pod deleted")
else:
logging.error("Failed to delete pod %s" % e)
raise e
def create_pod(cli, body, namespace, timeout=120):
"""
Function used to create a pod from a YAML config
"""
try:
pod_stat = None
pod_stat = cli.create_namespaced_pod(body=body, namespace=namespace)
end_time = time.time() + timeout
while True:
pod_stat = cli.read_namespaced_pod(name=body["metadata"]["name"], namespace=namespace)
if pod_stat.status.phase == "Running":
break
if time.time() > end_time:
raise Exception("Starting pod failed")
time.sleep(1)
except Exception as e:
logging.error("Pod creation failed %s" % e)
if pod_stat:
logging.error(pod_stat.status.container_statuses)
delete_pod(cli, body["metadata"]["name"], namespace)
sys.exit(1)
def exec_cmd_in_pod(cli, command, pod_name, namespace, container=None):
"""
Function used to execute a command in a running pod
"""
exec_command = command
try:
if container:
ret = stream(
cli.connect_get_namespaced_pod_exec,
pod_name,
namespace,
container=container,
command=exec_command,
stderr=True,
stdin=False,
stdout=True,
tty=False,
)
else:
ret = stream(
cli.connect_get_namespaced_pod_exec,
pod_name,
namespace,
command=exec_command,
stderr=True,
stdin=False,
stdout=True,
tty=False,
)
except Exception as e:
logging.warn("Failed to exec command in pod %s: %s" % (pod_name, e))
return False
return ret
def create_ifb(cli, number, pod_name):
"""
Function that creates virtual interfaces in a pod. Makes use of modprobe commands
"""
exec_command = ['chroot', '/host', 'modprobe', 'ifb', 'numifbs=' + str(number)]
exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
for i in range(0, number):
exec_command = ['chroot', '/host', 'ip', 'link', 'set', 'dev']
exec_command += ['ifb' + str(i), 'up']
exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
def delete_ifb(cli, pod_name):
"""
Function that deletes all virtual interfaces in a pod. Makes use of modprobe command
"""
exec_command = ['chroot', '/host', 'modprobe', '-r', 'ifb']
exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
def list_pods(cli, namespace, label_selector=None):
"""
Function used to list pods in a given namespace and having a certain label
"""
pods = []
try:
if label_selector:
ret = cli.list_namespaced_pod(namespace, pretty=True, label_selector=label_selector)
else:
ret = cli.list_namespaced_pod(namespace, pretty=True)
except ApiException as e:
logging.error(
"Exception when calling \
CoreV1Api->list_namespaced_pod: %s\n"
% e
)
raise e
for pod in ret.items:
pods.append(pod.metadata.name)
return pods
def get_job_status(batch_cli, name, namespace="default"):
"""
Function that retrieves the status of a running job in a given namespace
"""
try:
return batch_cli.read_namespaced_job_status(name=name, namespace=namespace)
except Exception as e:
logging.error(
"Exception when calling \
BatchV1Api->read_namespaced_job_status: %s"
% e
)
raise
def get_pod_log(cli, name, namespace="default"):
"""
Function that retrieves the logs of a running pod in a given namespace
"""
return cli.read_namespaced_pod_log(
name=name, namespace=namespace, _return_http_data_only=True, _preload_content=False
)
def read_pod(cli, name, namespace="default"):
"""
Function that retrieves the info of a running pod in a given namespace
"""
return cli.read_namespaced_pod(name=name, namespace=namespace)
def delete_job(batch_cli, name, namespace="default"):
"""
Deletes a job with the input name and namespace
"""
try:
api_response = batch_cli.delete_namespaced_job(
name=name,
namespace=namespace,
body=client.V1DeleteOptions(propagation_policy="Foreground", grace_period_seconds=0),
)
logging.debug("Job deleted. status='%s'" % str(api_response.status))
return api_response
except ApiException as api:
logging.warn(
"Exception when calling \
BatchV1Api->create_namespaced_job: %s"
% api
)
logging.warn("Job already deleted\n")
except Exception as e:
logging.error(
"Exception when calling \
BatchV1Api->delete_namespaced_job: %s\n"
% e
)
sys.exit(1)
def list_ready_nodes(cli, label_selector=None):
"""
Returns a list of ready nodes
"""
nodes = []
try:
if label_selector:
ret = cli.list_node(pretty=True, label_selector=label_selector)
else:
ret = cli.list_node(pretty=True)
except ApiException as e:
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
raise e
for node in ret.items:
for cond in node.status.conditions:
if str(cond.type) == "Ready" and str(cond.status) == "True":
nodes.append(node.metadata.name)
return nodes
def get_node(node_name, label_selector, instance_kill_count, cli):
"""
Returns active node(s) on which the scenario can be performed
"""
if node_name in list_ready_nodes(cli):
return [node_name]
elif node_name:
logging.info(
"Node with provided node_name does not exist or the node might "
"be in NotReady state."
)
nodes = list_ready_nodes(cli, label_selector)
if not nodes:
raise Exception("Ready nodes with the provided label selector do not exist")
logging.info(
"Ready nodes with the label selector %s: %s" % (label_selector, nodes)
)
number_of_nodes = len(nodes)
if instance_kill_count == number_of_nodes:
return nodes
nodes_to_return = []
for i in range(instance_kill_count):
node_to_add = nodes[random.randint(0, len(nodes) - 1)]
nodes_to_return.append(node_to_add)
nodes.remove(node_to_add)
return nodes_to_return

View File

@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
name: fedtools
spec:
hostNetwork: true
nodeName: {{nodename}}
containers:
- name: fedtools
image: docker.io/fedora/tools
command:
- /bin/sh
- -c
- "trap : TERM INT; sleep infinity & wait"
securityContext:
privileged: true

View File

@@ -0,0 +1,30 @@
apiVersion: v1
kind: Pod
metadata:
name: modtools
spec:
nodeName: {{nodename}}
containers:
- name: modtools
image: docker.io/fedora/tools
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- "trap : TERM INT; sleep infinity & wait"
tty: true
stdin: true
stdinOnce: true
securityContext:
privileged: true
volumeMounts:
- name: host
mountPath: /host
volumes:
- name: host
hostPath:
path: /
hostNetwork: true
hostIPC: true
hostPID: true
restartPolicy: Never
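# This privileged pod mounts the host root filesystem at /host so the network
# chaos plugin can run "chroot /host modprobe ifb ..." (see create_ifb and
# delete_ifb) to add and remove ifb virtual interfaces on the node.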

269
kraken/plugins/pod_plugin.py Executable file
View File

@@ -0,0 +1,269 @@
#!/usr/bin/env python
import re
import sys
import time
import typing
from dataclasses import dataclass, field
import random
from datetime import datetime
from traceback import format_exc
from kubernetes import config, client
from kubernetes.client import V1PodList, V1Pod, ApiException, V1DeleteOptions
from arcaflow_plugin_sdk import validation, plugin, schema
def setup_kubernetes(kubeconfig_path):
if kubeconfig_path is None:
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
if kubeconfig.config is None:
raise Exception(
'Invalid kube-config file: %s. '
'No configuration found.' % kubeconfig_path
)
loader = config.kube_config.KubeConfigLoader(
config_dict=kubeconfig.config,
)
client_config = client.Configuration()
loader.load_and_set(client_config)
return client.ApiClient(configuration=client_config)
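# The ApiClient returned here is used as a context manager by the plugin steps
# below (e.g. "with setup_kubernetes(None) as cli:"), so the underlying
# connection pool is closed when a step finishes.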
def _find_pods(core_v1, label_selector, name_pattern, namespace_pattern):
pods: typing.List[V1Pod] = []
_continue = None
finished = False
while not finished:
# pass the continue token so paginated list responses are walked correctly
pod_response: V1PodList = core_v1.list_pod_for_all_namespaces(
watch=False,
label_selector=label_selector,
_continue=_continue
)
for pod in pod_response.items:
pod: V1Pod
if (name_pattern is None or name_pattern.match(pod.metadata.name)) and \
namespace_pattern.match(pod.metadata.namespace):
pods.append(pod)
_continue = pod_response.metadata._continue
if _continue is None:
finished = True
return pods
@dataclass
class Pod:
namespace: str
name: str
@dataclass
class PodKillSuccessOutput:
pods: typing.Dict[int, Pod] = field(metadata={
"name": "Pods removed",
"description": "Map between timestamps and the pods removed. The timestamp is provided in nanoseconds."
})
@dataclass
class PodWaitSuccessOutput:
pods: typing.List[Pod] = field(metadata={
"name": "Pods",
"description": "List of pods that have been found to run."
})
@dataclass
class PodErrorOutput:
error: str
@dataclass
class KillPodConfig:
"""
This is a configuration structure specific to the pod kill scenario. It describes which pods from
which namespace(s) to select for killing and how many pods to kill.
"""
namespace_pattern: re.Pattern = field(metadata={
"name": "Namespace pattern",
"description": "Regular expression for target pod namespaces."
})
name_pattern: typing.Annotated[
typing.Optional[re.Pattern],
validation.required_if_not("label_selector")
] = field(default=None, metadata={
"name": "Name pattern",
"description": "Regular expression for target pods. Required if label_selector is not set."
})
kill: typing.Annotated[int, validation.min(1)] = field(
default=1,
metadata={"name": "Number of pods to kill", "description": "How many pods should we attempt to kill?"}
)
label_selector: typing.Annotated[
typing.Optional[str],
validation.min(1),
validation.required_if_not("name_pattern")
] = field(default=None, metadata={
"name": "Label selector",
"description": "Kubernetes label selector for the target pods. Required if name_pattern is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details."
})
kubeconfig_path: typing.Optional[str] = field(default=None, metadata={
"name": "Kubeconfig path",
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for "
"details."
})
timeout: int = field(default=180, metadata={
"name": "Timeout",
"description": "Timeout to wait for the target pod(s) to be removed in seconds."
})
backoff: int = field(default=1, metadata={
"name": "Backoff",
"description": "How many seconds to wait between checks for the target pod status."
})
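# Example configuration (illustrative values only):
#   KillPodConfig(
#       namespace_pattern=re.compile(r"^openshift-etcd$"),
#       label_selector="k8s-app=etcd",
#       kill=1,
#   )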
@plugin.step(
"kill-pods",
"Kill pods",
"Kill pods as specified by parameters",
{"success": PodKillSuccessOutput, "error": PodErrorOutput}
)
def kill_pods(cfg: KillPodConfig) -> typing.Tuple[str, typing.Union[PodKillSuccessOutput, PodErrorOutput]]:
try:
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
# region Select target pods
pods = _find_pods(core_v1, cfg.label_selector, cfg.name_pattern, cfg.namespace_pattern)
if len(pods) < cfg.kill:
return "error", PodErrorOutput(
"Not enough pods match the criteria, expected {} but found only {} pods".format(cfg.kill, len(pods))
)
random.shuffle(pods)
# endregion
# region Remove pods
killed_pods: typing.Dict[int, Pod] = {}
watch_pods: typing.List[Pod] = []
for i in range(cfg.kill):
pod = pods[i]
core_v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace, body=V1DeleteOptions(
grace_period_seconds=0,
))
p = Pod(
pod.metadata.namespace,
pod.metadata.name
)
killed_pods[int(time.time_ns())] = p
watch_pods.append(p)
# endregion
# region Wait for pods to be removed
start_time = time.time()
while len(watch_pods) > 0:
time.sleep(cfg.backoff)
new_watch_pods: typing.List[Pod] = []
for p in watch_pods:
try:
core_v1.read_namespaced_pod(p.name, p.namespace)
new_watch_pods.append(p)
except ApiException as e:
if e.status != 404:
raise
watch_pods = new_watch_pods
current_time = time.time()
if current_time - start_time > cfg.timeout:
return "error", PodErrorOutput("Timeout while waiting for pods to be removed.")
return "success", PodKillSuccessOutput(killed_pods)
# endregion
except Exception:
return "error", PodErrorOutput(
format_exc()
)
@dataclass
class WaitForPodsConfig:
"""
WaitForPodsConfig is a configuration structure for wait-for-pod steps.
"""
namespace_pattern: re.Pattern
name_pattern: typing.Annotated[
typing.Optional[re.Pattern],
validation.required_if_not("label_selector")
] = None
label_selector: typing.Annotated[
typing.Optional[str],
validation.min(1),
validation.required_if_not("name_pattern")
] = None
count: typing.Annotated[int, validation.min(1)] = field(
default=1,
metadata={"name": "Pod count", "description": "Wait for at least this many pods to exist"}
)
timeout: typing.Annotated[int, validation.min(1)] = field(
default=180,
metadata={"name": "Timeout", "description": "How many seconds to wait for?"}
)
backoff: int = field(default=1, metadata={
"name": "Backoff",
"description": "How many seconds to wait between checks for the target pod status."
})
kubeconfig_path: typing.Optional[str] = None
@plugin.step(
"wait-for-pods",
"Wait for pods",
"Wait for the specified number of pods to be present",
{"success": PodWaitSuccessOutput, "error": PodErrorOutput}
)
def wait_for_pods(cfg: WaitForPodsConfig) -> typing.Tuple[str, typing.Union[PodWaitSuccessOutput, PodErrorOutput]]:
try:
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
timeout = False
start_time = datetime.now()
while not timeout:
pods = _find_pods(core_v1, cfg.label_selector, cfg.name_pattern, cfg.namespace_pattern)
if len(pods) >= cfg.count:
return "success", \
PodWaitSuccessOutput(list(map(lambda p: Pod(p.metadata.namespace, p.metadata.name), pods)))
time.sleep(cfg.backoff)
now_time = datetime.now()
time_diff = now_time - start_time
if time_diff.seconds > cfg.timeout:
return "error", PodErrorOutput(
"timeout while waiting for pods to come up"
)
except Exception:
return "error", PodErrorOutput(
format_exc()
)
if __name__ == "__main__":
sys.exit(plugin.run(plugin.build_schema(
kill_pods,
wait_for_pods,
)))
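# A minimal sketch of invoking the kill-pods step programmatically instead of
# via the CLI entry point above (assumes a hypothetical "input.yaml" and the
# serialization/call_step helpers used by the scenario runner elsewhere in
# this changeset):
#   from arcaflow_plugin_sdk import serialization
#   s = plugin.build_schema(kill_pods, wait_for_pods)
#   input_data = s.unserialize_input(
#       "kill-pods", serialization.load_from_file("input.yaml")
#   )
#   output_id, output_data = s.call_step("kill-pods", input_data)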

View File

@@ -0,0 +1,50 @@
import dataclasses
import subprocess
import sys
import typing
from arcaflow_plugin_sdk import plugin
@dataclasses.dataclass
class RunPythonFileInput:
filename: str
@dataclasses.dataclass
class RunPythonFileOutput:
stdout: str
stderr: str
@dataclasses.dataclass
class RunPythonFileError:
exit_code: int
stdout: str
stderr: str
@plugin.step(
id="run_python",
name="Run a Python script",
description="Run a specified Python script",
outputs={"success": RunPythonFileOutput, "error": RunPythonFileError}
)
def run_python_file(params: RunPythonFileInput) -> typing.Tuple[
str,
typing.Union[RunPythonFileOutput, RunPythonFileError]
]:
run_results = subprocess.run(
[sys.executable, params.filename],
capture_output=True
)
if run_results.returncode == 0:
return "success", RunPythonFileOutput(
str(run_results.stdout, 'utf-8'),
str(run_results.stderr, 'utf-8')
)
return "error", RunPythonFileError(
run_results.returncode,
str(run_results.stdout, 'utf-8'),
str(run_results.stderr, 'utf-8')
)
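# Illustrative direct call (hypothetical script path):
#   output_id, output = run_python_file(RunPythonFileInput("post_check.py"))
#   # output_id is "success" or "error"; output carries the captured
#   # stdout/stderr of the script.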

View File

@@ -0,0 +1,179 @@
from kubernetes import config, client
from kubernetes.client.rest import ApiException
import logging
import random
from enum import Enum
class Actions(Enum):
"""
This enumeration indicates different kinds of node operations
"""
START = "Start"
STOP = "Stop"
TERMINATE = "Terminate"
REBOOT = "Reboot"
def setup_kubernetes(kubeconfig_path):
"""
Sets up the Kubernetes client
"""
if kubeconfig_path is None:
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
if kubeconfig.config is None:
raise Exception(
"Invalid kube-config file: %s. " "No configuration found." % kubeconfig_path
)
loader = config.kube_config.KubeConfigLoader(
config_dict=kubeconfig.config,
)
client_config = client.Configuration()
loader.load_and_set(client_config)
return client.ApiClient(configuration=client_config)
def list_killable_nodes(core_v1, label_selector=None):
"""
Returns a list of nodes that can be stopped/reset/released
"""
nodes = []
try:
if label_selector:
ret = core_v1.list_node(pretty=True, label_selector=label_selector)
else:
ret = core_v1.list_node(pretty=True)
except ApiException as e:
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
raise e
for node in ret.items:
for cond in node.status.conditions:
if str(cond.type) == "Ready" and str(cond.status) == "True":
nodes.append(node.metadata.name)
return nodes
def list_startable_nodes(core_v1, label_selector=None):
"""
Returns a list of nodes that can be started
"""
nodes = []
try:
if label_selector:
ret = core_v1.list_node(pretty=True, label_selector=label_selector)
else:
ret = core_v1.list_node(pretty=True)
except ApiException as e:
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
raise e
for node in ret.items:
for cond in node.status.conditions:
if str(cond.type) == "Ready" and str(cond.status) != "True":
nodes.append(node.metadata.name)
return nodes
def get_node_list(cfg, action, core_v1):
"""
Returns a list of nodes to be used in the node scenarios. The list returned is constructed as follows:
- If the key 'name' is present in the node scenario config, the value is extracted and split into
a list
- Each node in the list is fed to the get_node function which checks if the node is killable or
fetches the node using the label selector
"""
def get_node(node_name, label_selector, instance_kill_count, action, core_v1):
list_nodes_func = (
list_startable_nodes if action == Actions.START else list_killable_nodes
)
if node_name in list_nodes_func(core_v1):
return [node_name]
elif node_name:
logging.info(
"Node with provided node_name does not exist or the node might "
"be in NotReady state."
)
nodes = list_nodes_func(core_v1, label_selector)
if not nodes:
raise Exception("Ready nodes with the provided label selector do not exist")
logging.info(
"Ready nodes with the label selector %s: %s" % (label_selector, nodes)
)
number_of_nodes = len(nodes)
if instance_kill_count == number_of_nodes:
return nodes
nodes_to_return = []
for i in range(instance_kill_count):
node_to_add = nodes[random.randint(0, len(nodes) - 1)]
nodes_to_return.append(node_to_add)
nodes.remove(node_to_add)
return nodes_to_return
if cfg.name:
input_nodes = cfg.name.split(",")
else:
input_nodes = [""]
scenario_nodes = set()
if cfg.skip_openshift_checks:
scenario_nodes = input_nodes
else:
for node in input_nodes:
nodes = get_node(
node, cfg.label_selector, cfg.instance_count, action, core_v1
)
scenario_nodes.update(nodes)
return list(scenario_nodes)
def watch_node_status(node, status, timeout, watch_resource, core_v1):
"""
Monitor the status of a node for change
"""
count = timeout
for event in watch_resource.stream(
core_v1.list_node,
field_selector=f"metadata.name={node}",
timeout_seconds=timeout,
):
conditions = [
cond
for cond in event["object"].status.conditions
if cond.type == "Ready"
]
if conditions[0].status == status:
watch_resource.stop()
break
else:
count -= 1
logging.info("Status of node " + node + ": " + str(conditions[0].status))
if not count:
watch_resource.stop()
def wait_for_ready_status(node, timeout, watch_resource, core_v1):
"""
Wait until the node status becomes Ready
"""
watch_node_status(node, "True", timeout, watch_resource, core_v1)
def wait_for_not_ready_status(node, timeout, watch_resource, core_v1):
"""
Wait until the node status becomes Not Ready
"""
watch_node_status(node, "False", timeout, watch_resource, core_v1)
def wait_for_unknown_status(node, timeout, watch_resource, core_v1):
"""
Wait until the node status becomes Unknown
"""
watch_node_status(node, "Unknown", timeout, watch_resource, core_v1)

View File

@@ -0,0 +1,770 @@
#!/usr/bin/env python
import logging
import random
import sys
import time
import typing
from dataclasses import dataclass, field
from os import environ
from traceback import format_exc
import requests
from arcaflow_plugin_sdk import plugin, validation
from com.vmware.vapi.std.errors_client import (AlreadyInDesiredState,
NotAllowedInCurrentState)
from com.vmware.vcenter.vm_client import Power
from com.vmware.vcenter_client import VM, ResourcePool
from kubernetes import client, watch
from vmware.vapi.vsphere.client import create_vsphere_client
from kraken.plugins.vmware import kubernetes_functions as kube_helper
class vSphere:
def __init__(self, verify=True):
"""
Initialize the vSphere client using the env variables:
'VSPHERE_IP', 'VSPHERE_USERNAME', 'VSPHERE_PASSWORD'
"""
self.server = environ.get("VSPHERE_IP")
self.username = environ.get("VSPHERE_USERNAME")
self.password = environ.get("VSPHERE_PASSWORD")
session = self.get_unverified_session() if not verify else None
self.credentials_present = bool(
self.server and self.username and self.password
)
if not self.credentials_present:
raise Exception(
"Environmental variables "
"'VSPHERE_IP', 'VSPHERE_USERNAME', "
"'VSPHERE_PASSWORD' are not set"
)
self.client = create_vsphere_client(
server=self.server,
username=self.username,
password=self.password,
session=session,
)
def get_unverified_session(self):
"""
Returns an unverified session object
"""
session = requests.session()
session.verify = False
requests.packages.urllib3.disable_warnings()
return session
def get_vm(self, instance_id):
"""
Returns the VM ID corresponding to the VM Name (instance_id)
If there are multiple matches, this only returns the first one
"""
names = set([instance_id])
vms = self.client.vcenter.VM.list(VM.FilterSpec(names=names))
if len(vms) == 0:
logging.info("VM with name ({}) not found", instance_id)
return None
vm = vms[0].vm
return vm
def release_instances(self, instance_id):
"""
Deletes the VM whose name is given by 'instance_id'
"""
vm = self.get_vm(instance_id)
if not vm:
raise Exception(
"VM with the name ({}) does not exist."
"Please create the vm first.".format(instance_id)
)
state = self.client.vcenter.vm.Power.get(vm)
if state == Power.Info(state=Power.State.POWERED_ON):
self.client.vcenter.vm.Power.stop(vm)
elif state == Power.Info(state=Power.State.SUSPENDED):
self.client.vcenter.vm.Power.start(vm)
self.client.vcenter.vm.Power.stop(vm)
self.client.vcenter.VM.delete(vm)
logging.info("Deleted VM -- '{}-({})'", instance_id, vm)
def reboot_instances(self, instance_id):
"""
Reboots the VM whose name is given by 'instance_id'.
@Returns: True if successful, or False if the VM is not powered on
"""
vm = self.get_vm(instance_id)
try:
self.client.vcenter.vm.Power.reset(vm)
logging.info("Reset VM -- '{}-({})'", instance_id, vm)
return True
except NotAllowedInCurrentState:
logging.info(
"VM '{}'-'({})' is not Powered On. Cannot reset it",
instance_id,
vm
)
return False
def stop_instances(self, instance_id):
"""
Stops the VM whose name is given by 'instance_id'.
@Returns: True if successful, or False if the VM is already powered off
"""
vm = self.get_vm(instance_id)
try:
self.client.vcenter.vm.Power.stop(vm)
logging.info("Stopped VM -- '{}-({})'", instance_id, vm)
return True
except AlreadyInDesiredState:
logging.info(
"VM '{}'-'({})' is already Powered Off", instance_id, vm
)
return False
def start_instances(self, instance_id):
"""
Starts the VM whose name is given by 'instance_id'.
@Returns: True if successful, or False if the VM is already powered on
"""
vm = self.get_vm(instance_id)
try:
self.client.vcenter.vm.Power.start(vm)
logging.info("Started VM -- '{}-({})'", instance_id, vm)
return True
except AlreadyInDesiredState:
logging.info(
"VM '{}'-'({})' is already Powered On", instance_id, vm
)
return False
def list_instances(self, datacenter):
"""
@Returns: a list of VMs present in the datacenter
"""
datacenter_filter = self.client.vcenter.Datacenter.FilterSpec(
names=set([datacenter])
)
datacenter_summaries = self.client.vcenter.Datacenter.list(
datacenter_filter
)
try:
datacenter_id = datacenter_summaries[0].datacenter
except IndexError:
logging.error("Datacenter '{}' doesn't exist", datacenter)
sys.exit(1)
vm_filter = self.client.vcenter.VM.FilterSpec(
datacenters={datacenter_id}
)
vm_summaries = self.client.vcenter.VM.list(vm_filter)
vm_names = []
for vm in vm_summaries:
vm_names.append({"vm_name": vm.name, "vm_id": vm.vm})
return vm_names
def get_datacenter_list(self):
"""
Returns a dictionary containing all the datacenter names and IDs
"""
datacenter_summaries = self.client.vcenter.Datacenter.list()
datacenter_names = [
{
"datacenter_id": datacenter.datacenter,
"datacenter_name": datacenter.name
}
for datacenter in datacenter_summaries
]
return datacenter_names
def get_datastore_list(self, datacenter=None):
"""
@Returns: a dictionary containing all the datastore names and
IDs belonging to a specific datacenter
"""
datastore_filter = self.client.vcenter.Datastore.FilterSpec(
datacenters={datacenter}
)
datastore_summaries = self.client.vcenter.Datastore.list(
datastore_filter
)
datastore_names = []
for datastore in datastore_summaries:
datastore_names.append(
{
"datastore_name": datastore.name,
"datastore_id": datastore.datastore
}
)
return datastore_names
def get_folder_list(self, datacenter=None):
"""
@Returns: a dictionary containing all the folder names and
IDs belonging to a specific datacenter
"""
folder_filter = self.client.vcenter.Folder.FilterSpec(
datacenters={datacenter}
)
folder_summaries = self.client.vcenter.Folder.list(folder_filter)
folder_names = []
for folder in folder_summaries:
folder_names.append(
{"folder_name": folder.name, "folder_id": folder.folder}
)
return folder_names
def get_resource_pool(self, datacenter, resource_pool_name=None):
"""
Returns the identifier of the resource pool with the given name or the
first resource pool in the datacenter if the name is not provided.
"""
names = set([resource_pool_name]) if resource_pool_name else None
filter_spec = ResourcePool.FilterSpec(
datacenters=set([datacenter]), names=names
)
resource_pool_summaries = self.client.vcenter.ResourcePool.list(
filter_spec
)
if len(resource_pool_summaries) > 0:
resource_pool = resource_pool_summaries[0].resource_pool
return resource_pool
else:
logging.error(
"ResourcePool not found in Datacenter '{}'",
datacenter
)
return None
def create_default_vm(self, guest_os="RHEL_7_64", max_attempts=10):
"""
Creates a default VM with 2 GB memory, 1 CPU and 16 GB disk space in a
random datacenter. Accepts the guest OS as a parameter. Since the VM
placement is random, it might fail due to resource constraints.
So, this function tries up to 'max_attempts' times to create the VM
"""
def create_vm(vm_name, resource_pool, folder, datastore, guest_os):
"""
Creates a VM and returns its ID and name. Requires the VM name,
resource pool name, folder name, datastore and the guest OS
"""
placement_spec = VM.PlacementSpec(
folder=folder, resource_pool=resource_pool, datastore=datastore
)
vm_create_spec = VM.CreateSpec(
name=vm_name, guest_os=guest_os, placement=placement_spec
)
vm_id = self.client.vcenter.VM.create(vm_create_spec)
return vm_id
for _ in range(max_attempts):
try:
datacenter_list = self.get_datacenter_list()
# random generator not used for
# security/cryptographic purposes in this loop
datacenter = random.choice(datacenter_list) # nosec
resource_pool = self.get_resource_pool(
datacenter["datacenter_id"]
)
folder = random.choice( # nosec
self.get_folder_list(datacenter["datacenter_id"])
)["folder_id"]
datastore = random.choice( # nosec
self.get_datastore_list(datacenter["datacenter_id"])
)["datastore_id"]
vm_name = "Test-" + str(time.time_ns())
return (
create_vm(
vm_name,
resource_pool,
folder,
datastore,
guest_os
),
vm_name,
)
except Exception as e:
logging.error(
"Default VM could not be created, retrying. "
"Error was: %s",
str(e)
)
logging.error(
"Default VM could not be created in %s attempts. "
"Check your VMware resources",
max_attempts
)
return None, None
def get_vm_status(self, instance_id):
"""
Returns the status of the VM whose name is given by 'instance_id'
"""
try:
vm = self.get_vm(instance_id)
state = self.client.vcenter.vm.Power.get(vm).state
logging.info("Check instance %s status", instance_id)
return state
except Exception as e:
logging.error(
"Failed to get node instance status %s. Encountered following "
"exception: %s.", instance_id, e
)
return None
def wait_until_released(self, instance_id, timeout):
"""
Waits until the VM is deleted or until the timeout. Returns True if
the VM is successfully deleted, else returns False
"""
time_counter = 0
vm = self.get_vm(instance_id)
while vm is not None:
vm = self.get_vm(instance_id)
logging.info(
"VM %s is still being deleted, "
"sleeping for 5 seconds",
instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
"VM %s is still not deleted in allotted time",
instance_id
)
return False
return True
def wait_until_running(self, instance_id, timeout):
"""
Waits until the VM switches to POWERED_ON state or until the timeout.
Returns True if the VM switches to POWERED_ON, else returns False
"""
time_counter = 0
status = self.get_vm_status(instance_id)
while status != Power.State.POWERED_ON:
status = self.get_vm_status(instance_id)
logging.info(
"VM %s is still not running, "
"sleeping for 5 seconds",
instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
"VM %s is still not ready in allotted time",
instance_id
)
return False
return True
def wait_until_stopped(self, instance_id, timeout):
"""
Waits until the VM switches to POWERED_OFF state or until the timeout.
Returns True if the VM switches to POWERED_OFF, else returns False
"""
time_counter = 0
status = self.get_vm_status(instance_id)
while status != Power.State.POWERED_OFF:
status = self.get_vm_status(instance_id)
logging.info(
"VM %s is still not running, "
"sleeping for 5 seconds",
instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
"VM %s is still not ready in allotted time",
instance_id
)
return False
return True
@dataclass
class Node:
name: str
@dataclass
class NodeScenarioSuccessOutput:
nodes: typing.Dict[int, Node] = field(
metadata={
"name": "Nodes started/stopped/terminated/rebooted",
"description": "Map between timestamps and the pods "
"started/stopped/terminated/rebooted. "
"The timestamp is provided in nanoseconds",
}
)
action: kube_helper.Actions = field(
metadata={
"name": "The action performed on the node",
"description": "The action performed or attempted to be "
"performed on the node. Possible values"
"are : Start, Stop, Terminate, Reboot",
}
)
@dataclass
class NodeScenarioErrorOutput:
error: str
action: kube_helper.Actions = field(
metadata={
"name": "The action performed on the node",
"description": "The action attempted to be performed on the node. "
"Possible values are : Start Stop, Terminate, Reboot",
}
)
@dataclass
class NodeScenarioConfig:
name: typing.Annotated[
typing.Optional[str],
validation.required_if_not("label_selector"),
validation.required_if("skip_openshift_checks"),
] = field(
default=None,
metadata={
"name": "Name",
"description": "Name(s) for target nodes. "
"Required if label_selector is not set.",
},
)
runs: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=1,
metadata={
"name": "Number of runs per node",
"description": "Number of times to inject each scenario under "
"actions (will perform on same node each time)",
},
)
label_selector: typing.Annotated[
typing.Optional[str],
validation.min(1),
validation.required_if_not("name")
] = field(
default=None,
metadata={
"name": "Label selector",
"description": "Kubernetes label selector for the target nodes. "
"Required if name is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
"for details.",
},
)
timeout: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=180,
metadata={
"name": "Timeout",
"description": "Timeout to wait for the target pod(s) "
"to be removed in seconds.",
},
)
instance_count: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=1,
metadata={
"name": "Instance Count",
"description": "Number of nodes to perform action/select "
"that match the label selector.",
},
)
skip_openshift_checks: typing.Optional[bool] = field(
default=False,
metadata={
"name": "Skip Openshift Checks",
"description": "Skip checking the status of the openshift nodes.",
},
)
verify_session: bool = field(
default=True,
metadata={
"name": "Verify API Session",
"description": "Verifies the vSphere client session. "
"It is enabled by default",
},
)
kubeconfig_path: typing.Optional[str] = field(
default=None,
metadata={
"name": "Kubeconfig path",
"description": "Path to your Kubeconfig file. "
"Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
"for details.",
},
)
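# Example configuration (illustrative values only):
#   NodeScenarioConfig(
#       label_selector="node-role.kubernetes.io/worker",
#       instance_count=1,
#       timeout=300,
#   )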
@plugin.step(
id="node_start_scenario",
name="Start the node",
description="Start the node(s) by starting the VMware VM "
"on which the node is configured",
outputs={
"success": NodeScenarioSuccessOutput,
"error": NodeScenarioErrorOutput
},
)
def node_start(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(
cfg,
kube_helper.Actions.START,
core_v1
)
nodes_started = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info("Starting node_start_scenario injection")
logging.info("Starting the node %s ", name)
vm_started = vsphere.start_instances(name)
if vm_started:
vsphere.wait_until_running(name, cfg.timeout)
if not cfg.skip_openshift_checks:
kube_helper.wait_for_ready_status(
name, cfg.timeout, watch_resource, core_v1
)
nodes_started[int(time.time_ns())] = Node(name=name)
logging.info(
"Node with instance ID: %s is in running state", name
)
logging.info(
"node_start_scenario has been successfully injected!"
)
except Exception as e:
logging.error("Failed to start node instance. Test Failed")
logging.error(
"node_start_scenario injection failed! "
"Error was: %s", str(e)
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.START
)
return "success", NodeScenarioSuccessOutput(
nodes_started, kube_helper.Actions.START
)
@plugin.step(
id="node_stop_scenario",
name="Stop the node",
description="Stop the node(s) by starting the VMware VM "
"on which the node is configured",
outputs={
"success": NodeScenarioSuccessOutput,
"error": NodeScenarioErrorOutput
},
)
def node_stop(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(
cfg,
kube_helper.Actions.STOP,
core_v1
)
nodes_stopped = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info("Starting node_stop_scenario injection")
logging.info("Stopping the node %s ", name)
vm_stopped = vsphere.stop_instances(name)
if vm_stopped:
vsphere.wait_until_stopped(name, cfg.timeout)
if not cfg.skip_openshift_checks:
kube_helper.wait_for_ready_status(
name, cfg.timeout, watch_resource, core_v1
)
nodes_stopped[int(time.time_ns())] = Node(name=name)
logging.info(
"Node with instance ID: %s is in stopped state", name
)
logging.info(
"node_stop_scenario has been successfully injected!"
)
except Exception as e:
logging.error("Failed to stop node instance. Test Failed")
logging.error(
"node_stop_scenario injection failed! "
"Error was: %s", str(e)
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.STOP
)
return "success", NodeScenarioSuccessOutput(
nodes_stopped, kube_helper.Actions.STOP
)
@plugin.step(
id="node_reboot_scenario",
name="Reboot VMware VM",
description="Reboot the node(s) by starting the VMware VM "
"on which the node is configured",
outputs={
"success": NodeScenarioSuccessOutput,
"error": NodeScenarioErrorOutput
},
)
def node_reboot(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(
cfg,
kube_helper.Actions.REBOOT,
core_v1
)
nodes_rebooted = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info("Starting node_reboot_scenario injection")
logging.info("Rebooting the node %s ", name)
vsphere.reboot_instances(name)
if not cfg.skip_openshift_checks:
kube_helper.wait_for_unknown_status(
name, cfg.timeout, watch_resource, core_v1
)
kube_helper.wait_for_ready_status(
name, cfg.timeout, watch_resource, core_v1
)
nodes_rebooted[int(time.time_ns())] = Node(name=name)
logging.info(
"Node with instance ID: %s has rebooted "
"successfully", name
)
logging.info(
"node_reboot_scenario has been successfully injected!"
)
except Exception as e:
logging.error("Failed to reboot node instance. Test Failed")
logging.error(
"node_reboot_scenario injection failed! "
"Error was: %s", str(e)
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.REBOOT
)
return "success", NodeScenarioSuccessOutput(
nodes_rebooted, kube_helper.Actions.REBOOT
)
@plugin.step(
id="node_terminate_scenario",
name="Reboot VMware VM",
description="Wait for the specified number of pods to be present",
outputs={
"success": NodeScenarioSuccessOutput,
"error": NodeScenarioErrorOutput
},
)
def node_terminate(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
vsphere = vSphere(verify=cfg.verify_session)
core_v1 = client.CoreV1Api(cli)
node_list = kube_helper.get_node_list(
cfg, kube_helper.Actions.TERMINATE, core_v1
)
nodes_terminated = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info(
"Starting node_termination_scenario injection "
"by first stopping the node"
)
vsphere.stop_instances(name)
vsphere.wait_until_stopped(name, cfg.timeout)
logging.info(
"Releasing the node with instance ID: %s ", name
)
vsphere.release_instances(name)
vsphere.wait_until_released(name, cfg.timeout)
nodes_terminated[int(time.time_ns())] = Node(name=name)
logging.info(
"Node with instance ID: %s has been released", name
)
logging.info(
"node_terminate_scenario has been "
"successfully injected!"
)
except Exception as e:
logging.error("Failed to terminate node instance. Test Failed")
logging.error(
"node_terminate_scenario injection failed! "
"Error was: %s", str(e)
)
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.TERMINATE
)
return "success", NodeScenarioSuccessOutput(
nodes_terminated, kube_helper.Actions.TERMINATE
)

View File

@@ -1,5 +1,8 @@
import logging
import kraken.invoke.command as runcommand
from arcaflow_plugin_sdk import serialization
from kraken.plugins import pod_plugin
import kraken.cerberus.setup as cerberus
import kraken.post_actions.actions as post_actions
import kraken.kubernetes.client as kubecli
@@ -20,20 +23,30 @@ def run(kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_dur
try:
# capture start time
start_time = int(time.time())
scenario_logs = runcommand.invoke(
"powerfulseal autonomous --use-pod-delete-instead-"
"of-ssh-kill --policy-file %s --kubeconfig %s "
"--no-cloud --inventory-kubernetes --headless" % (pod_scenario[0], kubeconfig_path)
)
input = serialization.load_from_file(pod_scenario)
s = pod_plugin.get_schema()
input_data: pod_plugin.KillPodConfig = s.unserialize_input("pod", input)
if kubeconfig_path is not None:
input_data.kubeconfig_path = kubeconfig_path
output_id, output_data = s.call_step("pod", input_data)
if output_id == "error":
data: pod_plugin.PodErrorOutput = output_data
logging.error("Failed to run pod scenario: {}".format(data.error))
else:
data: pod_plugin.PodKillSuccessOutput = output_data
for pod in data.pods.values():
print("Deleted pod {} in namespace {}\n".format(pod.name, pod.namespace))
except Exception as e:
logging.error(
"Failed to run scenario: %s. Encountered the following " "exception: %s" % (pod_scenario[0], e)
)
sys.exit(1)
# Display pod scenario logs/actions
print(scenario_logs)
logging.info("Scenario: %s has been successfully injected!" % (pod_scenario[0]))
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
@@ -119,14 +132,13 @@ def container_killing_in_pod(cont_scenario):
container_pod_list = []
for pod in pods:
if type(pod) == list:
container_names = runcommand.invoke(
'kubectl get pods %s -n %s -o jsonpath="{.spec.containers[*].name}"' % (pod[0], pod[1])
).split(" ")
pod_output = kubecli.get_pod_info(pod[0], pod[1])
container_names = [container.name for container in pod_output.containers]
container_pod_list.append([pod[0], pod[1], container_names])
else:
container_names = runcommand.invoke(
'oc get pods %s -n %s -o jsonpath="{.spec.containers[*].name}"' % (pod, namespace)
).split(" ")
pod_output = kubecli.get_pod_info(pod, namespace)
container_names = [container.name for container in pod_output.containers]
container_pod_list.append([pod, namespace, container_names])
killed_count = 0
@@ -176,13 +188,11 @@ def check_failed_containers(killed_container_list, wait_time):
while timer <= wait_time:
for killed_container in killed_container_list:
# pod namespace contain name
pod_output = runcommand.invoke(
"kubectl get pods %s -n %s -o yaml" % (killed_container[0], killed_container[1])
)
pod_output_yaml = yaml.full_load(pod_output)
for statuses in pod_output_yaml["status"]["containerStatuses"]:
if statuses["name"] == killed_container[2]:
if str(statuses["ready"]).lower() == "true":
pod_output = kubecli.get_pod_info(killed_container[0], killed_container[1])
for container in pod_output.containers:
if container.name == killed_container[2]:
if container.ready:
container_ready.append(killed_container)
if len(container_ready) != 0:
for item in container_ready:

View File

@@ -5,21 +5,7 @@ import kraken.invoke.command as runcommand
def run(kubeconfig_path, scenario, pre_action_output=""):
if scenario.endswith(".yaml") or scenario.endswith(".yml"):
action_output = runcommand.invoke(
"powerfulseal autonomous "
"--use-pod-delete-instead-of-ssh-kill"
" --policy-file %s --kubeconfig %s --no-cloud"
" --inventory-kubernetes --headless" % (scenario, kubeconfig_path)
)
# read output to make sure no error
if "ERROR" in action_output:
action_output.split("ERROR")[1].split("\n")[0]
if not pre_action_output:
logging.info("Powerful seal pre action check failed for " + str(scenario))
return False
else:
logging.info(scenario + " post action checks passed")
logging.error("Powerfulseal support has recently been removed. Please switch to using plugins instead.")
elif scenario.endswith(".py"):
action_output = runcommand.invoke("python3 " + scenario).strip()
if pre_action_output:

View File

@@ -9,5 +9,8 @@ def instance(distribution, prometheus_url, prometheus_bearer_token):
)
prometheus_url = "https://" + url
if distribution == "openshift" and not prometheus_bearer_token:
prometheus_bearer_token = runcommand.invoke("oc -n openshift-monitoring " "sa get-token prometheus-k8s")
prometheus_bearer_token = runcommand.invoke(
"oc -n openshift-monitoring sa get-token prometheus-k8s "
"|| oc create token -n openshift-monitoring prometheus-k8s"
)
return prometheus_url, prometheus_bearer_token

View File

@@ -1,17 +1,19 @@
import sys
import yaml
import re
import json
import logging
import random
import re
import sys
import time
import kraken.cerberus.setup as cerberus
import kraken.kubernetes.client as kubecli
import kraken.invoke.command as runcommand
# Reads the scenario config and creates a temp file to fill up the PVC
import yaml
from ..cerberus import setup as cerberus
from ..kubernetes import client as kubecli
def run(scenarios_list, config):
"""
Reads the scenario config and creates a temp file to fill up the PVC
"""
failed_post_scenarios = ""
for app_config in scenarios_list:
if len(app_config) > 1:
@@ -21,169 +23,265 @@ def run(scenarios_list, config):
pvc_name = scenario_config.get("pvc_name", "")
pod_name = scenario_config.get("pod_name", "")
namespace = scenario_config.get("namespace", "")
target_fill_percentage = scenario_config.get("fill_percentage", "50")
target_fill_percentage = scenario_config.get(
"fill_percentage", "50"
)
duration = scenario_config.get("duration", 60)
logging.info(
"""Input params:
pvc_name: '%s'\npod_name: '%s'\nnamespace: '%s'\ntarget_fill_percentage: '%s%%'\nduration: '%ss'"""
% (str(pvc_name), str(pod_name), str(namespace), str(target_fill_percentage), str(duration))
"Input params:\n"
"pvc_name: '%s'\n"
"pod_name: '%s'\n"
"namespace: '%s'\n"
"target_fill_percentage: '%s%%'\nduration: '%ss'"
% (
str(pvc_name),
str(pod_name),
str(namespace),
str(target_fill_percentage),
str(duration)
)
)
# Check input params
if namespace is None:
logging.error("You must specify the namespace where the PVC is")
logging.error(
"You must specify the namespace where the PVC is"
)
sys.exit(1)
if pvc_name is None and pod_name is None:
logging.error("You must specify the pvc_name or the pod_name")
logging.error(
"You must specify the pvc_name or the pod_name"
)
sys.exit(1)
if pvc_name and pod_name:
logging.info(
"pod_name will be ignored, pod_name used will be a retrieved from the pod used in the pvc_name"
"pod_name will be ignored, pod_name used will be "
"a retrieved from the pod used in the pvc_name"
)
# Get pod name
if pvc_name:
if pod_name:
logging.info(
"pod_name '%s' will be overridden from the pod mounted in the PVC" % (str(pod_name))
"pod_name '%s' will be overridden with one of "
"the pods mounted in the PVC" % (str(pod_name))
)
command = "kubectl describe pvc %s -n %s | grep -E 'Mounted By:|Used By:' | grep -Eo '[^: ]*$'" % (
str(pvc_name),
str(namespace),
)
logging.debug("Get pod name command:\n %s" % command)
pod_name = runcommand.invoke(command, 60).rstrip()
logging.info("Pod name: %s" % pod_name)
if pod_name == "<none>":
pvc = kubecli.get_pvc_info(pvc_name, namespace)
try:
# random generator not used for
# security/cryptographic purposes.
pod_name = random.choice(pvc.podNames) # nosec
logging.info("Pod name: %s" % pod_name)
except Exception:
logging.error(
"Pod associated with %s PVC, on namespace %s, not found" % (str(pvc_name), str(namespace))
"Pod associated with %s PVC, on namespace %s, "
"not found" % (str(pvc_name), str(namespace))
)
sys.exit(1)
# Get volume name
command = 'kubectl get pods %s -n %s -o json | jq -r ".spec.volumes"' % (
str(pod_name),
str(namespace),
)
logging.debug("Get mount path command:\n %s" % command)
volumes_list = runcommand.invoke(command, 60).rstrip()
volumes_list_json = json.loads(volumes_list)
for entry in volumes_list_json:
if len(entry["persistentVolumeClaim"]["claimName"]) > 0:
volume_name = entry["name"]
pvc_name = entry["persistentVolumeClaim"]["claimName"]
pod = kubecli.get_pod_info(name=pod_name, namespace=namespace)
if pod is None:
logging.error(
"Exiting as pod '%s' doesn't exist "
"in namespace '%s'" % (
str(pod_name),
str(namespace)
)
)
sys.exit(1)
for volume in pod.volumes:
if volume.pvcName is not None:
volume_name = volume.name
pvc_name = volume.pvcName
pvc = kubecli.get_pvc_info(pvc_name, namespace)
break
if 'pvc' not in locals():
logging.error(
"Pod '%s' in namespace '%s' does not use a pvc" % (
str(pod_name),
str(namespace)
)
)
sys.exit(1)
logging.info("Volume name: %s" % volume_name)
logging.info("PVC name: %s" % pvc_name)
# Get container name and mount path
command = 'kubectl get pods %s -n %s -o json | jq -r ".spec.containers"' % (
str(pod_name),
str(namespace),
)
logging.debug("Get mount path command:\n %s" % command)
volume_mounts_list = runcommand.invoke(command, 60).rstrip().replace("\n]\n[\n", ",\n")
volume_mounts_list_json = json.loads(volume_mounts_list)
for entry in volume_mounts_list_json:
for vol in entry["volumeMounts"]:
if vol["name"] == volume_name:
mount_path = vol["mountPath"]
container_name = entry["name"]
for container in pod.containers:
for vol in container.volumeMounts:
if vol.name == volume_name:
mount_path = vol.mountPath
container_name = container.name
break
logging.info("Container path: %s" % container_name)
logging.info("Mount path: %s" % mount_path)
# Get PVC capacity
command = "kubectl describe pvc %s -n %s | grep \"Capacity:\" | grep -Eo '[^: ]*$'" % (
str(pvc_name),
str(namespace),
)
pvc_capacity = runcommand.invoke(
command,
60,
).rstrip()
logging.debug("Get PVC capacity command:\n %s" % command)
pvc_capacity_bytes = toKbytes(pvc_capacity)
logging.info("PVC capacity: %s KB" % pvc_capacity_bytes)
# Get used bytes in PVC
command = "du -sk %s | grep -Eo '^[0-9]*'" % (str(mount_path))
logging.debug("Get used bytes in PVC command:\n %s" % command)
pvc_used = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
logging.info("PVC used: %s KB" % pvc_used)
# Get PVC capacity and used bytes
command = "df %s -B 1024 | sed 1d" % (str(mount_path))
command_output = (
kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name,
"sh"
)
).split()
pvc_used_kb = int(command_output[2])
pvc_capacity_kb = pvc_used_kb + int(command_output[3])
logging.info("PVC used: %s KB" % pvc_used_kb)
logging.info("PVC capacity: %s KB" % pvc_capacity_kb)
# Check valid fill percentage
current_fill_percentage = float(pvc_used) / float(pvc_capacity_bytes)
if not (current_fill_percentage * 100 < float(target_fill_percentage) <= 99):
current_fill_percentage = pvc_used_kb / pvc_capacity_kb
if not (
current_fill_percentage * 100
< float(target_fill_percentage)
<= 99
):
logging.error(
"""
Target fill percentage (%.2f%%) is lower than current fill percentage (%.2f%%)
or higher than 99%%
"""
% (target_fill_percentage, current_fill_percentage * 100)
"Target fill percentage (%.2f%%) is lower than "
"current fill percentage (%.2f%%) "
"or higher than 99%%" % (
target_fill_percentage,
current_fill_percentage * 100
)
)
sys.exit(1)
# Calculate file size
file_size = int((float(target_fill_percentage / 100) * float(pvc_capacity_bytes)) - float(pvc_used))
logging.debug("File size: %s KB" % file_size)
file_size_kb = int(
(
float(target_fill_percentage) / 100
* float(pvc_capacity_kb)
) - float(pvc_used_kb)
)
logging.debug("File size: %s KB" % file_size_kb)
file_name = "kraken.tmp"
logging.info(
"Creating %s file, %s KB size, in pod %s at %s (ns %s)"
% (str(file_name), str(file_size), str(pod_name), str(mount_path), str(namespace))
% (
str(file_name),
str(file_size_kb),
str(pod_name),
str(mount_path),
str(namespace)
)
)
start_time = int(time.time())
# Create temp file in the PVC
full_path = "%s/%s" % (str(mount_path), str(file_name))
command = "dd bs=1024 count=%s </dev/urandom >%s" % (str(file_size), str(full_path))
logging.debug("Create temp file in the PVC command:\n %s" % command)
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
logging.info("\n" + str(response))
command = "fallocate -l $((%s*1024)) %s" % (
str(file_size_kb),
str(full_path)
)
logging.debug(
"Create temp file in the PVC command:\n %s" % command
)
kubecli.exec_cmd_in_pod(
command, pod_name, namespace, container_name, "sh"
)
# Check if file is created
command = "ls %s" % (str(mount_path))
command = "ls -lh %s" % (str(mount_path))
logging.debug("Check file is created command:\n %s" % command)
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
response = kubecli.exec_cmd_in_pod(
command, pod_name, namespace, container_name, "sh"
)
logging.info("\n" + str(response))
if str(file_name).lower() in str(response).lower():
logging.info("%s file successfully created" % (str(full_path)))
logging.info(
"%s file successfully created" % (str(full_path))
)
else:
logging.error("Failed to create tmp file with %s size" % (str(file_size)))
remove_temp_file(file_name, full_path, pod_name, namespace, container_name, mount_path, file_size)
logging.error(
"Failed to create tmp file with %s size" % (
str(file_size_kb)
)
)
remove_temp_file(
file_name,
full_path,
pod_name,
namespace,
container_name,
mount_path,
file_size_kb
)
sys.exit(1)
# Wait for the specified duration
logging.info("Waiting for the specified duration in the config: %ss" % (duration))
logging.info(
"Waiting for the specified duration in the config: %ss" % (
duration
)
)
time.sleep(duration)
logging.info("Finish waiting")
remove_temp_file(file_name, full_path, pod_name, namespace, container_name, mount_path, file_size)
remove_temp_file(
file_name,
full_path,
pod_name,
namespace,
container_name,
mount_path,
file_size_kb
)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
def remove_temp_file(file_name, full_path, pod_name, namespace, container_name, mount_path, file_size):
command = "rm %s" % (str(full_path))
def remove_temp_file(
file_name,
full_path,
pod_name,
namespace,
container_name,
mount_path,
file_size_kb
):
command = "rm -f %s" % (str(full_path))
logging.debug("Remove temp file from the PVC command:\n %s" % command)
kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
command = "ls %s" % (str(mount_path))
command = "ls -lh %s" % (str(mount_path))
logging.debug("Check temp file is removed command:\n %s" % command)
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
response = kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name,
"sh"
)
logging.info("\n" + str(response))
if not (str(file_name).lower() in str(response).lower()):
logging.info("Temp file successfully removed")
else:
logging.error("Failed to delete tmp file with %s size" % (str(file_size)))
logging.error(
"Failed to delete tmp file with %s size" % (str(file_size_kb))
)
sys.exit(1)
def toKbytes(value):
if not re.match("^[0-9]+[K|M|G|T]i$", value):
logging.error("PVC capacity %s does not match expression regexp '^[0-9]+[K|M|G|T]i$'")
logging.error(
"PVC capacity %s does not match expression "
"regexp '^[0-9]+[K|M|G|T]i$'"
)
sys.exit(1)
unit = {"K": 0, "M": 1, "G": 2, "T": 3}
base = 1024 if ("i" in value) else 1000

View File

@@ -1,16 +0,0 @@
from typing import List, Dict
from kraken.scenarios.base import Scenario
from kraken.scenarios.runner import ScenarioRunnerConfig
class Loader:
def __init__(self, scenarios: List[Scenario]):
self.scenarios = scenarios
def load(self, data: Dict) -> ScenarioRunnerConfig:
"""
This function loads data from a dictionary and produces a scenario runner config. It uses the scenarios provided
when instantiating the loader.
"""

View File

@@ -1,28 +0,0 @@
from dataclasses import dataclass
from typing import List
from kraken.scenarios import base
from kraken.scenarios.health import HealthChecker
@dataclass
class ScenarioRunnerConfig:
iterations: int
steps: List[base.ScenarioConfig]
class ScenarioRunner:
"""
This class provides the services to load a scenario configuration and iterate over the scenarios, while
observing the health checks.
"""
def __init__(self, scenarios: List[base.Scenario], health_checker: HealthChecker):
self._scenarios = scenarios
self._health_checker = health_checker
def run(self, config: ScenarioRunnerConfig):
"""
This function runs a list of scenarios described in the configuration.
"""

View File

@@ -1,61 +0,0 @@
from typing import TypeVar, Generic, Dict
from kraken.scenarios.kube import Client
from abc import ABC, abstractmethod
from dataclasses import dataclass
@dataclass
class ScenarioConfig(ABC):
"""
ScenarioConfig is a generic base class for configurations for individual scenarios. Each scenario should define
its own configuration classes.
"""
@abstractmethod
def from_dict(self, data: Dict) -> None:
"""
from_dict loads the configuration from a dict. It is mainly used to load JSON data into the scenario
configuration.
"""
@abstractmethod
def validate(self) -> None:
"""
validate is a function that validates all data on the scenario configuration. If the scenario configuration
is invalid an Exception should be thrown.
"""
pass
T = TypeVar('T', bound=ScenarioConfig)
class Scenario(Generic[T]):
"""
Scenario is a generic base class that provides a uniform run function to call in a loop. Scenario implementations
should extend this class and accept their configuration via their initializer.
"""
@staticmethod
def create_config(self) -> T:
"""
create_config creates a new copy of the configuration structure that allows loading data from a dictionary
and validating it.
"""
pass
def run(self, kube: Client, config: T) -> None:
"""
run is a function that is called when the scenario should be run. A Kubernetes client implementation will be
passed. The scenario should execute and return immediately. If the scenario fails, an Exception should be
thrown.
"""
pass
class TimeoutException(Exception):
"""
TimeoutException is an exception thrown when a scenario has a timeout waiting for a condition to happen.
"""
pass

View File

@@ -1,96 +0,0 @@
import logging
import random
import re
import time
from dataclasses import dataclass
from typing import Dict, List
from kraken.scenarios import base
from kraken.scenarios.base import ScenarioConfig, Scenario
from kraken.scenarios.kube import Client, Pod, NotFoundException
@dataclass
class PodScenarioConfig(ScenarioConfig):
"""
PodScenarioConfig is a configuration structure specific to pod scenarios. It describes which pod from which
namespace(s) to select for killing and how many pods to kill.
"""
name_pattern: str
namespace_pattern: str
label_selector: str
kill: int
def from_dict(self, data: Dict) -> None:
self.name_pattern = data.get("name_pattern")
self.namespace_pattern = data.get("namespace_pattern")
self.label_selector = data.get("label_selector")
self.kill = data.get("kill")
def validate(self) -> None:
re.compile(self.name_pattern)
re.compile(self.namespace_pattern)
if self.kill < 1:
raise Exception("Invalid value for 'kill': %d" % self.kill)
def namespace_regexp(self) -> re.Pattern:
return re.compile(self.namespace_pattern)
def name_regexp(self) -> re.Pattern:
return re.compile(self.name_pattern)
class PodScenario(Scenario[PodScenarioConfig]):
"""
PodScenario is a scenario that tests the stability of a Kubernetes cluster by killing one or more pods based on the
PodScenarioConfig.
"""
def __init__(self, logger: logging.Logger):
self.logger = logger
def create_config(self) -> PodScenarioConfig:
return PodScenarioConfig(
name_pattern=".*",
namespace_pattern=".*",
label_selector="",
kill=1,
)
def run(self, kube: Client, config: PodScenarioConfig):
pod_candidates: List[Pod] = []
namespace_re = config.namespace_regexp()
name_re = config.name_regexp()
self.logger.info("Listing all pods to determine viable pods to kill...")
for pod in kube.list_all_pods(label_selector=config.label_selector):
if namespace_re.match(pod.namespace) and name_re.match(pod.name):
pod_candidates.append(pod)
random.shuffle(pod_candidates)
removed_pod: List[Pod] = []
pods_to_kill = min(config.kill, len(pod_candidates))
self.logger.info("Killing %d pods...", pods_to_kill)
for i in range(pods_to_kill):
pod = pod_candidates[i]
self.logger.info("Killing pod %s...", pod.name)
removed_pod.append(pod)
kube.remove_pod(pod.name, pod.namespace)
self.logger.info("Waiting for pods to be removed...")
for i in range(60):
time.sleep(1)
for pod in removed_pod:
try:
kube.get_pod(pod.name, pod.namespace)
self.logger.info("Pod %s still exists...", pod.name)
except NotFoundException:
self.logger.info("Pod %s is now removed.", pod.name)
removed_pod.remove(pod)
if len(removed_pod) == 0:
self.logger.info("All pods removed, pod scenario complete.")
return
self.logger.warning("Timeout waiting for pods to be removed.")
raise base.TimeoutException("Timeout while waiting for pods to be removed.")

View File

@@ -1,43 +0,0 @@
import logging
import sys
import unittest
from kraken.scenarios import kube
from kraken.scenarios.kube import Client, NotFoundException
from kraken.scenarios.pod import PodScenario
class TestPodScenario(unittest.TestCase):
def test_run(self):
"""
This test creates a test pod and then runs the pod scenario restricting the run to that specific pod.
"""
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
c = Client()
test_pod = c.create_test_pod()
self.addCleanup(lambda: self._remove_test_pod(c, test_pod.name, test_pod.namespace))
scenario = PodScenario(logging.getLogger(__name__))
config = scenario.create_config()
config.kill = 1
config.name_pattern = test_pod.name
config.namespace_pattern = test_pod.namespace
scenario.run(c, config)
try:
c.get_pod(test_pod.name)
self.fail("Getting the pod after a pod scenario run should result in a NotFoundException.")
except NotFoundException:
return
@staticmethod
def _remove_test_pod(c: kube.Client, pod_name: str, pod_namespace: str):
try:
c.remove_pod(pod_name, pod_namespace)
except NotFoundException:
pass
if __name__ == '__main__':
unittest.main()

View File

@@ -5,13 +5,14 @@ import yaml
import logging
import time
from multiprocessing.pool import ThreadPool
import kraken.cerberus.setup as cerberus
import kraken.kubernetes.client as kubecli
import kraken.post_actions.actions as post_actions
from kraken.node_actions.aws_node_scenarios import AWS
from kraken.node_actions.openstack_node_scenarios import OPENSTACKCLOUD
from kraken.node_actions.az_node_scenarios import Azure
from kraken.node_actions.gcp_node_scenarios import GCP
from ..cerberus import setup as cerberus
from ..kubernetes import client as kubecli
from ..post_actions import actions as post_actions
from ..node_actions.aws_node_scenarios import AWS
from ..node_actions.openstack_node_scenarios import OPENSTACKCLOUD
from ..node_actions.az_node_scenarios import Azure
from ..node_actions.gcp_node_scenarios import GCP
def multiprocess_nodes(cloud_object_function, nodes):
@@ -53,7 +54,10 @@ def cluster_shut_down(shut_down_config):
elif cloud_type.lower() in ["azure", "az"]:
cloud_object = Azure()
else:
logging.error("Cloud type " + cloud_type + " is not currently supported for cluster shut down")
logging.error(
"Cloud type %s is not currently supported for cluster shut down" %
cloud_type
)
sys.exit(1)
nodes = kubecli.list_nodes()
@@ -70,17 +74,28 @@ def cluster_shut_down(shut_down_config):
while len(stopping_nodes) > 0:
for node in stopping_nodes:
if type(node) is tuple:
node_status = cloud_object.wait_until_stopped(node[1], node[0], timeout)
node_status = cloud_object.wait_until_stopped(
node[1],
node[0],
timeout
)
else:
node_status = cloud_object.wait_until_stopped(node, timeout)
node_status = cloud_object.wait_until_stopped(
node,
timeout
)
# Only want to remove node from stopping list when fully stopped/no error
# Only want to remove node from stopping list
# when fully stopped/no error
if node_status:
stopped_nodes.remove(node)
stopping_nodes = stopped_nodes.copy()
logging.info("Shutting down the cluster for the specified duration: %s" % (shut_down_duration))
logging.info(
"Shutting down the cluster for the specified duration: %s" %
(shut_down_duration)
)
time.sleep(shut_down_duration)
logging.info("Restarting the nodes")
restarted_nodes = set(node_id)
@@ -90,13 +105,22 @@ def cluster_shut_down(shut_down_config):
while len(not_running_nodes) > 0:
for node in not_running_nodes:
if type(node) is tuple:
node_status = cloud_object.wait_until_running(node[1], node[0], timeout)
node_status = cloud_object.wait_until_running(
node[1],
node[0],
timeout
)
else:
node_status = cloud_object.wait_until_running(node, timeout)
node_status = cloud_object.wait_until_running(
node,
timeout
)
if node_status:
restarted_nodes.remove(node)
not_running_nodes = restarted_nodes.copy()
logging.info("Waiting for 150s to allow cluster component initialization")
logging.info(
"Waiting for 150s to allow cluster component initialization"
)
time.sleep(150)
logging.info("Successfully injected cluster_shut_down scenario!")
@@ -111,13 +135,21 @@ def run(scenarios_list, config, wait_duration):
pre_action_output = ""
with open(shut_down_config[0], "r") as f:
shut_down_config_yaml = yaml.full_load(f)
shut_down_config_scenario = shut_down_config_yaml["cluster_shut_down_scenario"]
shut_down_config_scenario = \
shut_down_config_yaml["cluster_shut_down_scenario"]
start_time = int(time.time())
cluster_shut_down(shut_down_config_scenario)
logging.info("Waiting for the specified duration: %s" % (wait_duration))
logging.info(
"Waiting for the specified duration: %s" % (wait_duration)
)
time.sleep(wait_duration)
failed_post_scenarios = post_actions.check_recovery(
"", shut_down_config, failed_post_scenarios, pre_action_output
)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
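
As a rough sketch (not taken verbatim from this diff), a cluster shutdown scenario file carries the `cluster_shut_down_scenario` key read above plus the cloud type and durations the code uses; the key names for the duration and timeout values are assumptions.

```
# hypothetical cluster shutdown scenario
cluster_shut_down_scenario:
  cloud_type: aws            # aws, gcp, openstack or azure/az per the branches above
  shut_down_duration: 120    # assumed key name, seconds the nodes stay stopped
  timeout: 600               # assumed key name, per-node wait for stop/start
```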

View File

@@ -1,23 +1,32 @@
import datetime
import time
import logging
import kraken.invoke.command as runcommand
import kraken.kubernetes.client as kubecli
import re
import sys
import kraken.cerberus.setup as cerberus
import yaml
import random
from ..cerberus import setup as cerberus
from ..kubernetes import client as kubecli
from ..invoke import command as runcommand
def pod_exec(pod_name, command, namespace, container_name):
i = 0
for i in range(5):
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name)
response = kubecli.exec_cmd_in_pod(
command,
pod_name,
namespace,
container_name
)
if not response:
time.sleep(2)
continue
elif "unauthorized" in response.lower() or "authorization" in response.lower():
elif (
"unauthorized" in response.lower() or
"authorization" in response.lower()
):
time.sleep(2)
continue
else:
@@ -26,7 +35,9 @@ def pod_exec(pod_name, command, namespace, container_name):
def node_debug(node_name, command):
response = runcommand.invoke("oc debug node/" + node_name + " -- chroot /host " + command)
response = runcommand.invoke(
"oc debug node/" + node_name + " -- chroot /host " + command
)
return response
@@ -37,9 +48,18 @@ def get_container_name(pod_name, namespace, container_name=""):
if container_name in container_names:
return container_name
else:
logging.error("Container name %s not an existing container in pod %s" % (container_name, pod_name))
logging.error(
"Container name %s not an existing container in pod %s" % (
container_name,
pod_name
)
)
else:
container_name = container_names[random.randint(0, len(container_names) - 1)]
container_name = container_names[
# random module here is not used for security/cryptographic
# purposes
random.randint(0, len(container_names) - 1) # nosec
]
return container_name
@@ -55,7 +75,10 @@ def skew_time(scenario):
node_names = []
if "object_name" in scenario.keys() and scenario["object_name"]:
node_names = scenario["object_name"]
elif "label_selector" in scenario.keys() and scenario["label_selector"]:
elif (
"label_selector" in scenario.keys() and
scenario["label_selector"]
):
node_names = kubecli.list_nodes(scenario["label_selector"])
for node in node_names:
@@ -75,44 +98,79 @@ def skew_time(scenario):
elif "namespace" in scenario.keys() and scenario["namespace"]:
if "label_selector" not in scenario.keys():
logging.info(
"label_selector key not found, querying for all the pods in namespace: %s" % (scenario["namespace"])
"label_selector key not found, querying for all the pods "
"in namespace: %s" % (scenario["namespace"])
)
pod_names = kubecli.list_pods(scenario["namespace"])
else:
logging.info(
"Querying for the pods matching the %s label_selector in namespace %s"
"Querying for the pods matching the %s label_selector "
"in namespace %s"
% (scenario["label_selector"], scenario["namespace"])
)
pod_names = kubecli.list_pods(scenario["namespace"], scenario["label_selector"])
pod_names = kubecli.list_pods(
scenario["namespace"],
scenario["label_selector"]
)
counter = 0
for pod_name in pod_names:
pod_names[counter] = [pod_name, scenario["namespace"]]
counter += 1
elif "label_selector" in scenario.keys() and scenario["label_selector"]:
elif (
"label_selector" in scenario.keys() and
scenario["label_selector"]
):
pod_names = kubecli.get_all_pods(scenario["label_selector"])
if len(pod_names) == 0:
logging.info("Cannot find pods matching the namespace/label_selector, please check")
logging.info(
"Cannot find pods matching the namespace/label_selector, "
"please check"
)
sys.exit(1)
pod_counter = 0
for pod in pod_names:
if len(pod) > 1:
selected_container_name = get_container_name(pod[0], pod[1], container_name)
pod_exec_response = pod_exec(pod[0], skew_command, pod[1], selected_container_name)
selected_container_name = get_container_name(
pod[0],
pod[1],
container_name
)
pod_exec_response = pod_exec(
pod[0],
skew_command,
pod[1],
selected_container_name
)
if pod_exec_response is False:
logging.error(
"Couldn't reset time on container %s in pod %s in namespace %s"
"Couldn't reset time on container %s "
"in pod %s in namespace %s"
% (selected_container_name, pod[0], pod[1])
)
sys.exit(1)
pod_names[pod_counter].append(selected_container_name)
else:
selected_container_name = get_container_name(pod, scenario["namespace"], container_name)
pod_exec_response = pod_exec(pod, skew_command, scenario["namespace"], selected_container_name)
selected_container_name = get_container_name(
pod,
scenario["namespace"],
container_name
)
pod_exec_response = pod_exec(
pod,
skew_command,
scenario["namespace"],
selected_container_name
)
if pod_exec_response is False:
logging.error(
"Couldn't reset time on container %s in pod %s in namespace %s"
% (selected_container_name, pod, scenario["namespace"])
"Couldn't reset time on container "
"%s in pod %s in namespace %s"
% (
selected_container_name,
pod,
scenario["namespace"]
)
)
sys.exit(1)
pod_names[pod_counter].append(selected_container_name)
@@ -128,8 +186,9 @@ def parse_string_date(obj_datetime):
obj_datetime = re.sub(r"\s\s+", " ", obj_datetime).strip()
logging.info("Obj_date sub time " + str(obj_datetime))
date_line = re.match(
r"[\s\S\n]*\w{3} \w{3} \d{1,} \d{2}:\d{2}:\d{2} \w{3} \d{4}[\s\S\n]*", obj_datetime
) # noqa
r"[\s\S\n]*\w{3} \w{3} \d{1,} \d{2}:\d{2}:\d{2} \w{3} \d{4}[\s\S\n]*", # noqa
obj_datetime
)
if date_line is not None:
search_response = date_line.group().strip()
logging.info("Search response: " + str(search_response))
@@ -137,7 +196,9 @@ def parse_string_date(obj_datetime):
else:
return ""
except Exception as e:
logging.info("Exception %s when trying to parse string to date" % str(e))
logging.info(
"Exception %s when trying to parse string to date" % str(e)
)
return ""
@@ -145,7 +206,10 @@ def parse_string_date(obj_datetime):
def string_to_date(obj_datetime):
obj_datetime = parse_string_date(obj_datetime)
try:
date_time_obj = datetime.datetime.strptime(obj_datetime, "%a %b %d %H:%M:%S %Z %Y")
date_time_obj = datetime.datetime.strptime(
obj_datetime,
"%a %b %d %H:%M:%S %Z %Y"
)
return date_time_obj
except Exception:
logging.info("Couldn't parse string to datetime object")
@@ -162,36 +226,66 @@ def check_date_time(object_type, names):
node_datetime_string = node_debug(node_name, skew_command)
node_datetime = string_to_date(node_datetime_string)
counter = 0
while not first_date_time < node_datetime < datetime.datetime.utcnow():
while not (
first_date_time < node_datetime < datetime.datetime.utcnow()
):
time.sleep(10)
logging.info("Date/time on node %s still not reset, waiting 10 seconds and retrying" % node_name)
logging.info(
"Date/time on node %s still not reset, "
"waiting 10 seconds and retrying" % node_name
)
node_datetime_string = node_debug(node_name, skew_command)
node_datetime = string_to_date(node_datetime_string)
counter += 1
if counter > max_retries:
logging.error("Date and time in node %s didn't reset properly" % node_name)
logging.error(
"Date and time in node %s didn't reset properly" %
node_name
)
not_reset.append(node_name)
break
if counter < max_retries:
logging.info("Date in node " + str(node_name) + " reset properly")
logging.info(
"Date in node " + str(node_name) + " reset properly"
)
elif object_type == "pod":
for pod_name in names:
first_date_time = datetime.datetime.utcnow()
counter = 0
pod_datetime_string = pod_exec(pod_name[0], skew_command, pod_name[1], pod_name[2])
pod_datetime_string = pod_exec(
pod_name[0],
skew_command,
pod_name[1],
pod_name[2]
)
pod_datetime = string_to_date(pod_datetime_string)
while not first_date_time < pod_datetime < datetime.datetime.utcnow():
while not (
first_date_time < pod_datetime < datetime.datetime.utcnow()
):
time.sleep(10)
logging.info("Date/time on pod %s still not reset, waiting 10 seconds and retrying" % pod_name[0])
pod_datetime = pod_exec(pod_name[0], skew_command, pod_name[1], pod_name[2])
logging.info(
"Date/time on pod %s still not reset, "
"waiting 10 seconds and retrying" % pod_name[0]
)
pod_datetime = pod_exec(
pod_name[0],
skew_command,
pod_name[1],
pod_name[2]
)
pod_datetime = string_to_date(pod_datetime)
counter += 1
if counter > max_retries:
logging.error("Date and time in pod %s didn't reset properly" % pod_name[0])
logging.error(
"Date and time in pod %s didn't reset properly" %
pod_name[0]
)
not_reset.append(pod_name[0])
break
if counter < max_retries:
logging.info("Date in pod " + str(pod_name[0]) + " reset properly")
logging.info(
"Date in pod " + str(pod_name[0]) + " reset properly"
)
return not_reset
@@ -205,7 +299,14 @@ def run(scenarios_list, config, wait_duration):
not_reset = check_date_time(object_type, object_names)
if len(not_reset) > 0:
logging.info("Object times were not reset")
logging.info("Waiting for the specified duration: %s" % (wait_duration))
logging.info(
"Waiting for the specified duration: %s" % (wait_duration)
)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(config, not_reset, start_time, end_time)
cerberus.publish_kraken_status(
config,
not_reset,
start_time,
end_time
)
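
To illustrate the date format handled by `string_to_date` above, a `date`-style line parses with the same format string as follows (the timestamp itself is made up):

```
import datetime

sample = "Mon Sep 5 10:25:59 UTC 2022"  # illustrative value
parsed = datetime.datetime.strptime(sample, "%a %b %d %H:%M:%S %Z %Y")
print(parsed)  # 2022-09-05 10:25:59
```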

View File

@@ -2,12 +2,15 @@ import yaml
import sys
import logging
import time
from kraken.node_actions.aws_node_scenarios import AWS
import kraken.cerberus.setup as cerberus
from ..node_actions.aws_node_scenarios import AWS
from ..cerberus import setup as cerberus
# filters the subnet of interest and applies the network acl to create zone outage
def run(scenarios_list, config, wait_duration):
"""
filters the subnet of interest and applies the network acl
to create zone outage
"""
failed_post_scenarios = ""
for zone_outage_config in scenarios_list:
if len(zone_outage_config) > 1:
@@ -24,7 +27,11 @@ def run(scenarios_list, config, wait_duration):
if cloud_type.lower() == "aws":
cloud_object = AWS()
else:
logging.error("Cloud type " + cloud_type + " is not currently supported for zone outage scenarios")
logging.error(
"Cloud type %s is not currently supported for "
"zone outage scenarios"
% cloud_type
)
sys.exit(1)
start_time = int(time.time())
@@ -32,39 +39,62 @@ def run(scenarios_list, config, wait_duration):
for subnet_id in subnet_ids:
logging.info("Targeting subnet_id")
network_association_ids = []
associations, original_acl_id = cloud_object.describe_network_acls(vpc_id, subnet_id)
associations, original_acl_id = \
cloud_object.describe_network_acls(vpc_id, subnet_id)
for entry in associations:
if entry["SubnetId"] == subnet_id:
network_association_ids.append(entry["NetworkAclAssociationId"])
network_association_ids.append(
entry["NetworkAclAssociationId"]
)
logging.info(
"Network association ids associated with the subnet %s: %s"
"Network association ids associated with "
"the subnet %s: %s"
% (subnet_id, network_association_ids)
)
acl_id = cloud_object.create_default_network_acl(vpc_id)
new_association_id = cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
new_association_id = \
cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
# capture the orginal_acl_id, created_acl_id and new association_id to use during the recovery
# capture the original_acl_id, created_acl_id and
# new association_id to use during the recovery
ids[new_association_id] = original_acl_id
acl_ids_created.append(acl_id)
# wait for the specified duration
logging.info("Waiting for the specified duration in the config: %s" % (duration))
logging.info(
"Waiting for the specified duration "
"in the config: %s" % (duration)
)
time.sleep(duration)
# replace the applied acl with the previous acl in use
for new_association_id, original_acl_id in ids.items():
cloud_object.replace_network_acl_association(new_association_id, original_acl_id)
logging.info("Wating for 60 seconds to make sure the changes are in place")
cloud_object.replace_network_acl_association(
new_association_id,
original_acl_id
)
logging.info(
"Wating for 60 seconds to make sure "
"the changes are in place"
)
time.sleep(60)
# delete the network acl created for the run
for acl_id in acl_ids_created:
cloud_object.delete_network_acl(acl_id)
logging.info("End of scenario. Waiting for the specified duration: %s" % (wait_duration))
logging.info(
"End of scenario. "
"Waiting for the specified duration: %s" % (wait_duration)
)
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
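
For context, a zone outage scenario supplies the VPC, subnet(s), duration and cloud type manipulated above; the key names below mirror the variable names in the code and are otherwise assumptions.

```
# hypothetical zone outage scenario (aws is the only supported cloud_type here)
zone_outage:
  cloud_type: aws
  vpc_id: vpc-0123456789abcdef0     # illustrative id
  subnet_id:
    - subnet-0123456789abcdef0      # illustrative id
  duration: 600                     # seconds the replacement ACL stays applied
```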

View File

@@ -12,7 +12,7 @@ oauth2client>=4.1.3
python-openstackclient
gitpython
paramiko
setuptools
setuptools==63.4.1
openshift-client
python-ipmi
podman-compose
@@ -22,4 +22,5 @@ itsdangerous==2.0.1
werkzeug==2.0.3
aliyun-python-sdk-core-v3
aliyun-python-sdk-ecs
cryptography==36.0.2 # Remove once https://github.com/paramiko/paramiko/issues/2038 gets fixed.
arcaflow-plugin-sdk==0.3.0
git+https://github.com/vmware/vsphere-automation-sdk-python.git

View File

@@ -2,8 +2,6 @@
import os
import sys
from typing import List
import yaml
import logging
import optparse
@@ -24,14 +22,13 @@ import kraken.application_outage.actions as application_outage
import kraken.pvc.pvc_scenario as pvc_scenario
import kraken.network_chaos.actions as network_chaos
import server as server
from kraken.scenarios.base import Scenario
from kraken.scenarios.pod import PodScenario
from kraken.scenarios.runner import ScenarioRunner
from kraken import plugins
def publish_kraken_status(status):
with open("/tmp/kraken_status", "w+") as file:
file.write(str(status))
KUBE_BURNER_URL = (
"https://github.com/cloud-bulldozer/kube-burner/"
"releases/download/v{version}/kube-burner-{version}-Linux-x86_64.tar.gz"
)
KUBE_BURNER_VERSION = "0.9.1"
# Main function
@@ -48,35 +45,60 @@ def main(cfg):
distribution = config["kraken"].get("distribution", "openshift")
kubeconfig_path = config["kraken"].get("kubeconfig_path", "")
chaos_scenarios = config["kraken"].get("chaos_scenarios", [])
publish_running_status = config["kraken"].get("publish_kraken_status", False)
publish_running_status = config["kraken"].get(
"publish_kraken_status", False
)
port = config["kraken"].get("port", "8081")
run_signal = config["kraken"].get("signal_state", "RUN")
litmus_install = config["kraken"].get("litmus_install", True)
litmus_version = config["kraken"].get("litmus_version", "v1.9.1")
litmus_uninstall = config["kraken"].get("litmus_uninstall", False)
litmus_uninstall_before_run = config["kraken"].get("litmus_uninstall_before_run", True)
litmus_uninstall_before_run = config["kraken"].get(
"litmus_uninstall_before_run", True
)
wait_duration = config["tunings"].get("wait_duration", 60)
iterations = config["tunings"].get("iterations", 1)
daemon_mode = config["tunings"].get("daemon_mode", False)
deploy_performance_dashboards = config["performance_monitoring"].get("deploy_dashboards", False)
deploy_performance_dashboards = config["performance_monitoring"].get(
"deploy_dashboards", False
)
dashboard_repo = config["performance_monitoring"].get(
"repo", "https://github.com/cloud-bulldozer/performance-dashboards.git"
) # noqa
capture_metrics = config["performance_monitoring"].get("capture_metrics", False)
"repo",
"https://github.com/cloud-bulldozer/performance-dashboards.git"
)
capture_metrics = config["performance_monitoring"].get(
"capture_metrics", False
)
kube_burner_url = config["performance_monitoring"].get(
"kube_burner_binary_url",
"https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz", # noqa
KUBE_BURNER_URL.format(version=KUBE_BURNER_VERSION)
)
config_path = config["performance_monitoring"].get(
"config_path", "config/kube_burner.yaml"
)
metrics_profile = config["performance_monitoring"].get(
"metrics_profile_path", "config/metrics-aggregated.yaml"
)
prometheus_url = config["performance_monitoring"].get(
"prometheus_url", ""
)
prometheus_bearer_token = config["performance_monitoring"].get(
"prometheus_bearer_token", ""
)
config_path = config["performance_monitoring"].get("config_path", "config/kube_burner.yaml")
metrics_profile = config["performance_monitoring"].get("metrics_profile_path", "config/metrics-aggregated.yaml")
prometheus_url = config["performance_monitoring"].get("prometheus_url", "")
prometheus_bearer_token = config["performance_monitoring"].get("prometheus_bearer_token", "")
run_uuid = config["performance_monitoring"].get("uuid", "")
enable_alerts = config["performance_monitoring"].get("enable_alerts", False)
alert_profile = config["performance_monitoring"].get("alert_profile", "")
enable_alerts = config["performance_monitoring"].get(
"enable_alerts", False
)
alert_profile = config["performance_monitoring"].get(
"alert_profile", ""
)
# Initialize clients
if not os.path.isfile(kubeconfig_path):
logging.error("Cannot read the kubeconfig file at %s, please check" % kubeconfig_path)
logging.error(
"Cannot read the kubeconfig file at %s, please check" %
kubeconfig_path
)
sys.exit(1)
logging.info("Initializing client to talk to the Kubernetes cluster")
os.environ["KUBECONFIG"] = str(kubeconfig_path)
@@ -87,17 +109,25 @@ def main(cfg):
# Set up kraken url to track signal
if not 0 <= int(port) <= 65535:
logging.info("Using port 8081 as %s isn't a valid port number" % (port))
logging.info(
"Using port 8081 as %s isn't a valid port number" % (port)
)
port = 8081
address = ("0.0.0.0", port)
# If publish_running_status is False this should keep us going in our loop below
# If publish_running_status is False this should keep us going
# in our loop below
if publish_running_status:
server_address = address[0]
port = address[1]
logging.info(
"Publishing kraken status at http://%s:%s" % (
server_address,
port
)
)
logging.info("Publishing kraken status at http://%s:%s" % (server_address, port))
server.start_server(address)
publish_kraken_status(run_signal)
server.start_server(address, run_signal)
# Cluster info
logging.info("Fetching cluster info")
@@ -115,35 +145,36 @@ def main(cfg):
# Generate uuid for the run
if run_uuid:
logging.info("Using the uuid defined by the user for the run: %s" % run_uuid)
logging.info(
"Using the uuid defined by the user for the run: %s" % run_uuid
)
else:
run_uuid = str(uuid.uuid4())
logging.info("Generated a uuid for the run: %s" % run_uuid)
logger = logging.getLogger(__name__)
scenarios: List[Scenario] = [
PodScenario(logger),
]
health_checker = CerberusHealthChecker(config)
runner = ScenarioRunner(scenarios, health_checker)
# Initialize the start iteration to 0
iteration = 0
# Set the number of iterations to loop to infinity if daemon mode is
# enabled or else set it to the provided iterations count in the config
if daemon_mode:
logging.info("Daemon mode enabled, kraken will cause chaos forever\n")
logging.info(
"Daemon mode enabled, kraken will cause chaos forever\n"
)
logging.info("Ignoring the iterations set")
iterations = float("inf")
else:
logging.info("Daemon mode not enabled, will run through %s iterations\n" % str(iterations))
logging.info(
"Daemon mode not enabled, will run through %s iterations\n" %
str(iterations)
)
iterations = int(iterations)
failed_post_scenarios = []
litmus_installed = False
# Capture the start time
start_time = int(time.time())
litmus_installed = False
# Loop to run the chaos starts here
while int(iteration) < iterations and run_signal != "STOP":
@@ -156,7 +187,8 @@ def main(cfg):
if run_signal == "PAUSE":
while publish_running_status and run_signal == "PAUSE":
logging.info(
"Pausing Kraken run, waiting for %s seconds and will re-poll signal"
"Pausing Kraken run, waiting for %s seconds"
" and will re-poll signal"
% str(wait_duration)
)
time.sleep(wait_duration)
@@ -169,28 +201,53 @@ def main(cfg):
if scenarios_list:
# Inject pod chaos scenarios specified in the config
if scenario_type == "pod_scenarios":
logging.info("Running pod scenarios")
failed_post_scenarios = pod_scenarios.run(
kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_duration
logging.error(
"Pod scenarios have been removed, please use "
"plugin_scenarios with the "
"kill-pods configuration instead."
)
sys.exit(1)
elif scenario_type == "plugin_scenarios":
failed_post_scenarios = plugins.run(
scenarios_list,
kubeconfig_path,
failed_post_scenarios
)
elif scenario_type == "container_scenarios":
logging.info("Running container scenarios")
failed_post_scenarios = pod_scenarios.container_run(
kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_duration
)
failed_post_scenarios = \
pod_scenarios.container_run(
kubeconfig_path,
scenarios_list,
config,
failed_post_scenarios,
wait_duration
)
# Inject node chaos scenarios specified in the config
elif scenario_type == "node_scenarios":
logging.info("Running node scenarios")
nodeaction.run(scenarios_list, config, wait_duration)
nodeaction.run(
scenarios_list,
config,
wait_duration
)
# Inject time skew chaos scenarios specified in the config
# Inject time skew chaos scenarios specified
# in the config
elif scenario_type == "time_scenarios":
if distribution == "openshift":
logging.info("Running time skew scenarios")
time_actions.run(scenarios_list, config, wait_duration)
time_actions.run(
scenarios_list,
config,
wait_duration
)
else:
logging.error("Litmus scenarios are currently supported only on openshift")
logging.error(
"Litmus scenarios are currently "
"supported only on openshift"
)
sys.exit(1)
# Inject litmus based chaos scenarios
@@ -198,46 +255,79 @@ def main(cfg):
if distribution == "openshift":
logging.info("Running litmus scenarios")
litmus_namespace = "litmus"
if not litmus_installed:
# Remove Litmus resources before running the scenarios
common_litmus.delete_chaos(litmus_namespace)
common_litmus.delete_chaos_experiments(litmus_namespace)
if litmus_uninstall_before_run:
common_litmus.uninstall_litmus(litmus_version, litmus_namespace)
common_litmus.install_litmus(litmus_version, litmus_namespace)
common_litmus.deploy_all_experiments(litmus_version, litmus_namespace)
litmus_installed = True
common_litmus.run(
scenarios_list,
config,
litmus_uninstall,
wait_duration,
litmus_namespace,
if litmus_install:
# Remove Litmus resources
# before running the scenarios
common_litmus.delete_chaos(
litmus_namespace
)
common_litmus.delete_chaos_experiments(
litmus_namespace
)
if litmus_uninstall_before_run:
common_litmus.uninstall_litmus(
litmus_version,
litmus_namespace
)
common_litmus.install_litmus(
litmus_version,
litmus_namespace
)
common_litmus.deploy_all_experiments(
litmus_version,
litmus_namespace
)
litmus_installed = True
common_litmus.run(
scenarios_list,
config,
litmus_uninstall,
wait_duration,
litmus_namespace,
)
else:
logging.error("Litmus scenarios are currently only supported on openshift")
logging.error(
"Litmus scenarios are currently "
"only supported on openshift"
)
sys.exit(1)
# Inject cluster shutdown scenarios
elif scenario_type == "cluster_shut_down_scenarios":
shut_down.run(scenarios_list, config, wait_duration)
shut_down.run(
scenarios_list,
config,
wait_duration
)
# Inject namespace chaos scenarios
elif scenario_type == "namespace_scenarios":
logging.info("Running namespace scenarios")
namespace_actions.run(
scenarios_list, config, wait_duration, failed_post_scenarios, kubeconfig_path
scenarios_list,
config,
wait_duration,
failed_post_scenarios,
kubeconfig_path
)
# Inject zone failures
elif scenario_type == "zone_outages":
logging.info("Inject zone outages")
zone_outages.run(scenarios_list, config, wait_duration)
zone_outages.run(
scenarios_list,
config,
wait_duration
)
# Application outages
elif scenario_type == "application_outages":
logging.info("Injecting application outage")
application_outage.run(scenarios_list, config, wait_duration)
application_outage.run(
scenarios_list,
config,
wait_duration
)
# PVC scenarios
elif scenario_type == "pvc_scenarios":
@@ -247,7 +337,11 @@ def main(cfg):
# Network scenarios
elif scenario_type == "network_chaos":
logging.info("Running Network Chaos")
network_chaos.run(scenarios_list, config, wait_duration)
network_chaos.run(
scenarios_list,
config,
wait_duration
)
iteration += 1
logging.info("")
@@ -293,12 +387,15 @@ def main(cfg):
common_litmus.uninstall_litmus(litmus_version, litmus_namespace)
if failed_post_scenarios:
logging.error("Post scenarios are still failing at the end of all iterations")
logging.error(
"Post scenarios are still failing at the end of all iterations"
)
sys.exit(1)
run_dir = os.getcwd() + "/kraken.report"
logging.info(
"Successfully finished running Kraken. UUID for the run: %s. Report generated at %s. Exiting"
"Successfully finished running Kraken. UUID for the run: "
"%s. Report generated at %s. Exiting"
% (run_uuid, run_dir)
)
else:
@@ -320,7 +417,10 @@ if __name__ == "__main__":
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[logging.FileHandler("kraken.report", mode="w"), logging.StreamHandler()],
handlers=[
logging.FileHandler("kraken.report", mode="w"),
logging.StreamHandler()
],
)
if options.cfg is None:
logging.error("Please check if you have passed the config")

View File

@@ -1,89 +0,0 @@
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"$id": "https://github.com/chaos-kubox/krkn/",
"type": "object",
"default": {},
"title": "Composite scenario for Krkn",
"required": [
"steps"
],
"properties": {
"iterations": {
"type": "integer",
"default": 1,
"title": "How many iterations to execute",
"examples": [
3
]
},
"steps": {
"type": "array",
"default": [],
"title": "The steps Schema",
"items": {
"type": "object",
"default": {},
"title": "A Schema",
"required": [
"pod"
],
"properties": {
"pod": {
"type": "object",
"default": {},
"title": "The pod Schema",
"required": [
"name_pattern",
"namespace_pattern"
],
"properties": {
"name_pattern": {
"type": "string",
"default": "",
"title": "The name_pattern Schema",
"examples": [
""
]
},
"namespace_pattern": {
"type": "string",
"default": "",
"title": "The namespace_pattern Schema",
"examples": [
""
]
}
},
"examples": [{
"name_pattern": "test-.*",
"namespace_pattern": "default"
}]
}
},
"examples": [{
"pod": {
"name_pattern": "test-.*",
"namespace_pattern": "default"
}
}]
},
"examples": [
[{
"pod": {
"name_pattern": "test-.*",
"namespace_pattern": "default"
}
}]
]
}
},
"examples": [{
"iterations": 1,
"steps": [{
"pod": {
"name_pattern": "test-.*",
"namespace_pattern": "default"
}
}]
}]
}

View File

@@ -1,5 +0,0 @@
iterations: 1
steps:
- pod:
name_pattern:
namespace_pattern:

scenarios/kube/pod.yml Normal file
View File

@@ -0,0 +1,6 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
name_pattern: ^nginx-.*$
namespace_pattern: ^default$
kill: 1

View File

@@ -1,32 +1,10 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete scheduler pods"
steps:
- podAction:
matches:
- labels:
namespace: "kube-system"
selector: "k8s-app=kube-scheduler"
filters:
- randomSample:
size: 1
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "kube-system"
selector: "k8s-app=kube-scheduler"
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 3
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
- id: wait-for-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
count: 3

View File

@@ -1,32 +1,10 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete acme-air pods"
steps:
- podAction:
matches:
- labels:
namespace: "acme-air"
selector: ""
filters:
- randomSample:
size: 1
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "acme-air"
selector: ""
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 8
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^acme-air$
name_pattern: .*
- id: wait-for-pods
config:
namespace_pattern: ^acme-air$
name_pattern: .*
count: 8

View File

@@ -1,32 +1,10 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete etcd pods"
steps:
- podAction:
matches:
- labels:
namespace: "openshift-etcd"
selector: "k8s-app=etcd"
filters:
- randomSample:
size: 1
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "openshift-etcd"
selector: "k8s-app=etcd"
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 3
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^openshift-etcd$
label_selector: k8s-app=etcd
- id: wait-for-pods
config:
namespace_pattern: ^openshift-etcd$
label_selector: k8s-app=etcd
count: 3

View File

@@ -0,0 +1,17 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: network_chaos
config:
node_interface_name: # Dictionary with node name(s) as keys and a list of that node's interfaces to test as values
<node_name_1>:
- <interface-1>
label_selector: <label_selector> # When node_interface_name is not specified, nodes matching the label_selector are selected for the node chaos scenario injection
instance_count: <number> # Number of nodes matching the label selector on which to perform the action
kubeconfig_path: <path> # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
execution_type: <serial/parallel> # Used to specify whether you want to apply filters on interfaces one at a time or all at once. Default is 'parallel'
network_params: # latency, loss and bandwidth are the three supported network parameters to alter for the chaos test
latency: <time> # Value is a string. For example : 50ms
loss: <fraction> # Loss is a fraction between 0 and 1. It has to be enclosed in quotes to treat it as a string. For example, '0.02' (not 0.02)
bandwidth: <rate> # Value is a string. For example: 100mbit
wait_duration: <time_duration> # Default is 300. Ensure that it is at least about twice of test_duration
test_duration: <time_duration> # Default is 120
kraken_config: <path> # Specify this if you want to use Cerberus config
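
Filling the template above with illustrative values (node and interface names are made up):

```
- id: network_chaos
  config:
    node_interface_name:
      worker-0:
        - ens5
    network_params:
      latency: 50ms
      loss: '0.02'
      bandwidth: 100mbit
    wait_duration: 300
    test_duration: 120
```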

View File

@@ -1,35 +1,10 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete openshift-apiserver pods"
steps:
- podAction:
matches:
- labels:
namespace: "openshift-apiserver"
selector: "app=openshift-apiserver-a"
filters:
- randomSample:
size: 1
# The actions will be executed in the order specified
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "openshift-apiserver"
selector: "app=openshift-apiserver-a"
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 3
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^openshift-apiserver$
label_selector: app=openshift-apiserver-a
- id: wait-for-pods
config:
namespace_pattern: ^openshift-apiserver$
label_selector: app=openshift-apiserver-a
count: 3

View File

@@ -1,3 +1,7 @@
# yaml-language-server: $schema=../pod.schema.json
namespace_pattern: openshift-kube-apiserver
kill: 1
config:
runStrategy:
runs: 1

View File

@@ -1,21 +1,10 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 10
minSecondsBetweenRuns: 1
scenarios:
- name: "check 2 pods are in namespace with selector: prometheus"
steps:
- podAction:
matches:
- labels:
namespace: "openshift-monitoring"
selector: "app=prometheus"
filters:
- property:
name: "state"
value: "Running"
# The actions will be executed in the order specified
actions:
- checkPodCount:
count: 2
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^openshift-monitoring$
label_selector: app=prometheus
- id: wait-for-pods
config:
namespace_pattern: ^openshift-monitoring$
label_selector: app=prometheus
count: 2

View File

@@ -1,71 +1,90 @@
#!/usr/bin/env python3
import subprocess
import logging
import re
import subprocess
import sys
from kubernetes import client, config
from kubernetes.client.rest import ApiException
import logging
# List all namespaces
def list_namespaces():
namespaces = []
"""
List all namespaces
"""
spaces_list = []
try:
config.load_kube_config()
cli = client.CoreV1Api()
ret = cli.list_namespace(pretty=True)
except ApiException as e:
logging.error(
"Exception when calling \
CoreV1Api->list_namespaced_pod: %s\n"
% e
"Exception when calling CoreV1Api->list_namespace: %s\n",
e
)
for namespace in ret.items:
namespaces.append(namespace.metadata.name)
return namespaces
for current_namespace in ret.items:
spaces_list.append(current_namespace.metadata.name)
return spaces_list
# Check if all the watch_namespaces are valid
def check_namespaces(namespaces):
"""
Check if all the watch_namespaces are valid
"""
try:
valid_namespaces = list_namespaces()
regex_namespaces = set(namespaces) - set(valid_namespaces)
final_namespaces = set(namespaces) - set(regex_namespaces)
valid_regex = set()
if regex_namespaces:
for namespace in valid_namespaces:
for current_ns in valid_namespaces:
for regex_namespace in regex_namespaces:
if re.search(regex_namespace, namespace):
final_namespaces.add(namespace)
if re.search(regex_namespace, current_ns):
final_namespaces.add(current_ns)
valid_regex.add(regex_namespace)
break
invalid_namespaces = regex_namespaces - valid_regex
if invalid_namespaces:
raise Exception("There exists no namespaces matching: %s" % (invalid_namespaces))
raise Exception(
"There exists no namespaces matching: %s" % (
invalid_namespaces
)
)
return list(final_namespaces)
except Exception as e:
logging.error("%s" % (e))
logging.error(str(e))
sys.exit(1)
def run(cmd):
try:
output = subprocess.Popen(
cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
cmd,
shell=True,
universal_newlines=True,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT
)
(out, err) = output.communicate()
except Exception as e:
logging.error("Failed to run %s, error: %s" % (cmd, e))
logging.error("Failed to run %s, error: %s", cmd, e)
return out
regex_namespace = ["openshift-.*"]
namespaces = check_namespaces(regex_namespace)
pods_running = 0
for namespace in namespaces:
new_pods_running = run("oc get pods -n " + namespace + " | grep -c Running").rstrip()
try:
pods_running += int(new_pods_running)
except Exception:
continue
print(pods_running)
def print_running_pods():
regex_namespace_list = ["openshift-.*"]
checked_namespaces = check_namespaces(regex_namespace_list)
pods_running = 0
for namespace in checked_namespaces:
new_pods_running = run(
"oc get pods -n " + namespace + " | grep -c Running"
).rstrip()
try:
pods_running += int(new_pods_running)
except Exception:
continue
print(pods_running)
if __name__ == '__main__':
print_running_pods()
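
To make the namespace resolution in `check_namespaces` above concrete, here is a standalone sketch of the same literal-versus-regex split against a made-up namespace list:

```
import re

requested = ["openshift-.*", "default"]
existing = ["default", "openshift-etcd", "openshift-monitoring", "kube-system"]

literal = set(requested) & set(existing)   # names that exist verbatim
regexes = set(requested) - set(existing)   # the rest are treated as patterns
matched = {ns for ns in existing for rx in regexes if re.search(rx, ns)}
print(sorted(literal | matched))
# ['default', 'openshift-etcd', 'openshift-monitoring']
```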

View File

@@ -1,35 +1,11 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete prometheus pods"
steps:
- podAction:
matches:
- labels:
namespace: "openshift-monitoring"
selector: "app=prometheus"
filters:
- randomSample:
size: 1
# The actions will be executed in the order specified
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "openshift-monitoring"
selector: "app=prometheus"
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 2
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^openshift-monitoring$
label_selector: app=prometheus
- id: wait-for-pods
config:
namespace_pattern: ^openshift-monitoring$
label_selector: app=prometheus
count: 2
timeout: 180

View File

@@ -1,20 +1,6 @@
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: kill up to 3 pods in any openshift namespace
steps:
- podAction:
matches:
- namespace: "openshift-.*"
filters:
- property:
name: "state"
value: "Running"
- randomSample:
size: 3
actions:
- kill:
probability: .7
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^openshift-.*$
name_pattern: .*
kill: 3

View File

@@ -0,0 +1,10 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: <node_stop_scenario/node_start_scenario/node_reboot_scenario/node_terminate_scenario>
config:
name: <node_name> # Node on which scenario has to be injected; can set multiple names separated by comma
label_selector: <label_selector> # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
runs: 1 # Number of times to inject each scenario under actions (will perform on same node each time)
instance_count: 1 # Number of nodes matching the label selector on which to perform the action
timeout: 300 # Duration to wait for completion of node scenario injection
verify_session: True # Set to True if you want to verify the vSphere client session using certificates; else False
skip_openshift_checks: False # Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario
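
With illustrative values (the node name is made up), a concrete instance of the template above could read:

```
- id: node_stop_scenario
  config:
    name: vsphere-worker-0
    instance_count: 1
    runs: 1
    timeout: 300
    verify_session: True
    skip_openshift_checks: False
```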

View File

@@ -0,0 +1,5 @@
This file is generated by running the "plugins" module in the kraken project:
```
python -m kraken.plugins >scenarios/plugin.schema.json
```

View File

@@ -0,0 +1,157 @@
{
"$id": "https://github.com/redhat-chaos/krkn/",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Kraken Arcaflow scenarios",
"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
"type": "array",
"minContains": 1,
"items": {
"oneOf": [
{
"type": "object",
"properties": {
"id": {
"type": "string",
"const": "kill-pods"
},
"config": {
"type": "object",
"properties": {
"namespace_pattern": {
"type": "string",
"format": "regex",
"title": "Namespace pattern",
"description": "Regular expression for target pod namespaces."
},
"name_pattern": {
"type": "string",
"format": "regex",
"title": "Name pattern",
"description": "Regular expression for target pods. Required if label_selector is not set."
},
"kill": {
"type": "integer",
"minimum": 1,
"title": "Number of pods to kill",
"description": "How many pods should we attempt to kill?"
},
"label_selector": {
"type": "string",
"minLength": 1,
"title": "Label selector",
"description": "Kubernetes label selector for the target pods. Required if name_pattern is not set.\nSee https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details."
},
"kubeconfig_path": {
"type": "string",
"title": "Kubeconfig path",
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
},
"timeout": {
"type": "integer",
"title": "Timeout",
"description": "Timeout to wait for the target pod(s) to be removed in seconds."
},
"backoff": {
"type": "integer",
"title": "Backoff",
"description": "How many seconds to wait between checks for the target pod status."
}
},
"additionalProperties": false,
"required": [
"namespace_pattern"
]
}
},
"required": [
"id",
"config"
]
},
{
"type": "object",
"properties": {
"id": {
"type": "string",
"const": "wait-for-pods"
},
"config": {
"type": "object",
"properties": {
"namespace_pattern": {
"type": "string",
"format": "regex",
"title": "namespace_pattern"
},
"name_pattern": {
"type": "string",
"format": "regex",
"title": "name_pattern"
},
"label_selector": {
"type": "string",
"minLength": 1,
"title": "label_selector"
},
"count": {
"type": "integer",
"minimum": 1,
"title": "Pod count",
"description": "Wait for at least this many pods to exist"
},
"timeout": {
"type": "integer",
"minimum": 1,
"title": "Timeout",
"description": "How many seconds to wait for?"
},
"backoff": {
"type": "integer",
"title": "Backoff",
"description": "How many seconds to wait between checks for the target pod status."
},
"kubeconfig_path": {
"type": "string",
"title": "kubeconfig_path"
}
},
"additionalProperties": false,
"required": [
"namespace_pattern"
]
}
},
"required": [
"id",
"config"
]
},
{
"type": "object",
"properties": {
"id": {
"type": "string",
"const": "run_python"
},
"config": {
"type": "object",
"properties": {
"filename": {
"type": "string",
"title": "filename"
}
},
"additionalProperties": false,
"required": [
"filename"
]
}
},
"required": [
"id",
"config"
]
}
]
}
}
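
Taken together, a document accepted by this schema is a list of id/config steps. A hedged example (values are illustrative, echoing the prometheus wait step earlier in this diff) of what such a list looks like after YAML parsing:

```
# Illustrative scenario: kill two prometheus pods, then wait for replacements.
scenario = [
    {
        "id": "kill-pods",
        "config": {
            "namespace_pattern": "^openshift-monitoring$",
            "label_selector": "app=prometheus",
            "kill": 2,
        },
    },
    {
        "id": "wait-for-pods",
        "config": {
            "namespace_pattern": "^openshift-monitoring$",
            "label_selector": "app=prometheus",
            "count": 2,
            "timeout": 180,
        },
    },
]
```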

View File

@@ -4,9 +4,13 @@ import _thread
from http.server import HTTPServer, BaseHTTPRequestHandler
from http.client import HTTPConnection
server_status = ""
# Start a simple http server to publish the cerberus status file content
class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
"""
A simple http server to publish the cerberus status file content
"""
requests_served = 0
def do_GET(self):
@@ -16,9 +20,8 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
def do_status(self):
self.send_response(200)
self.end_headers()
f = open("/tmp/kraken_status", "rb")
self.wfile.write(f.read())
SimpleHTTPRequestHandler.requests_served = SimpleHTTPRequestHandler.requests_served + 1
self.wfile.write(bytes(server_status, encoding='utf8'))
SimpleHTTPRequestHandler.requests_served += 1
def do_POST(self):
if self.path == "/STOP":
@@ -31,23 +34,26 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
def set_run(self):
self.send_response(200)
self.end_headers()
with open("/tmp/kraken_status", "w+") as file:
file.write(str("RUN"))
global server_status
server_status = 'RUN'
def set_stop(self):
self.send_response(200)
self.end_headers()
with open("/tmp/kraken_status", "w+") as file:
file.write(str("STOP"))
global server_status
server_status = 'STOP'
def set_pause(self):
self.send_response(200)
self.end_headers()
with open("/tmp/kraken_status", "w+") as file:
file.write(str("PAUSE"))
global server_status
server_status = 'PAUSE'
def publish_kraken_status(status):
global server_status
server_status = status
def start_server(address):
def start_server(address, status):
server = address[0]
port = address[1]
global httpd
@@ -55,7 +61,8 @@ def start_server(address):
logging.info("Starting http server at http://%s:%s\n" % (server, port))
try:
_thread.start_new_thread(httpd.serve_forever, ())
except Exception:
publish_kraken_status(status)
except Exception as e:
logging.error(
"Failed to start the http server \
at http://%s:%s"

View File

@@ -0,0 +1,61 @@
import unittest
import logging
from arcaflow_plugin_sdk import plugin
from kraken.plugins.network import ingress_shaping
class NetworkScenariosTest(unittest.TestCase):
def test_serialization(self):
plugin.test_object_serialization(
ingress_shaping.NetworkScenarioConfig(
node_interface_name={"foo": ['bar']},
network_params={
"latency": "50ms",
"loss": "0.02",
"bandwidth": "100mbit"
}
),
self.fail,
)
plugin.test_object_serialization(
ingress_shaping.NetworkScenarioSuccessOutput(
filter_direction="ingress",
test_interfaces={"foo": ['bar']},
network_parameters={
"latency": "50ms",
"loss": "0.02",
"bandwidth": "100mbit"
},
execution_type="parallel"),
self.fail,
)
plugin.test_object_serialization(
ingress_shaping.NetworkScenarioErrorOutput(
error="Hello World",
),
self.fail,
)
def test_network_chaos(self):
output_id, output_data = ingress_shaping.network_chaos(
ingress_shaping.NetworkScenarioConfig(
label_selector="node-role.kubernetes.io/master",
instance_count=1,
network_params={
"latency": "50ms",
"loss": "0.02",
"bandwidth": "100mbit"
}
)
)
if output_id == "error":
logging.error(output_data.error)
self.fail(
"The network chaos scenario did not complete successfully "
"because an error/exception occurred"
)
if __name__ == "__main__":
unittest.main()

175
tests/test_pod_plugin.py Normal file
View File

@@ -0,0 +1,175 @@
import random
import re
import string
import threading
import unittest
from arcaflow_plugin_sdk import plugin
from kubernetes.client import V1Pod, V1ObjectMeta, V1PodSpec, V1Container, ApiException
from kraken.plugins import pod_plugin
from kraken.plugins.pod_plugin import setup_kubernetes, KillPodConfig, PodKillSuccessOutput
from kubernetes import client
class KillPodTest(unittest.TestCase):
def test_serialization(self):
plugin.test_object_serialization(
pod_plugin.KillPodConfig(
namespace_pattern=re.compile(".*"),
name_pattern=re.compile(".*")
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodKillSuccessOutput(
pods={}
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodErrorOutput(
error="Hello world!"
),
self.fail,
)
def test_not_enough_pods(self):
name = ''.join(random.choices(string.ascii_lowercase, k=8))
output_id, output_data = pod_plugin.kill_pods(KillPodConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^unit-test-" + re.escape(name) + "$"),
))
if output_id != "error":
self.fail("Not enough pods did not result in an error.")
print(output_data.error)
def test_kill_pod(self):
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
pod = core_v1.create_namespaced_pod("default", V1Pod(
metadata=V1ObjectMeta(
generate_name="test-",
),
spec=V1PodSpec(
containers=[
V1Container(
name="test",
image="alpine",
tty=True,
)
]
),
))
def remove_test_pod():
try:
core_v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
except ApiException as e:
if e.status != 404:
raise
self.addCleanup(remove_test_pod)
output_id, output_data = pod_plugin.kill_pods(KillPodConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^" + re.escape(pod.metadata.name) + "$"),
))
if output_id == "error":
self.fail(output_data.error)
self.assertIsInstance(output_data, PodKillSuccessOutput)
out: PodKillSuccessOutput = output_data
self.assertEqual(1, len(out.pods))
pod_list = list(out.pods.values())
self.assertEqual(pod.metadata.name, pod_list[0].name)
try:
core_v1.read_namespaced_pod(pod_list[0].name, pod_list[0].namespace)
self.fail("Killed pod is still present.")
except ApiException as e:
if e.status != 404:
self.fail("Incorrect API exception encountered: {}".format(e))
class WaitForPodTest(unittest.TestCase):
def test_serialization(self):
plugin.test_object_serialization(
pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile(".*"),
name_pattern=re.compile(".*")
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile(".*"),
label_selector="app=nginx"
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodWaitSuccessOutput(
pods=[]
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodErrorOutput(
error="Hello world!"
),
self.fail,
)
def test_timeout(self):
name = "watch-test-" + ''.join(random.choices(string.ascii_lowercase, k=8))
output_id, output_data = pod_plugin.wait_for_pods(pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^" + re.escape(name) + "$"),
timeout=1
))
self.assertEqual("error", output_id)
def test_watch(self):
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
name = "watch-test-" + ''.join(random.choices(string.ascii_lowercase, k=8))
def create_test_pod():
core_v1.create_namespaced_pod("default", V1Pod(
metadata=V1ObjectMeta(
name=name,
),
spec=V1PodSpec(
containers=[
V1Container(
name="test",
image="alpine",
tty=True,
)
]
),
))
def remove_test_pod():
try:
core_v1.delete_namespaced_pod(name, "default")
except ApiException as e:
if e.status != 404:
raise
self.addCleanup(remove_test_pod)
t = threading.Timer(10, create_test_pod)
t.start()
output_id, output_data = pod_plugin.wait_for_pods(pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^" + re.escape(name) + "$"),
timeout=60
))
self.assertEqual("success", output_id)
if __name__ == '__main__':
unittest.main()

View File

@@ -0,0 +1,28 @@
import tempfile
import unittest
from kraken.plugins import run_python_file
from kraken.plugins.run_python_plugin import RunPythonFileInput
class RunPythonPluginTest(unittest.TestCase):
def test_success_execution(self):
tmp_file = tempfile.NamedTemporaryFile()
tmp_file.write(bytes("print('Hello world!')", 'utf-8'))
tmp_file.flush()
output_id, output_data = run_python_file(RunPythonFileInput(tmp_file.name))
self.assertEqual("success", output_id)
self.assertEqual("Hello world!\n", output_data.stdout)
def test_error_execution(self):
tmp_file = tempfile.NamedTemporaryFile()
tmp_file.write(bytes("import sys\nprint('Hello world!')\nsys.exit(42)\n", 'utf-8'))
tmp_file.flush()
output_id, output_data = run_python_file(RunPythonFileInput(tmp_file.name))
self.assertEqual("error", output_id)
self.assertEqual(42, output_data.exit_code)
self.assertEqual("Hello world!\n", output_data.stdout)
if __name__ == '__main__':
unittest.main()

129
tests/test_vmware_plugin.py Normal file
View File

@@ -0,0 +1,129 @@
import unittest
import os
import logging
from arcaflow_plugin_sdk import plugin
from kraken.plugins.vmware.kubernetes_functions import Actions
from kraken.plugins.vmware import vmware_plugin
class NodeScenariosTest(unittest.TestCase):
def setUp(self):
vsphere_env_vars = [
"VSPHERE_IP",
"VSPHERE_USERNAME",
"VSPHERE_PASSWORD"
]
self.credentials_present = all(
env_var in os.environ for env_var in vsphere_env_vars
)
def test_serialization(self):
plugin.test_object_serialization(
vmware_plugin.NodeScenarioConfig(
name="test",
skip_openshift_checks=True
),
self.fail,
)
plugin.test_object_serialization(
vmware_plugin.NodeScenarioSuccessOutput(
nodes={}, action=Actions.START
),
self.fail,
)
plugin.test_object_serialization(
vmware_plugin.NodeScenarioErrorOutput(
error="Hello World", action=Actions.START
),
self.fail,
)
def test_node_start(self):
if not self.credentials_present:
self.skipTest(
"Check if the environmental variables 'VSPHERE_IP', "
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
)
vsphere = vmware_plugin.vSphere(verify=False)
vm_id, vm_name = vsphere.create_default_vm()
if vm_id is None:
self.fail("Could not create test VM")
output_id, output_data = vmware_plugin.node_start(
vmware_plugin.NodeScenarioConfig(
name=vm_name, skip_openshift_checks=True, verify_session=False
)
)
if output_id == "error":
logging.error(output_data.error)
self.fail("The VMware VM did not start because an error occurred")
vsphere.release_instances(vm_name)
def test_node_stop(self):
if not self.credentials_present:
self.skipTest(
"Check if the environmental variables 'VSPHERE_IP', "
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
)
vsphere = vmware_plugin.vSphere(verify=False)
vm_id, vm_name = vsphere.create_default_vm()
if vm_id is None:
self.fail("Could not create test VM")
vsphere.start_instances(vm_name)
output_id, output_data = vmware_plugin.node_stop(
vmware_plugin.NodeScenarioConfig(
name=vm_name, skip_openshift_checks=True, verify_session=False
)
)
if output_id == "error":
logging.error(output_data.error)
self.fail("The VMware VM did not stop because an error occurred")
vsphere.release_instances(vm_name)
def test_node_reboot(self):
if not self.credentials_present:
self.skipTest(
"Check if the environmental variables 'VSPHERE_IP', "
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
)
vsphere = vmware_plugin.vSphere(verify=False)
vm_id, vm_name = vsphere.create_default_vm()
if vm_id is None:
self.fail("Could not create test VM")
vsphere.start_instances(vm_name)
output_id, output_data = vmware_plugin.node_reboot(
vmware_plugin.NodeScenarioConfig(
name=vm_name, skip_openshift_checks=True, verify_session=False
)
)
if output_id == "error":
logging.error(output_data.error)
self.fail("The VMware VM did not reboot because an error occurred")
vsphere.release_instances(vm_name)
def test_node_terminate(self):
if not self.credentials_present:
self.skipTest(
"Check if the environmental variables 'VSPHERE_IP', "
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
)
vsphere = vmware_plugin.vSphere(verify=False)
vm_id, vm_name = vsphere.create_default_vm()
if vm_id is None:
self.fail("Could not create test VM")
vsphere.start_instances(vm_name)
output_id, output_data = vmware_plugin.node_terminate(
vmware_plugin.NodeScenarioConfig(
name=vm_name, skip_openshift_checks=True, verify_session=False
)
)
if output_id == "error":
logging.error(output_data.error)
self.fail("The VMware VM did not reboot because an error occurred")
if __name__ == "__main__":
unittest.main()
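
These vSphere-backed tests are skipped unless credentials are exported. A hedged sketch of running them from Python with placeholder values for the variables checked in setUp():

```
import os
import unittest

# Placeholder credentials; the variable names are the ones checked in setUp().
os.environ.setdefault("VSPHERE_IP", "vcenter.example.com")
os.environ.setdefault("VSPHERE_USERNAME", "administrator@vsphere.local")
os.environ.setdefault("VSPHERE_PASSWORD", "changeme")

suite = unittest.defaultTestLoader.loadTestsFromName("tests.test_vmware_plugin")
unittest.TextTestRunner(verbosity=2).run(suite)
```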