Mirror of https://github.com/krkn-chaos/krkn.git (synced 2026-02-21 05:20:25 +00:00)

Compare commits: remove-pow...v1.1.0 (45 commits)
| SHA1 |
|---|
| 9de6c7350e |
| 9f23699cfa |
| fcc7145b98 |
| bce5be9667 |
| 0031912000 |
| 1a1a9c9bfe |
| ec807e3b3a |
| b444854cb2 |
| 1dc58d8721 |
| 6112ba63c3 |
| 155269fd9d |
| 79b92fc395 |
| ed1c486c85 |
| 6ba1e1ad8b |
| 3b476b68f2 |
| e17ebd0e7b |
| d0d289fb7c |
| 66f88f5a78 |
| abc635c699 |
| 90b45538f2 |
| c6469ef6cd |
| c94c2b22a9 |
| 9421a0c2c2 |
| 8a68e1cc9b |
| d5615ac470 |
| 5ab16baafa |
| 412d718985 |
| 11f469cb8e |
| 6c75d3dddb |
| f7e27a215e |
| e680592762 |
| 08deae63dd |
| f4bc30d2a1 |
| bbde837360 |
| 5d789e7d30 |
| 69fc8e8d1b |
| 77f53b3a23 |
| ccd902565e |
| da117ad9d9 |
| ca7bc3f67b |
| b01d9895fb |
| bbb66aa322 |
| 97d4f51f74 |
| 4522ab77b1 |
| f4bfc08186 |
.github/workflows/build.yml (vendored, 13 changed lines)
```
@@ -12,14 +12,19 @@ jobs:
      - name: Check out code
        uses: actions/checkout@v3
      - name: Create multi-node KinD cluster
        uses: chaos-kubox/actions/kind@main
        uses: redhat-chaos/actions/kind@main
      - name: Install Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
          architecture: 'x64'
      - name: Install environment
        run: |
          sudo apt-get install build-essential python3-dev
          pip install -r requirements.txt
      - name: Run unit tests
        run: python -m unittest discover
      - name: Run e2e tests
        run: python -m unittest discover -s tests
      - name: Run CI
        run: ./CI/run.sh
      - name: Build the Docker images
        run: docker build --no-cache -t quay.io/chaos-kubox/krkn containers/
@@ -34,7 +39,7 @@ jobs:
        run: docker push quay.io/chaos-kubox/krkn
      - name: Rebuild krkn-hub
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        uses: chaos-kubox/actions/krkn-hub@main
        uses: redhat-chaos/actions/krkn-hub@main
        with:
          QUAY_USER: ${{ secrets.QUAY_USER_1 }}
          QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}
```
```
@@ -1,31 +1,6 @@
config:
  runStrategy:
    runs: 1
    maxSecondsBetweenRuns: 30
    minSecondsBetweenRuns: 1
scenarios:
  - name: "delete hello pods"
    steps:
      - podAction:
          matches:
            - labels:
                namespace: "default"
                selector: "hello-openshift"
          filters:
            - randomSample:
                size: 1
          actions:
            - kill:
                probability: 1
                force: true
      - podAction:
          matches:
            - labels:
                namespace: "default"
                selector: "hello-openshift"
          retries:
            retriesTimeout:
              timeout: 180
          actions:
            - checkPodCount:
                count: 1
# yaml-language-server: $schema=../../scenarios/plugin.schema.json
- id: kill-pods
  config:
    label_selector: name=hello-openshift
    namespace_pattern: ^default$
    kill: 1
```
README.md (15 changed lines)
@@ -1,5 +1,5 @@
# Krkn aka Kraken
[](https://quay.io/chaos-kubox/krkn)
[](https://quay.io/repository/chaos-kubox/krkn?tab=tags&tag=latest)


@@ -23,7 +23,7 @@ Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check i
- Test environment recommendations as to how and where to run chaos tests.
- Chaos testing in practice.

The guide is hosted at [https://chaos-kubox.github.io/krkn/](https://chaos-kubox.github.io/krkn/).
The guide is hosted at https://redhat-chaos.github.io/krkn.


### How to Get Started
@@ -35,7 +35,7 @@ After installation, refer back to the below sections for supported scenarios and


#### Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to [Kraken-hub](https://github.com/chaos-kubox/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
For cases where you want to run Kraken with minimal configuration changes, refer to [Kraken-hub](https://github.com/redhat-chaos/krkn-hub). One use case is CI integration where you do not want to carry around different configuration files for the scenarios.

### Setting up infrastructure dependencies
Kraken indexes the metrics specified in the profile into Elasticsearch in addition to leveraging Cerberus for understanding the health of the Kubernetes/OpenShift cluster under test. More information on the features is documented below. The infrastructure pieces can be easily installed and uninstalled by running:
@@ -74,7 +74,7 @@ Scenario type | Kubernetes | OpenShift
### Kraken scenario pass/fail criteria and report
It is important to make sure to check if the targeted component recovered from the chaos injection and also if the Kubernetes/OpenShift cluster is healthy as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/chaos-kubox/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/chaos-kubox/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/chaos-kubox/krkn/blob/main/config/cerberus.yaml).
- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation) or can be installed from Kraken using the [instructions](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the Kraken config file. Cerberus can monitor [application routes](https://github.com/redhat-chaos/cerberus/blob/main/docs/config.md#watch-routes) during the chaos and fails the run if it encounters downtime as it is a potential downtime in a customers, or users environment as well. It is especially important during the control plane chaos scenarios including the API server, Etcd, Ingress etc. It can be enabled by setting `check_applicaton_routes: True` in the [Kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) provided application routes are being monitored in the [cerberus config](https://github.com/redhat-chaos/krkn/blob/main/config/cerberus.yaml).
- Leveraging [kube-burner](docs/alerts.md) alerting feature to fail the runs in case of critical alerts.
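The settings named in the bullet above live in the `cerberus` section of the Kraken config. A minimal sketch of that section, using only the keys quoted above (the URL value is illustrative):

```yaml
cerberus:
    cerberus_enabled: True             # consume the go/no-go signal post chaos
    cerberus_url: http://0.0.0.0:8080  # illustrative; wherever Cerberus publishes its True/False signal
    check_applicaton_routes: True      # note: the key is spelled this way in the config
```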
### Signaling
@@ -105,11 +105,10 @@ In addition to checking the recovery and health of the cluster and components un

### Roadmap
Following is a list of enhancements that we are planning to work on adding support in Kraken. Of course any help/contributions are greatly appreciated.
- [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/chaos-kubox/krkn/issues/124)
- Ability to shape the ingress network similar to how Kraken supports [egress traffic shaping](https://github.com/chaos-kubox/krkn/blob/main/docs/network_chaos.md) today.
- [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/redhat-chaos/krkn/issues/124)
- Continue to improve [Chaos Testing Guide](https://cloud-bulldozer.github.io/kraken/) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- Support for running Kraken on Kubernetes distribution - see https://github.com/chaos-kubox/krkn/issues/185, https://github.com/chaos-kubox/krkn/issues/186
- Sweet logo for Kraken - see https://github.com/chaos-kubox/krkn/issues/195
- Support for running Kraken on Kubernetes distribution - see https://github.com/redhat-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- Sweet logo for Kraken - see https://github.com/redhat-chaos/krkn/issues/195


### Contributions
```
@@ -5,21 +5,23 @@ kraken:
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/container_etcd.yml
- pod_scenarios:
- - scenarios/openshift/etcd.yml
- - scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/post_action_regex.py
- plugin_scenarios:
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/vmware_node_scenarios.yml
- scenarios/openshift/network_chaos_ingress.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- pod_scenarios:
- - scenarios/openshift/openshift-apiserver.yml
- - scenarios/openshift/openshift-kube-apiserver.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- litmus_scenarios: # List of litmus scenarios to load
```
```
@@ -5,14 +5,15 @@ kraken:
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/kube/container_dns.yml
- pod_scenarios:
- - scenarios/kube/scheduler.yml
- plugin_scenarios:
- scenarios/kube/scheduler.yml

cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed
```
```
@@ -9,15 +9,14 @@ kraken:
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- pod_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/etcd.yml
- - scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/post_action_regex.py
- plugin_scenarios: # List of chaos pod scenarios to load
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- pod_scenarios:
- - scenarios/openshift/openshift-apiserver.yml
- - scenarios/openshift/openshift-kube-apiserver.yml
- plugin_scenarios:
- scenarios/openshift/openshift-apiserver.yml
- scenarios/openshift/openshift-kube-apiserver.yml
- time_scenarios: # List of chaos time scenarios to load
- scenarios/openshift/time_scenarios_example.yml
- litmus_scenarios: # List of litmus scenarios to load
```
```
@@ -2,6 +2,8 @@

FROM quay.io/openshift/origin-tests:latest as origintests

FROM mcr.microsoft.com/azure-cli:latest as azure-cli

FROM quay.io/centos/centos:stream9

LABEL org.opencontainers.image.authors="Red Hat OpenShift Chaos Engineering"
@@ -12,17 +14,18 @@ ENV KUBECONFIG /root/.kube/config
COPY --from=origintests /usr/bin/oc /usr/bin/oc
COPY --from=origintests /usr/bin/kubectl /usr/bin/kubectl

# Copy azure client binary from azure-cli image
COPY --from=azure-cli /usr/local/bin/az /usr/bin/az

# Install dependencies
RUN yum install epel-release -y && \
    yum install -y git python python3-pip jq gettext && \
    python3 -m pip install -U pip && \
    rpm --import https://packages.microsoft.com/keys/microsoft.asc && \
    echo -e "[azure-cli]\nname=Azure CLI\nbaseurl=https://packages.microsoft.com/yumrepos/azure-cli\nenabled=1\ngpgcheck=1\ngpgkey=https://packages.microsoft.com/keys/microsoft.asc" > /etc/yum.repos.d/azure-cli.repo && yum install -y azure-cli && \
    git clone https://github.com/openshift-scale/kraken /root/kraken && \
    yum install -y git python39 python3-pip jq gettext && \
    python3.9 -m pip install -U pip && \
    git clone https://github.com/redhat-chaos/krkn.git --branch v1.0.1 /root/kraken && \
    mkdir -p /root/.kube && cd /root/kraken && \
    pip3 install -r requirements.txt
    pip3.9 install -r requirements.txt

WORKDIR /root/kraken

ENTRYPOINT ["python3", "run_kraken.py"]
ENTRYPOINT ["python3.9", "run_kraken.py"]
CMD ["--config=config/config.yaml"]

@@ -15,7 +15,7 @@ RUN curl -L -o openshift-client-linux.tar.gz https://mirror.openshift.com/pub/op
# Install dependencies
RUN yum install epel-release -y && \
    yum install -y git python36 python3-pip gcc libffi-devel python36-devel openssl-devel gcc-c++ make jq gettext && \
    git clone https://github.com/cloud-bulldozer/kraken /root/kraken && \
    git clone https://github.com/redhat-chaos/krkn.git --branch v1.0.1 /root/kraken && \
    mkdir -p /root/.kube && cd /root/kraken && \
    pip3 install cryptography==3.3.2 && \
    pip3 install -r requirements.txt setuptools==40.3.0 urllib3==1.25.4
```
@@ -3,17 +3,17 @@
Container image gets automatically built by quay.io at [Kraken image](https://quay.io/chaos-kubox/krkn).

### Run containerized version
Refer [instructions](https://github.com/chaos-kubox/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.
Refer [instructions](https://github.com/redhat-chaos/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.


### Run Custom Kraken Image
Refer to [instructions](https://github.com/chaos-kubox/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.
Refer to [instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.


### Kraken as a KubeApp

To run containerized Kraken as a Kubernetes/OpenShift Deployment, follow these steps:
1. Configure the [config.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) file according to your requirements.
1. Configure the [config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) file according to your requirements.
2. Create a namespace under which you want to run the kraken pod using `kubectl create ns <namespace>`.
3. Switch to `<namespace>` namespace:
   - In Kubernetes, use `kubectl config set-context --current --namespace=<namespace>`

```
@@ -18,7 +18,7 @@ spec:
          privileged: true
        image: quay.io/chaos-kubox/krkn
        command: ["/bin/sh", "-c"]
        args: ["python3 run_kraken.py -c config/config.yaml"]
        args: ["python3.9 run_kraken.py -c config/config.yaml"]
        volumeMounts:
          - mountPath: "/root/.kube"
            name: config
```
@@ -1,6 +1,6 @@
## Alerts

Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports alerting based on the queries defined by the user and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) by setting the following:
Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports alerting based on the queries defined by the user and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:

```
performance_monitoring:
@@ -12,7 +12,7 @@ performance_monitoring:
```

### Alert profile
A couple of [alert profiles](https://github.com/chaos-kubox/krkn/tree/main/config) [alerts](https://github.com/chaos-kubox/krkn/blob/main/config/alerts) are shipped by default and can be tweaked to add more queries to alert on. The following are a few alerts examples:
A couple of [alert profiles](https://github.com/redhat-chaos/krkn/tree/main/config) [alerts](https://github.com/redhat-chaos/krkn/blob/main/config/alerts) are shipped by default and can be tweaked to add more queries to alert on. The following are a few alerts examples:

```
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
```
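The hunk above shows only the PromQL expression of an alert. For orientation, a kube-burner style alert entry also carries a description and a severity that decides how hard the run fails; a hedged sketch (description text and severity value are illustrative, so compare against the shipped alert profiles):

```yaml
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: 99th percentile etcd WAL fsync latency stayed above 10ms  # illustrative wording
  severity: error                                                        # illustrative; controls pass/fail impact
```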
@@ -5,6 +5,7 @@ Supported Cloud Providers:
* [Openstack](#openstack)
* [Azure](#azure)
* [Alibaba](#alibaba)
* [VMware](#vmware)

## AWS

@@ -53,3 +54,15 @@ See the [Installation guide](https://www.alibabacloud.com/help/en/alibaba-cloud-
Refer to [region and zone page](https://www.alibabacloud.com/help/en/elastic-compute-service/latest/regions-and-zones#concept-2459516) to get the region id for the region you are running on.

Set cloud_type to either alibaba or alicloud in your node scenario yaml file.

## VMware

Set the following environment variables

1. ```export VSPHERE_IP=<vSphere_client_IP_address>```

2. ```export VSPHERE_USERNAME=<vSphere_client_username>```

3. ```export VSPHERE_PASSWORD=<vSphere_client_password>```

These are the credentials that you would normally use to access the vSphere client.
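To tie this to the node scenario files mentioned above: the cloud provider is selected per scenario through its `cloud_type` field. A minimal, hypothetical entry; only `cloud_type` and its alibaba/alicloud values come from the text above, the other keys are illustrative assumptions to show where the field sits:

```yaml
node_scenarios:
  - actions:                                         # assumed key: which node action(s) to run
      - node_stop_start_scenario
    label_selector: node-role.kubernetes.io/worker   # assumed key: how target nodes are selected
    cloud_type: alibaba                              # or alicloud; use vmware for vSphere-backed clusters
```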
@@ -1,5 +1,5 @@
#### Kubernetes/OpenShift cluster shut down scenario
Scenario to shut down all the nodes including the masters and restart them after specified duration. Cluster shut down scenario can be injected by placing the shut_down config file under cluster_shut_down_scenario option in the kraken config. Refer to [cluster_shut_down_scenario](https://github.com/chaos-kubox/krkn/blob/main/scenarios/cluster_shut_down_scenario.yml) config file.
Scenario to shut down all the nodes including the masters and restart them after specified duration. Cluster shut down scenario can be injected by placing the shut_down config file under cluster_shut_down_scenario option in the kraken config. Refer to [cluster_shut_down_scenario](https://github.com/redhat-chaos/krkn/blob/main/scenarios/cluster_shut_down_scenario.yml) config file.

Refer to [cloud setup](cloud_setup.md) to configure your cli properly for the cloud provider of the cluster you want to shut down.
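A sketch of how that wiring could look in the main config, assuming the scenario list follows the same nested-list shape as the other scenario types shown earlier in this page (the exact option name should be checked against the shipped config/config.yaml):

```yaml
kraken:
  chaos_scenarios:
    - cluster_shut_down_scenarios:                   # assumed list name for the cluster_shut_down_scenario option
      - - scenarios/cluster_shut_down_scenario.yml   # the linked scenario file holds the shutdown parameters
```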
@@ -1,4 +1,4 @@
### Config
Set the scenarios to inject and the tunings like duration to wait between each scenario in the config file located at [config/config.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml).
Set the scenarios to inject and the tunings like duration to wait between each scenario in the config file located at [config/config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml).

**NOTE**: [config](https://github.com/chaos-kubox/krkn/blob/main/config/config_performance.yaml) can be used if leveraging the [automated way](https://github.com/chaos-kubox/krkn#setting-up-infrastructure-dependencies) to install the infrastructure pieces.
**NOTE**: [config](https://github.com/redhat-chaos/krkn/blob/main/config/config_performance.yaml) can be used if leveraging the [automated way](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies) to install the infrastructure pieces.
@@ -23,7 +23,7 @@ In all scenarios we do a post chaos check to wait and verify the specific compon
Here there are two options:
1. Pass a custom script in the main config scenario list that will run before the chaos and verify the output matches post chaos scenario.

   See [scenarios/post_action_etcd_container.py](https://github.com/chaos-kubox/krkn/blob/main/scenarios/post_action_etcd_container.py) for an example.
   See [scenarios/post_action_etcd_container.py](https://github.com/redhat-chaos/krkn/blob/main/scenarios/post_action_etcd_container.py) for an example.
```
- container_scenarios: # List of chaos pod scenarios to load.
- - scenarios/container_etcd.yml
```
@@ -1,52 +1,26 @@
## Getting Started Running Chaos Scenarios

#### Adding New Scenarios
Adding a new scenario is as simple as adding a new config file under [scenarios directory](https://github.com/chaos-kubox/krkn/tree/main/scenarios) and defining it in the main kraken [config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml#L8).
Adding a new scenario is as simple as adding a new config file under [scenarios directory](https://github.com/redhat-chaos/krkn/tree/main/scenarios) and defining it in the main kraken [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml#L8).
You can either copy an existing yaml file and make it your own, or fill in one of the templates below to suit your needs.

### Templates
#### Pod Scenario Yaml Template
For example, for adding a pod level scenario for a new application, refer to the sample scenario below to know what fields are necessary and what to add in each location:
```
config:
  runStrategy:
    runs: <number of times to execute the scenario>
    # This will choose a random number to wait between min and max
    maxSecondsBetweenRuns: 30
    minSecondsBetweenRuns: 1
scenarios:
  - name: "delete pods example"
    steps:
      - podAction:
          matches:
            - labels:
                namespace: "<namespace>"
                selector: "<pod label>" # This can be left blank.
          filters:
            - randomSample:
                size: <number of pods to kill>
          actions:
            - kill:
                probability: 1
                force: true
      - podAction:
          matches:
            - labels:
                namespace: "<namespace>"
                selector: "<pod label>" # This can be left blank.
          retries:
            retriesTimeout:
              # Amount of time to wait with retrying, before failing if pod count does not match expected
              # timeout: 180.

          actions:
            - checkPodCount:
                count: <expected number of pods that match namespace and label>
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^<namespace>$
    label_selector: <pod label>
    kill: <number of pods to kill>
- id: wait-for-pods
  config:
    namespace_pattern: ^<namespace>$
    label_selector: <pod label>
    count: <expected number of pods that match namespace and label>
```

More information on specific items that you can add to the pod killing scenarios can be found in the [powerfulseal policies](https://powerfulseal.github.io/powerfulseal/policies) documentation


#### Node Scenario Yaml Template

```
```
@@ -90,18 +90,18 @@ We want to look at this in terms of CPU, Memory, Disk, Throughput, Network etc.


### Tooling
Now that we looked at the best practices, In this section, we will go through how [Kraken](https://github.com/chaos-kubox/krkn) - a chaos testing framework can help test the resilience of OpenShift and make sure the applications and services are following the best practices.
Now that we looked at the best practices, In this section, we will go through how [Kraken](https://github.com/redhat-chaos/krkn) - a chaos testing framework can help test the resilience of OpenShift and make sure the applications and services are following the best practices.

#### Workflow
Let us start by understanding the workflow of kraken: the user will start by running kraken by pointing to a specific OpenShift cluster using kubeconfig to be able to talk to the platform on top of which the OpenShift cluster is hosted. This can be done by either the oc/kubectl API or the cloud API. Based on the configuration of kraken, it will inject specific chaos scenarios as shown below, talk to [Cerberus](https://github.com/chaos-kubox/cerberus) to get the go/no-go signal representing the overall health of the cluster ( optional - can be turned off ), scrapes metrics from in-cluster prometheus given a metrics profile with the promql queries and stores them long term in Elasticsearch configured ( optional - can be turned off ), evaluates the promql expressions specified in the alerts profile ( optional - can be turned off ) and aggregated everything to set the pass/fail i.e. exits 0 or 1. More about the metrics collection, cerberus and metrics evaluation can be found in the next section.
Let us start by understanding the workflow of kraken: the user will start by running kraken by pointing to a specific OpenShift cluster using kubeconfig to be able to talk to the platform on top of which the OpenShift cluster is hosted. This can be done by either the oc/kubectl API or the cloud API. Based on the configuration of kraken, it will inject specific chaos scenarios as shown below, talk to [Cerberus](https://github.com/redhat-chaos/cerberus) to get the go/no-go signal representing the overall health of the cluster ( optional - can be turned off ), scrapes metrics from in-cluster prometheus given a metrics profile with the promql queries and stores them long term in Elasticsearch configured ( optional - can be turned off ), evaluates the promql expressions specified in the alerts profile ( optional - can be turned off ) and aggregated everything to set the pass/fail i.e. exits 0 or 1. More about the metrics collection, cerberus and metrics evaluation can be found in the next section.


#### Cluster recovery checks, metrics evaluation and pass/fail criteria
- Most of the scenarios have built in checks to verify if the targeted component recovered from the failure after the specified duration of time but there might be cases where other components might have an impact because of a certain failure and it’s extremely important to make sure that the system/application is healthy as a whole post chaos. This is exactly where [Cerberus](https://github.com/chaos-kubox/cerberus) comes to the rescue.
- Most of the scenarios have built in checks to verify if the targeted component recovered from the failure after the specified duration of time but there might be cases where other components might have an impact because of a certain failure and it’s extremely important to make sure that the system/application is healthy as a whole post chaos. This is exactly where [Cerberus](https://github.com/redhat-chaos/cerberus) comes to the rescue.
If the monitoring tool, cerberus is enabled it will consume the signal and continue running chaos or not based on that signal.

- Apart from checking the recovery and cluster health status, it’s equally important to evaluate the performance metrics like latency, resource usage spikes, throughput, etcd health like disk fsync, leader elections etc. To help with this, Kraken has a way to evaluate promql expressions from the incluster prometheus and set the exit status to 0 or 1 based on the severity set for each of the query. Details on how to use this feature can be found [here](https://github.com/chaos-kubox/krkn#alerts).
- Apart from checking the recovery and cluster health status, it’s equally important to evaluate the performance metrics like latency, resource usage spikes, throughput, etcd health like disk fsync, leader elections etc. To help with this, Kraken has a way to evaluate promql expressions from the incluster prometheus and set the exit status to 0 or 1 based on the severity set for each of the query. Details on how to use this feature can be found [here](https://github.com/redhat-chaos/krkn#alerts).

- The overall pass or fail of kraken is based on the recovery of the specific component (within a certain amount of time), the cerberus health signal which tracks the health of the entire cluster and metrics evaluation from incluster prometheus.
@@ -112,17 +112,17 @@ If the monitoring tool, cerberus is enabled it will consume the signal and conti

Let us take a look at how to run the chaos scenarios on your OpenShift clusters using Kraken-hub - a lightweight wrapper around Kraken to ease the runs by providing the ability to run them by just running container images using podman with parameters set as environment variables. This eliminates the need to carry around and edit configuration files and makes it easy for any CI framework integration. Here are the scenarios supported:

- Pod Scenarios ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/pod-scenarios.md))
- Pod Scenarios ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/pod-scenarios.md))
  - Disrupts OpenShift/Kubernetes and applications deployed as pods:
  - Helps understand the availability of the application, the initialization timing and recovery status.
  - [Demo](https://asciinema.org/a/452351?speed=3&theme=solarized-dark)

- Container Scenarios ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/container-scenarios.md))
- Container Scenarios ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/container-scenarios.md))
  - Disrupts OpenShift/Kubernetes and applications deployed as containers running as part of a pod(s) using a specified kill signal to mimic failures:
  - Helps understand the impact and recovery timing when the program/process running in the containers are disrupted - hangs, paused, killed etc., using various kill signals, i.e. SIGHUP, SIGTERM, SIGKILL etc.
  - [Demo](https://asciinema.org/a/BXqs9JSGDSEKcydTIJ5LpPZBM?speed=3&theme=solarized-dark)

- Node Scenarios ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-scenarios.md))
- Node Scenarios ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-scenarios.md))
  - Disrupts nodes as part of the cluster infrastructure by talking to the cloud API. AWS, Azure, GCP, OpenStack and Baremetal are the supported platforms as of now. Possible disruptions include:
    - Terminate nodes
    - Fork bomb inside the node
@@ -131,18 +131,18 @@ Let us take a look at how to run the chaos scenarios on your OpenShift clusters
    - etc.
  - [Demo](https://asciinema.org/a/ANZY7HhPdWTNaWt4xMFanF6Q5)

- Zone Outages ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/zone-outages.md))
- Zone Outages ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/zone-outages.md))
  - Creates outage of availability zone(s) in a targeted region in the public cloud where the OpenShift cluster is running by tweaking the network acl of the zone to simulate the failure, and that in turn will stop both ingress and egress traffic from all nodes in a particular zone for the specified duration and reverts it back to the previous state.
  - Helps understand the impact on both Kubernetes/OpenShift control plane as well as applications and services running on the worker nodes in that zone.
  - Currently, only set up for AWS cloud platform: 1 VPC and multiples subnets within the VPC can be specified.
  - [Demo](https://asciinema.org/a/452672?speed=3&theme=solarized-dark)

- Application Outages ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/application-outages.md))
- Application Outages ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/application-outages.md))
  - Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during the downtime.
  - Helps understand how the dependent services react to the unavailability.
  - [Demo](https://asciinema.org/a/452403?speed=3&theme=solarized-dark)

- Power Outages ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/power-outages.md))
- Power Outages ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/power-outages.md))
  - This scenario imitates a power outage by shutting down of the entire cluster for a specified duration of time, then restarts all the nodes after the specified time and checks the health of the cluster.
  - There are various use cases in the customer environments. For example, when some of the clusters are shutdown in cases where the applications are not needed to run in a particular time/season in order to save costs.
  - The nodes are stopped in parallel to mimic a power outage i.e., pulling off the plug
@@ -151,24 +151,24 @@ Let us take a look at how to run the chaos scenarios on your OpenShift clusters
- Resource Hog
  - Hogs CPU, Memory and IO on the targeted nodes
  - Helps understand if the application/system components have reserved resources to not get disrupted because of rogue applications, or get performance throttled.
  - CPU Hog ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-cpu-hog.md), [Demo](https://asciinema.org/a/452762))
  - Memory Hog ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-memory-hog.md), [Demo](https://asciinema.org/a/452742?speed=3&theme=solarized-dark))
  - IO Hog ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/node-io-hog.md))
  - CPU Hog ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-cpu-hog.md), [Demo](https://asciinema.org/a/452762))
  - Memory Hog ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-memory-hog.md), [Demo](https://asciinema.org/a/452742?speed=3&theme=solarized-dark))
  - IO Hog ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-io-hog.md))

- Time Skewing ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/time-scenarios.md))
- Time Skewing ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/time-scenarios.md))
  - Manipulate the system time and/or date of specific pods/nodes.
  - Verify scheduling of objects so they continue to work.
  - Verify time gets reset properly.

- Namespace Failures ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/namespace-scenarios.md))
- Namespace Failures ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/namespace-scenarios.md))
  - Delete namespaces for the specified duration.
  - Helps understand the impact on other components and tests/improves recovery time of the components in the targeted namespace.

- Persistent Volume Fill ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/pvc-scenarios.md))
- Persistent Volume Fill ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/pvc-scenarios.md))
  - Fills up the persistent volumes, up to a given percentage, used by the pod for the specified duration.
  - Helps understand how an application deals when it is no longer able to write data to the disk. For example, kafka’s behavior when it is not able to commit data to the disk.

- Network Chaos ([Documentation](https://github.com/chaos-kubox/krkn-hub/blob/main/docs/network-chaos.md))
- Network Chaos ([Documentation](https://github.com/redhat-chaos/krkn-hub/blob/main/docs/network-chaos.md))
  - Scenarios supported includes:
    - Network latency
    - Packet loss
@@ -9,28 +9,29 @@ The following ways are supported to run Kraken:

**NOTE**: It is recommended to run Kraken external to the cluster ( Standalone or Containerized ) hitting the Kubernetes/OpenShift API as running it internal to the cluster might be disruptive to itself and also might not report back the results if the chaos leads to cluster's API server instability.

**NOTE**: To run Kraken on Power (ppc64le) architecture, build and run a containerized version by following the
instructions given [here](https://github.com/chaos-kubox/krkn/blob/main/containers/build_own_image-README.md).
instructions given [here](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md).

### Git

#### Clone the repository
Pick the latest stable release to install [here](https://github.com/redhat-chaos/krkn/releases).
```
$ git clone https://github.com/openshift-scale/krkn.git
$ git clone https://github.com/redhat-chaos/krkn.git --branch <release version>
$ cd kraken
```

#### Install the dependencies
```
$ python3 -m venv chaos
$ python3.9 -m venv chaos
$ source chaos/bin/activate
$ pip3 install -r requirements.txt
$ pip3.9 install -r requirements.txt
```

**NOTE**: Make sure python3-devel and latest pip versions are installed on the system. The dependencies install has been tested with pip >= 21.1.3 versions.

#### Run
```
$ python3 run_kraken.py --config <config_file_location>
$ python3.9 run_kraken.py --config <config_file_location>
```

### Run containerized version
@@ -50,8 +51,8 @@ $ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config
```
$ podman logs -f kraken
```

If you want to build your own kraken image see [here](https://github.com/chaos-kubox/krkn/blob/main/containers/build_own_image-README.md)
If you want to build your own kraken image see [here](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md)


### Run Kraken as a Kubernetes deployment
Refer [Instructions](https://github.com/chaos-kubox/krkn/blob/main/containers/README.md) on how to deploy and run Kraken as a Kubernetes/OpenShift deployment.
Refer [Instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/README.md) on how to deploy and run Kraken as a Kubernetes/OpenShift deployment.
@@ -36,6 +36,6 @@ The following are the start of scenarios for which a chaos scenario config exist

Scenario | Description | Working
------------------------ |-----------------------------------------------------------------------------------------| ------------------------- |
[Node CPU Hog](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_cpu_hog_engine.yaml) | Chaos scenario that hogs up the CPU on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node Memory Hog](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_mem_engine.yaml) | Chaos scenario that hogs up the memory on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node IO Hog](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_io_engine.yaml) | Chaos scenario that hogs up the IO on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node CPU Hog](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_cpu_hog_engine.yaml) | Chaos scenario that hogs up the CPU on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node Memory Hog](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_mem_engine.yaml) | Chaos scenario that hogs up the memory on a defined node for a specific amount of time. | :heavy_check_mark: |
[Node IO Hog](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_io_engine.yaml) | Chaos scenario that hogs up the IO on a defined node for a specific amount of time. | :heavy_check_mark: |
@@ -2,7 +2,7 @@

There are cases where the state of the cluster and metrics on the cluster during the chaos test run need to be stored long term to review after the cluster is terminated, for example CI and automation test runs. To help with this, Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes them into Elasticsearch. The indexed metrics can be visualized with the help of Grafana.

It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus ( installed by default in OpenShift ) with the start and end timestamp of the run. Each run has a unique identifier ( uuid ) and all the metrics/documents in Elasticsearch will be associated with it. The uuid is generated automatically if not set in the config. This feature can be enabled in the [config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml) by setting the following:
It uses [Kube-burner](https://github.com/cloud-bulldozer/kube-burner) under the hood. The metrics to capture need to be defined in a metrics profile which Kraken consumes to query prometheus ( installed by default in OpenShift ) with the start and end timestamp of the run. Each run has a unique identifier ( uuid ) and all the metrics/documents in Elasticsearch will be associated with it. The uuid is generated automatically if not set in the config. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:

```
performance_monitoring:
@@ -16,7 +16,7 @@ performance_monitoring:
```

### Metrics profile
A couple of [metric profiles](https://github.com/chaos-kubox/krkn/tree/main/config), [metrics.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/metrics.yaml), and [metrics-aggregated.yaml](https://github.com/chaos-kubox/krkn/blob/main/config/metrics-aggregated.yaml) are shipped by default and can be tweaked to add more metrics to capture during the run. The following are the API server metrics for example:
A couple of [metric profiles](https://github.com/redhat-chaos/krkn/tree/main/config), [metrics.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/metrics.yaml), and [metrics-aggregated.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/metrics-aggregated.yaml) are shipped by default and can be tweaked to add more metrics to capture during the run. The following are the API server metrics for example:

```
metrics:
```
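The hunk cuts the example off right after `metrics:`. For orientation, a kube-burner style metrics-profile entry pairs a PromQL query with a metric name under which it is indexed; a hedged sketch (the query shown is illustrative, so compare against the shipped metrics.yaml):

```yaml
metrics:
  - query: irate(apiserver_request_total{verb="POST"}[2m])   # illustrative PromQL query
    metricName: APIRequestRate                               # name under which results are indexed
```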
@@ -16,7 +16,7 @@ Set to '^.*$' and label_selector to "" to randomly select any namespace in your

**sleep:** Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set

Refer to [namespace_scenarios_example](https://github.com/chaos-kubox/krkn/blob/main/scenarios/regex_namespace.yaml) config file.
Refer to [namespace_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/regex_namespace.yaml) config file.

```
scenarios:
```
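The example above is truncated right after `scenarios:`. A hedged sketch of what one entry might look like, built only from the knobs described on this page (namespace regex, label_selector, sleep); the key names are assumptions inferred from those descriptions and should be checked against the linked regex_namespace.yaml:

```yaml
scenarios:
  - namespace: ^openshift-etcd$   # assumed key: regex for the namespace(s) to target; '^.*$' picks at random
    label_selector: ""            # assumed key: optional namespace label filter, as described above
    sleep: 10                     # seconds to wait between iterations, per the sleep option above
```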
@@ -1,7 +1,7 @@
### Network chaos
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node's host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.

##### Sample scenario config
##### Sample scenario config for egress traffic shaping
```
network_chaos: # Scenario to create an outage by simulating random variations in the network.
  duration: 300 # In seconds - duration network chaos will be applied.
@@ -17,6 +17,29 @@ network_chaos: # Scenario to create an outage
  bandwidth: 100mbit
```

##### Sample scenario config for ingress traffic shaping (using a plugin)
```
- id: network_chaos
  config:
    node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
      ip-10-0-128-153.us-west-2.compute.internal:
        - ens5
        - genev_sys_6081
    label_selector: node-role.kubernetes.io/master # When node_interface_name is not specified, nodes with matching label_selector is selected for node chaos scenario injection
    instance_count: 1 # Number of nodes to perform action/select that match the label selector
    kubeconfig_path: /root/.kube/config # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
    execution_type: parallel # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
    network_params:
      latency: 50ms
      loss: '0.02'
      bandwidth: 100mbit
    wait_duration: 120
    test_duration: 60
```

Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.


##### Steps
- Pick the nodes to introduce the network anomaly either from node_name or label_selector.
- Verify interface list in one of the nodes or use the interface with a default route, as test interface, if no interface is specified by the user.
@@ -4,7 +4,7 @@ The following node chaos scenarios are supported:

1. **node_start_scenario**: Scenario to start the node instance.
2. **node_stop_scenario**: Scenario to stop the node instance.
3. **node_stop_start_scenario**: Scenario to stop and then start the node instance.
3. **node_stop_start_scenario**: Scenario to stop and then start the node instance. Not supported on VMware.
4. **node_termination_scenario**: Scenario to terminate the node instance.
5. **node_reboot_scenario**: Scenario to reboot the node instance.
6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
@@ -12,13 +12,14 @@ The following node chaos scenarios are supported:
8. **node_crash_scenario**: Scenario to crash the node instance.
9. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.


**NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.

**NOTE**: node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario
, node_reboot_scenario and stop_start_kubelet_scenario are supported only on AWS, Azure, OpenStack, BareMetal, GCP
, and Alibaba as of now.
, VMware and Alibaba as of now.

**NOTE**: Node scenarios are supported only when running the standalone version of Kraken until https://github.com/chaos-kubox/krkn/issues/106 gets fixed.
**NOTE**: Node scenarios are supported only when running the standalone version of Kraken until https://github.com/redhat-chaos/krkn/issues/106 gets fixed.


#### AWS
@@ -64,13 +65,17 @@ How to set up Alibaba cli to run node scenarios is defined [here](cloud_setup.md
. Releasing a node is 2 steps, stopping the node and then releasing it.


#### VMware
How to set up VMware vSphere to run node scenarios is defined [here](cloud_setup.md#vmware)


#### General

**NOTE**: The `node_crash_scenario` and `stop_kubelet_scenario` scenarios are supported independent of the cloud platform.

Use 'generic' or do not add the 'cloud_type' key to your scenario if your cluster is not set up using one of the current supported cloud types.

Node scenarios can be injected by placing the node scenarios config files under node_scenarios option in the kraken config. Refer to [node_scenarios_example](https://github.com/chaos-kubox/krkn/blob/main/scenarios/node_scenarios_example.yml) config file.
Node scenarios can be injected by placing the node scenarios config files under node_scenarios option in the kraken config. Refer to [node_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/node_scenarios_example.yml) config file.

```
```
@@ -1,14 +1,40 @@
### Pod Scenarios
Kraken consumes [Powerfulseal](https://github.com/powerfulseal/powerfulseal) under the hood to run the pod scenarios.
These scenarios are in a simple yaml format that you can manipulate to run your specific tests or use the pre-existing scenarios to see how it works.

Krkn recently replaced PowerfulSeal with its own internal pod scenarios using a plugin system. You can run pod scenarios by adding the following config to Krkn:

```yaml
kraken:
  chaos_scenarios:
    - plugin_scenarios:
      - path/to/scenario.yaml
```

You can then create the scenario file with the following contents:

```yaml
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^kube-system$
    label_selector: k8s-app=kube-scheduler
- id: wait-for-pods
  config:
    namespace_pattern: ^kube-system$
    label_selector: k8s-app=kube-scheduler
    count: 3
```

Please adjust the schema reference to point to the [schema file](../scenarios/plugin.schema.json). This file will give you code completion and documentation for the available options in your IDE.

#### Pod Chaos Scenarios

The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.

Component | Description | Working
------------------------ |----------------------------------------------------------------------------------------------| ------------------------- |
[Etcd](https://github.com/chaos-kubox/krkn/blob/main/scenarios/etcd.yml) | Kills a single/multiple etcd replicas for the specified number of times in a loop. | :heavy_check_mark: |
[Kube ApiServer](https://github.com/chaos-kubox/krkn/blob/main/scenarios/openshift-kube-apiserver.yml) | Kills a single/multiple kube-apiserver replicas for the specified number of times in a loop. | :heavy_check_mark: |
[ApiServer](https://github.com/chaos-kubox/krkn/blob/main/scenarios/openshift-apiserver.yml) | Kills a single/multiple apiserver replicas for the specified number of times in a loop. | :heavy_check_mark: |
[Prometheus](https://github.com/chaos-kubox/krkn/blob/main/scenarios/prometheus.yml) | Kills a single/multiple prometheus replicas for the specified number of times in a loop. | :heavy_check_mark: |
[OpenShift System Pods](https://github.com/chaos-kubox/krkn/blob/main/scenarios/regex_openshift_pod_kill.yml) | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |
| Component | Description | Working |
| ------------------------ |-------------| -------- |
| [Basic pod scenario](../scenarios/kube/pod.yml) | Kill a pod. | :heavy_check_mark: |
| [Etcd](../scenarios/openshift/etcd.yml) | Kills a single/multiple etcd replicas. | :heavy_check_mark: |
| [Kube ApiServer](../scenarios/openshift/openshift-kube-apiserver.yml)| Kills a single/multiple kube-apiserver replicas. | :heavy_check_mark: |
| [ApiServer](../scenarios/openshift/openshift-apiserver.yml) | Kills a single/multiple apiserver replicas. | :heavy_check_mark: |
| [Prometheus](../scenarios/openshift/prometheus.yml) | Kills a single/multiple prometheus replicas. | :heavy_check_mark: |
| [OpenShift System Pods](../scenarios/openshift/regex_openshift_pod_kill.yml) | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |
@@ -16,7 +16,7 @@ Configuration Options:

**object_name:** List of the names of pods or nodes you want to skew.

Refer to [time_scenarios_example](https://github.com/chaos-kubox/krkn/blob/main/scenarios/time_scenarios_example.yml) config file.
Refer to [time_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/time_scenarios_example.yml) config file.

```
time_scenarios:
```
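The example is cut off after `time_scenarios:`. A hedged sketch of one entry, where only `object_name` and the pod/node skewing idea come from the options documented above; the other keys and values are assumptions to show the shape and should be checked against the linked time_scenarios_example.yml:

```yaml
time_scenarios:
  - action: skew_time                  # assumed key/value: skew the clock on the selected objects
    object_type: pod                   # assumed key: pods or nodes can be targeted, per the description above
    object_name:                       # list of pod/node names to skew, per the object_name option above
      - apiserver-868595fcbb-6qnsc     # illustrative pod name
```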
@@ -1,5 +1,5 @@
|
||||
### Zone outage scenario
|
||||
Scenario to create outage in a targeted zone in the public cloud to understand the impact on both Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone. It tweaks the network acl of the zone to simulate the failure and that in turn will stop both ingress and egress traffic from all the nodes in a particular zone for the specified duration and reverts it back to the previous state. Zone outage can be injected by placing the zone_outage config file under zone_outages option in the [kraken config](https://github.com/chaos-kubox/krkn/blob/main/config/config.yaml). Refer to [zone_outage_scenario](https://github.com/chaos-kubox/krkn/blob/main/scenarios/zone_outage.yaml) config file for the parameters that need to be defined.
|
||||
Scenario to create outage in a targeted zone in the public cloud to understand the impact on both Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone. It tweaks the network acl of the zone to simulate the failure and that in turn will stop both ingress and egress traffic from all the nodes in a particular zone for the specified duration and reverts it back to the previous state. Zone outage can be injected by placing the zone_outage config file under zone_outages option in the [kraken config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml). Refer to [zone_outage_scenario](https://github.com/redhat-chaos/krkn/blob/main/scenarios/zone_outage.yaml) config file for the parameters that need to be defined.
|
||||
|
||||
Refer to [cloud setup](cloud_setup.md) to configure your cli properly for the cloud provider of the cluster you want to shut down.
|
||||
|
||||
|
||||
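To make the mechanism concrete, here is a minimal sketch of the ACL-swap approach described above, assuming an AWS cluster, the boto3 EC2 client, and placeholder ACL/association IDs. This is only an illustration of the idea, not Kraken's implementation; refer to the zone_outage scenario config linked above for the real parameters.

```python
import time

import boto3  # assumption: AWS is the cloud provider

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region


def zone_outage(association_id: str, deny_all_acl_id: str, original_acl_id: str, duration: int):
    """Swap a subnet's network ACL to a deny-all ACL, wait, then revert it."""
    # Point the subnet's ACL association at the deny-all ACL; this blocks
    # both ingress and egress traffic for the nodes in that subnet/zone.
    resp = ec2.replace_network_acl_association(
        AssociationId=association_id, NetworkAclId=deny_all_acl_id
    )
    new_association_id = resp["NewAssociationId"]
    try:
        time.sleep(duration)  # outage window
    finally:
        # Revert to the original ACL so traffic flows again.
        ec2.replace_network_acl_association(
            AssociationId=new_association_id, NetworkAclId=original_acl_id
        )
```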
kraken/cerberus/setup.py (new file, 136 lines)
@@ -0,0 +1,136 @@
|
||||
import logging
|
||||
import requests
|
||||
import sys
|
||||
import json
|
||||
|
||||
|
||||
def get_status(config, start_time, end_time):
|
||||
"""
|
||||
Get cerberus status
|
||||
"""
|
||||
cerberus_status = True
|
||||
check_application_routes = False
|
||||
application_routes_status = True
|
||||
if config["cerberus"]["cerberus_enabled"]:
|
||||
cerberus_url = config["cerberus"]["cerberus_url"]
|
||||
check_application_routes = \
|
||||
config["cerberus"]["check_applicaton_routes"]
|
||||
if not cerberus_url:
|
||||
logging.error(
|
||||
"url where Cerberus publishes True/False signal "
|
||||
"is not provided."
|
||||
)
|
||||
sys.exit(1)
|
||||
cerberus_status = requests.get(cerberus_url, timeout=60).content
|
||||
cerberus_status = True if cerberus_status == b"True" else False
|
||||
|
||||
# Fail if the application routes monitored by cerberus
|
||||
# experience downtime during the chaos
|
||||
if check_application_routes:
|
||||
application_routes_status, unavailable_routes = application_status(
|
||||
cerberus_url,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
if not application_routes_status:
|
||||
logging.error(
|
||||
"Application routes: %s monitored by cerberus "
|
||||
"encountered downtime during the run, failing"
|
||||
% unavailable_routes
|
||||
)
|
||||
else:
|
||||
logging.info(
|
||||
"Application routes being monitored "
|
||||
"didn't encounter any downtime during the run!"
|
||||
)
|
||||
|
||||
if not cerberus_status:
|
||||
logging.error(
|
||||
"Received a no-go signal from Cerberus, looks like "
|
||||
"the cluster is unhealthy. Please check the Cerberus "
|
||||
"report for more details. Test failed."
|
||||
)
|
||||
|
||||
if not application_routes_status or not cerberus_status:
|
||||
sys.exit(1)
|
||||
else:
|
||||
logging.info(
|
||||
"Received a go signal from Ceberus, the cluster is healthy. "
|
||||
"Test passed."
|
||||
)
|
||||
return cerberus_status
|
||||
|
||||
|
||||
def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
|
||||
"""
|
||||
Publish kraken status to cerberus
|
||||
"""
|
||||
cerberus_status = get_status(config, start_time, end_time)
|
||||
if not cerberus_status:
|
||||
if failed_post_scenarios:
|
||||
if config["kraken"]["exit_on_failure"]:
|
||||
logging.info(
|
||||
"Cerberus status is not healthy and post action scenarios "
|
||||
"are still failing, exiting kraken run"
|
||||
)
|
||||
sys.exit(1)
|
||||
else:
|
||||
logging.info(
|
||||
"Cerberus status is not healthy and post action scenarios "
|
||||
"are still failing"
|
||||
)
|
||||
else:
|
||||
if failed_post_scenarios:
|
||||
if config["kraken"]["exit_on_failure"]:
|
||||
logging.info(
|
||||
"Cerberus status is healthy but post action scenarios "
|
||||
"are still failing, exiting kraken run"
|
||||
)
|
||||
sys.exit(1)
|
||||
else:
|
||||
logging.info(
|
||||
"Cerberus status is healthy but post action scenarios "
|
||||
"are still failing"
|
||||
)
|
||||
|
||||
|
||||
def application_status(cerberus_url, start_time, end_time):
|
||||
"""
|
||||
Check application availability
|
||||
"""
|
||||
if not cerberus_url:
|
||||
logging.error(
|
||||
"url where Cerberus publishes True/False signal is not provided."
|
||||
)
|
||||
sys.exit(1)
|
||||
else:
|
||||
duration = (end_time - start_time) / 60
|
||||
url = "{baseurl}/history?loopback={duration}".format(
|
||||
baseurl=cerberus_url,
|
||||
duration=str(duration)
|
||||
)
|
||||
logging.info(
|
||||
"Scraping the metrics for the test "
|
||||
"duration from cerberus url: %s" % url
|
||||
)
|
||||
try:
|
||||
failed_routes = []
|
||||
status = True
|
||||
metrics = requests.get(url, timeout=60).content
|
||||
metrics_json = json.loads(metrics)
|
||||
for entry in metrics_json["history"]["failures"]:
|
||||
if entry["component"] == "route":
|
||||
name = entry["name"]
|
||||
failed_routes.append(name)
|
||||
status = False
|
||||
else:
|
||||
continue
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to scrape metrics from cerberus API at %s: %s" % (
|
||||
url,
|
||||
e
|
||||
)
|
||||
)
|
||||
sys.exit(1)
|
||||
return status, set(failed_routes)
|
||||
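For orientation, the following is a minimal sketch of how the helpers above are meant to be wired into a chaos run, assuming a config dict with the same `cerberus` and `kraken` keys the functions read (URL and values are placeholders):

```python
import time

import kraken.cerberus.setup as cerberus

config = {
    "cerberus": {
        "cerberus_enabled": True,
        "cerberus_url": "http://0.0.0.0:8080",   # placeholder URL
        "check_applicaton_routes": False,        # key spelled as in the code above
    },
    "kraken": {"exit_on_failure": False},
}

start_time = int(time.time())
# ... inject a chaos scenario here ...
end_time = int(time.time())

# May exit the process if Cerberus reports the cluster unhealthy
# (and, depending on exit_on_failure, if post-action scenarios failed).
cerberus.publish_kraken_status(config, [], start_time, end_time)
```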
@@ -1,6 +0,0 @@
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class CerberusConfig:
|
||||
cerberus_url: str
|
||||
@@ -1,13 +0,0 @@
|
||||
import requests as requests
|
||||
|
||||
from kraken.health.cerberus.config import CerberusConfig
|
||||
from kraken.health.health import HealthChecker, HealthCheckDecision
|
||||
|
||||
|
||||
class CerberusHealthChecker(HealthChecker):
|
||||
def __init__(self, config: CerberusConfig):
|
||||
self._config = config
|
||||
|
||||
def check(self) -> HealthCheckDecision:
|
||||
cerberus_status = requests.get(self._config.cerberus_url, timeout=60).content
|
||||
return HealthCheckDecision.GO if cerberus_status == b"True" else HealthCheckDecision.STOP
|
||||
@@ -1,14 +0,0 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class HealthCheckDecision(Enum):
|
||||
GO = "GO"
|
||||
PAUSE = "PAUSE"
|
||||
STOP = "STOP"
|
||||
|
||||
|
||||
class HealthChecker(ABC):
|
||||
@abstractmethod
|
||||
def check(self) -> HealthCheckDecision:
|
||||
pass
|
||||
@@ -1,12 +1,17 @@
|
||||
from kubernetes import client, config
|
||||
from kubernetes.stream import stream
|
||||
from kubernetes.client.rest import ApiException
|
||||
import logging
|
||||
import kraken.invoke.command as runcommand
|
||||
import sys
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
|
||||
from kubernetes import client, config, utils, watch
|
||||
from kubernetes.client.rest import ApiException
|
||||
from kubernetes.dynamic.client import DynamicClient
|
||||
from kubernetes.stream import stream
|
||||
|
||||
from ..kubernetes.resources import (PVC, ChaosEngine, ChaosResult, Container,
|
||||
LitmusChaosObject, Pod, Volume,
|
||||
VolumeMount)
|
||||
|
||||
kraken_node_name = ""
|
||||
|
||||
|
||||
@@ -14,10 +19,19 @@ kraken_node_name = ""
|
||||
def initialize_clients(kubeconfig_path):
|
||||
global cli
|
||||
global batch_cli
|
||||
global watch_resource
|
||||
global api_client
|
||||
global dyn_client
|
||||
global custom_object_client
|
||||
try:
|
||||
config.load_kube_config(kubeconfig_path)
|
||||
cli = client.CoreV1Api()
|
||||
batch_cli = client.BatchV1Api()
|
||||
watch_resource = watch.Watch()
|
||||
api_client = client.ApiClient()
|
||||
custom_object_client = client.CustomObjectsApi()
|
||||
k8s_client = config.new_client_from_config()
|
||||
dyn_client = DynamicClient(k8s_client)
|
||||
except ApiException as e:
|
||||
logging.error("Failed to initialize kubernetes client: %s\n" % e)
|
||||
sys.exit(1)
|
||||
@@ -29,10 +43,12 @@ def get_host() -> str:
|
||||
|
||||
|
||||
def get_clusterversion_string() -> str:
|
||||
"""Returns clusterversion status text on OpenShift, empty string on other distributions"""
|
||||
"""
|
||||
Returns clusterversion status text on OpenShift, empty string
|
||||
on other distributions
|
||||
"""
|
||||
try:
|
||||
custom_objects_api = client.CustomObjectsApi()
|
||||
cvs = custom_objects_api.list_cluster_custom_object(
|
||||
cvs = custom_object_client.list_cluster_custom_object(
|
||||
"config.openshift.io",
|
||||
"v1",
|
||||
"clusterversions",
|
||||
@@ -54,11 +70,16 @@ def list_namespaces(label_selector=None):
|
||||
namespaces = []
|
||||
try:
|
||||
if label_selector:
|
||||
ret = cli.list_namespace(pretty=True, label_selector=label_selector)
|
||||
ret = cli.list_namespace(
|
||||
pretty=True,
|
||||
label_selector=label_selector
|
||||
)
|
||||
else:
|
||||
ret = cli.list_namespace(pretty=True)
|
||||
except ApiException as e:
|
||||
logging.error("Exception when calling CoreV1Api->list_namespaced_pod: %s\n" % e)
|
||||
logging.error(
|
||||
"Exception when calling CoreV1Api->list_namespaced_pod: %s\n" % e
|
||||
)
|
||||
raise e
|
||||
for namespace in ret.items:
|
||||
namespaces.append(namespace.metadata.name)
|
||||
@@ -71,7 +92,9 @@ def get_namespace_status(namespace_name):
|
||||
try:
|
||||
ret = cli.read_namespace_status(namespace_name)
|
||||
except ApiException as e:
|
||||
logging.error("Exception when calling CoreV1Api->read_namespace_status: %s\n" % e)
|
||||
logging.error(
|
||||
"Exception when calling CoreV1Api->read_namespace_status: %s\n" % e
|
||||
)
|
||||
return ret.status.phase
|
||||
|
||||
|
||||
@@ -79,7 +102,9 @@ def delete_namespace(namespace):
|
||||
"""Deletes a given namespace using kubernetes python client"""
|
||||
try:
|
||||
api_response = cli.delete_namespace(namespace)
|
||||
logging.debug("Namespace deleted. status='%s'" % str(api_response.status))
|
||||
logging.debug(
|
||||
"Namespace deleted. status='%s'" % str(api_response.status)
|
||||
)
|
||||
return api_response
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
@@ -105,7 +130,10 @@ def check_namespaces(namespaces, label_selectors=None):
|
||||
break
|
||||
invalid_namespaces = regex_namespaces - valid_regex
|
||||
if invalid_namespaces:
|
||||
raise Exception("There exists no namespaces matching: %s" % (invalid_namespaces))
|
||||
raise Exception(
|
||||
"There exists no namespaces matching: %s" %
|
||||
(invalid_namespaces)
|
||||
)
|
||||
return list(final_namespaces)
|
||||
except Exception as e:
|
||||
logging.info("%s" % (e))
|
||||
@@ -152,7 +180,11 @@ def list_pods(namespace, label_selector=None):
|
||||
pods = []
|
||||
try:
|
||||
if label_selector:
|
||||
ret = cli.list_namespaced_pod(namespace, pretty=True, label_selector=label_selector)
|
||||
ret = cli.list_namespaced_pod(
|
||||
namespace,
|
||||
pretty=True,
|
||||
label_selector=label_selector
|
||||
)
|
||||
else:
|
||||
ret = cli.list_namespaced_pod(namespace, pretty=True)
|
||||
except ApiException as e:
|
||||
@@ -170,7 +202,10 @@ def list_pods(namespace, label_selector=None):
|
||||
def get_all_pods(label_selector=None):
|
||||
pods = []
|
||||
if label_selector:
|
||||
ret = cli.list_pod_for_all_namespaces(pretty=True, label_selector=label_selector)
|
||||
ret = cli.list_pod_for_all_namespaces(
|
||||
pretty=True,
|
||||
label_selector=label_selector
|
||||
)
|
||||
else:
|
||||
ret = cli.list_pod_for_all_namespaces(pretty=True)
|
||||
for pod in ret.items:
|
||||
@@ -179,7 +214,13 @@ def get_all_pods(label_selector=None):
|
||||
|
||||
|
||||
# Execute command in pod
|
||||
def exec_cmd_in_pod(command, pod_name, namespace, container=None, base_command="bash"):
|
||||
def exec_cmd_in_pod(
|
||||
command,
|
||||
pod_name,
|
||||
namespace,
|
||||
container=None,
|
||||
base_command="bash"
|
||||
):
|
||||
|
||||
exec_command = [base_command, "-c", command]
|
||||
try:
|
||||
@@ -230,7 +271,10 @@ def create_pod(body, namespace, timeout=120):
|
||||
pod_stat = cli.create_namespaced_pod(body=body, namespace=namespace)
|
||||
end_time = time.time() + timeout
|
||||
while True:
|
||||
pod_stat = cli.read_namespaced_pod(name=body["metadata"]["name"], namespace=namespace)
|
||||
pod_stat = cli.read_namespaced_pod(
|
||||
name=body["metadata"]["name"],
|
||||
namespace=namespace
|
||||
)
|
||||
if pod_stat.status.phase == "Running":
|
||||
break
|
||||
if time.time() > end_time:
|
||||
@@ -250,7 +294,10 @@ def read_pod(name, namespace="default"):
|
||||
|
||||
def get_pod_log(name, namespace="default"):
|
||||
return cli.read_namespaced_pod_log(
|
||||
name=name, namespace=namespace, _return_http_data_only=True, _preload_content=False
|
||||
name=name,
|
||||
namespace=namespace,
|
||||
_return_http_data_only=True,
|
||||
_preload_content=False
|
||||
)
|
||||
|
||||
|
||||
@@ -268,7 +315,10 @@ def delete_job(name, namespace="default"):
|
||||
api_response = batch_cli.delete_namespaced_job(
|
||||
name=name,
|
||||
namespace=namespace,
|
||||
body=client.V1DeleteOptions(propagation_policy="Foreground", grace_period_seconds=0),
|
||||
body=client.V1DeleteOptions(
|
||||
propagation_policy="Foreground",
|
||||
grace_period_seconds=0
|
||||
),
|
||||
)
|
||||
logging.debug("Job deleted. status='%s'" % str(api_response.status))
|
||||
return api_response
|
||||
@@ -290,7 +340,10 @@ def delete_job(name, namespace="default"):
|
||||
|
||||
def create_job(body, namespace="default"):
|
||||
try:
|
||||
api_response = batch_cli.create_namespaced_job(body=body, namespace=namespace)
|
||||
api_response = batch_cli.create_namespaced_job(
|
||||
body=body,
|
||||
namespace=namespace
|
||||
)
|
||||
return api_response
|
||||
except ApiException as api:
|
||||
logging.warn(
|
||||
@@ -311,7 +364,10 @@ def create_job(body, namespace="default"):
|
||||
|
||||
def get_job_status(name, namespace="default"):
|
||||
try:
|
||||
return batch_cli.read_namespaced_job_status(name=name, namespace=namespace)
|
||||
return batch_cli.read_namespaced_job_status(
|
||||
name=name,
|
||||
namespace=namespace
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
@@ -321,22 +377,6 @@ def get_job_status(name, namespace="default"):
|
||||
raise
|
||||
|
||||
|
||||
# Obtain node status
|
||||
def get_node_status(node, timeout=60):
|
||||
try:
|
||||
node_info = cli.read_node_status(node, pretty=True, _request_timeout=timeout)
|
||||
except ApiException as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
CoreV1Api->read_node_status: %s\n"
|
||||
% e
|
||||
)
|
||||
return None
|
||||
for condition in node_info.status.conditions:
|
||||
if condition.type == "Ready":
|
||||
return condition.status
|
||||
|
||||
|
||||
# Monitor the status of the cluster nodes and set the status to true or false
|
||||
def monitor_nodes():
|
||||
nodes = list_nodes()
|
||||
@@ -375,7 +415,11 @@ def monitor_namespace(namespace):
|
||||
notready_pods = []
|
||||
for pod in pods:
|
||||
try:
|
||||
pod_info = cli.read_namespaced_pod_status(pod, namespace, pretty=True)
|
||||
pod_info = cli.read_namespaced_pod_status(
|
||||
pod,
|
||||
namespace,
|
||||
pretty=True
|
||||
)
|
||||
except ApiException as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
@@ -384,7 +428,11 @@ def monitor_namespace(namespace):
|
||||
)
|
||||
raise e
|
||||
pod_status = pod_info.status.phase
|
||||
if pod_status != "Running" and pod_status != "Completed" and pod_status != "Succeeded":
|
||||
if (
|
||||
pod_status != "Running" and
|
||||
pod_status != "Completed" and
|
||||
pod_status != "Succeeded"
|
||||
):
|
||||
notready_pods.append(pod)
|
||||
if len(notready_pods) != 0:
|
||||
status = False
|
||||
@@ -395,11 +443,328 @@ def monitor_namespace(namespace):
|
||||
|
||||
# Monitor component namespace
|
||||
def monitor_component(iteration, component_namespace):
|
||||
watch_component_status, failed_component_pods = monitor_namespace(component_namespace)
|
||||
logging.info("Iteration %s: %s: %s" % (iteration, component_namespace, watch_component_status))
|
||||
watch_component_status, failed_component_pods = \
|
||||
monitor_namespace(component_namespace)
|
||||
logging.info(
|
||||
"Iteration %s: %s: %s" % (
|
||||
iteration,
|
||||
component_namespace,
|
||||
watch_component_status
|
||||
)
|
||||
)
|
||||
return watch_component_status, failed_component_pods
|
||||
|
||||
|
||||
def apply_yaml(path, namespace='default'):
|
||||
"""
|
||||
Apply yaml config to create Kubernetes resources
|
||||
|
||||
Args:
|
||||
path (string)
|
||||
- Path to the YAML file
|
||||
namespace (string)
|
||||
- Namespace to create the resource
|
||||
|
||||
Returns:
|
||||
The object created
|
||||
"""
|
||||
|
||||
return utils.create_from_yaml(
|
||||
api_client,
|
||||
yaml_file=path,
|
||||
namespace=namespace
|
||||
)
|
||||
|
||||
|
||||
def get_pod_info(name: str, namespace: str = 'default') -> Pod:
|
||||
"""
|
||||
Function to retrieve information about a specific pod
|
||||
in a given namespace. The kubectl command is given by:
|
||||
kubectl get pods <name> -n <namespace>
|
||||
|
||||
Args:
|
||||
name (string)
|
||||
- Name of the pod
|
||||
|
||||
namespace (string)
|
||||
- Namespace to look for the pod
|
||||
|
||||
Returns:
|
||||
- Data class object of type Pod with the output of the above
|
||||
kubectl command in the given format if the pod exists
|
||||
- Returns None if the pod doesn't exist
|
||||
"""
|
||||
pod_exists = check_if_pod_exists(name=name, namespace=namespace)
|
||||
if pod_exists:
|
||||
response = cli.read_namespaced_pod(
|
||||
name=name,
|
||||
namespace=namespace,
|
||||
pretty='true'
|
||||
)
|
||||
container_list = []
|
||||
|
||||
# Create a list of containers present in the pod
|
||||
for container in response.spec.containers:
|
||||
volume_mount_list = []
|
||||
for volume_mount in container.volume_mounts:
|
||||
volume_mount_list.append(
|
||||
VolumeMount(
|
||||
name=volume_mount.name,
|
||||
mountPath=volume_mount.mount_path
|
||||
)
|
||||
)
|
||||
container_list.append(
|
||||
Container(
|
||||
name=container.name,
|
||||
image=container.image,
|
||||
volumeMounts=volume_mount_list
|
||||
)
|
||||
)
|
||||
|
||||
for i, container in enumerate(response.status.container_statuses):
|
||||
container_list[i].ready = container.ready
|
||||
|
||||
# Create a list of volumes associated with the pod
|
||||
volume_list = []
|
||||
for volume in response.spec.volumes:
|
||||
volume_name = volume.name
|
||||
pvc_name = (
|
||||
volume.persistent_volume_claim.claim_name
|
||||
if volume.persistent_volume_claim is not None
|
||||
else None
|
||||
)
|
||||
volume_list.append(Volume(name=volume_name, pvcName=pvc_name))
|
||||
|
||||
# Create the Pod data class object
|
||||
pod_info = Pod(
|
||||
name=response.metadata.name,
|
||||
podIP=response.status.pod_ip,
|
||||
namespace=response.metadata.namespace,
|
||||
containers=container_list,
|
||||
nodeName=response.spec.node_name,
|
||||
volumes=volume_list
|
||||
)
|
||||
return pod_info
|
||||
else:
|
||||
logging.error(
|
||||
"Pod '%s' doesn't exist in namespace '%s'" % (
|
||||
str(name),
|
||||
str(namespace)
|
||||
)
|
||||
)
|
||||
return None
|
||||
|
||||
|
||||
def get_litmus_chaos_object(
|
||||
kind: str,
|
||||
name: str,
|
||||
namespace: str
|
||||
) -> LitmusChaosObject:
|
||||
"""
|
||||
Function that returns an object of a custom resource type of
|
||||
the litmus project. Currently, only ChaosEngine and ChaosResult
|
||||
objects are supported.
|
||||
|
||||
Args:
|
||||
kind (string)
|
||||
- The custom resource type
|
||||
|
||||
namespace (string)
|
||||
- Namespace where the custom object is present
|
||||
|
||||
Returns:
|
||||
Data class object of a subclass of LitmusChaosObject
|
||||
"""
|
||||
|
||||
group = 'litmuschaos.io'
|
||||
version = 'v1alpha1'
|
||||
|
||||
if kind.lower() == 'chaosengine':
|
||||
plural = 'chaosengines'
|
||||
response = custom_object_client.get_namespaced_custom_object(
|
||||
group=group,
|
||||
plural=plural,
|
||||
version=version,
|
||||
namespace=namespace,
|
||||
name=name
|
||||
)
|
||||
try:
|
||||
engine_status = response['status']['engineStatus']
|
||||
exp_status = response['status']['experiments'][0]['status']
|
||||
except Exception:
|
||||
engine_status = 'Not Initialized'
|
||||
exp_status = 'Not Initialized'
|
||||
custom_object = ChaosEngine(
|
||||
kind='ChaosEngine',
|
||||
group=group,
|
||||
namespace=namespace,
|
||||
name=name,
|
||||
plural=plural,
|
||||
version=version,
|
||||
engineStatus=engine_status,
|
||||
expStatus=exp_status
|
||||
)
|
||||
elif kind.lower() == 'chaosresult':
|
||||
plural = 'chaosresults'
|
||||
response = custom_object_client.get_namespaced_custom_object(
|
||||
group=group,
|
||||
plural=plural,
|
||||
version=version,
|
||||
namespace=namespace,
|
||||
name=name
|
||||
)
|
||||
try:
|
||||
verdict = response['status']['experimentStatus']['verdict']
|
||||
fail_step = response['status']['experimentStatus']['failStep']
|
||||
except Exception:
|
||||
verdict = 'N/A'
|
||||
fail_step = 'N/A'
|
||||
custom_object = ChaosResult(
|
||||
kind='ChaosResult',
|
||||
group=group,
|
||||
namespace=namespace,
|
||||
name=name,
|
||||
plural=plural,
|
||||
version=version,
|
||||
verdict=verdict,
|
||||
failStep=fail_step
|
||||
)
|
||||
else:
|
||||
logging.error("Invalid litmus chaos custom resource name")
|
||||
custom_object = None
|
||||
return custom_object
|
||||
|
||||
|
||||
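# Example usage of get_litmus_chaos_object (mirrors how the Litmus helpers
# elsewhere in this change call it):
#
#   engine_status = get_litmus_chaos_object(
#       kind='chaosengine', name=engine_name, namespace=namespace
#   ).engineStatus
#   verdict = get_litmus_chaos_object(
#       kind='chaosresult', name=engine_name + '-' + experiment_name,
#       namespace=namespace
#   ).verdict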
def check_if_namespace_exists(name: str) -> bool:
|
||||
"""
|
||||
Function that checks if a namespace exists by parsing through
|
||||
the list of projects.
|
||||
Args:
|
||||
name (string)
|
||||
- Namespace name
|
||||
|
||||
Returns:
|
||||
Boolean value indicating whether the namespace exists or not
|
||||
"""
|
||||
|
||||
v1_projects = dyn_client.resources.get(
|
||||
api_version='project.openshift.io/v1',
|
||||
kind='Project'
|
||||
)
|
||||
project_list = v1_projects.get()
|
||||
return True if name in str(project_list) else False
|
||||
|
||||
|
||||
def check_if_pod_exists(name: str, namespace: str) -> bool:
|
||||
"""
|
||||
Function that checks if a pod exists in the given namespace
|
||||
Args:
|
||||
name (string)
|
||||
- Pod name
|
||||
|
||||
namespace (string)
|
||||
- Namespace name
|
||||
|
||||
Returns:
|
||||
Boolean value indicating whether the pod exists or not
|
||||
"""
|
||||
|
||||
namespace_exists = check_if_namespace_exists(namespace)
|
||||
if namespace_exists:
|
||||
pod_list = list_pods(namespace=namespace)
|
||||
if name in pod_list:
|
||||
return True
|
||||
else:
|
||||
logging.error("Namespace '%s' doesn't exist" % str(namespace))
|
||||
return False
|
||||
|
||||
|
||||
def check_if_pvc_exists(name: str, namespace: str) -> bool:
|
||||
"""
|
||||
Function that checks if a namespace exists by parsing through
|
||||
the list of projects.
|
||||
Args:
|
||||
name (string)
|
||||
- PVC name
|
||||
|
||||
namespace (string)
|
||||
- Namespace name
|
||||
|
||||
Returns:
|
||||
Boolean value indicating whether the Persistent Volume Claim
|
||||
exists or not.
|
||||
"""
|
||||
namespace_exists = check_if_namespace_exists(namespace)
|
||||
if namespace_exists:
|
||||
response = cli.list_namespaced_persistent_volume_claim(
|
||||
namespace=namespace
|
||||
)
|
||||
pvc_list = [pvc.metadata.name for pvc in response.items]
|
||||
if name in pvc_list:
|
||||
return True
|
||||
else:
|
||||
logging.error("Namespace '%s' doesn't exist" % str(namespace))
|
||||
return False
|
||||
|
||||
|
||||
def get_pvc_info(name: str, namespace: str) -> PVC:
|
||||
"""
|
||||
Function to retrieve information about a Persistent Volume Claim in a
|
||||
given namespace
|
||||
|
||||
Args:
|
||||
name (string)
|
||||
- Name of the persistent volume claim
|
||||
|
||||
namespace (string)
|
||||
- Namespace where the persistent volume claim is present
|
||||
|
||||
Returns:
|
||||
- A PVC data class containing the name, capacity, volume name,
|
||||
namespace and associated pod names of the PVC if the PVC exists
|
||||
- Returns None if the PVC doesn't exist
|
||||
"""
|
||||
|
||||
pvc_exists = check_if_pvc_exists(name=name, namespace=namespace)
|
||||
if pvc_exists:
|
||||
pvc_info_response = cli.read_namespaced_persistent_volume_claim(
|
||||
name=name,
|
||||
namespace=namespace,
|
||||
pretty=True
|
||||
)
|
||||
pod_list_response = cli.list_namespaced_pod(namespace=namespace)
|
||||
|
||||
capacity = pvc_info_response.status.capacity['storage']
|
||||
volume_name = pvc_info_response.spec.volume_name
|
||||
|
||||
# Loop through all pods in the namespace to find associated PVCs
|
||||
pvc_pod_list = []
|
||||
for pod in pod_list_response.items:
|
||||
for volume in pod.spec.volumes:
|
||||
if (
|
||||
volume.persistent_volume_claim is not None
|
||||
and volume.persistent_volume_claim.claim_name == name
|
||||
):
|
||||
pvc_pod_list.append(pod.metadata.name)
|
||||
|
||||
pvc_info = PVC(
|
||||
name=name,
|
||||
capacity=capacity,
|
||||
volumeName=volume_name,
|
||||
podNames=pvc_pod_list,
|
||||
namespace=namespace
|
||||
)
|
||||
return pvc_info
|
||||
else:
|
||||
logging.error(
|
||||
"PVC '%s' doesn't exist in namespace '%s'" % (
|
||||
str(name),
|
||||
str(namespace)
|
||||
)
|
||||
)
|
||||
return None
|
||||
|
||||
|
||||
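# Illustrative usage of the two helpers above (pod/PVC names are hypothetical):
#
#   pod = get_pod_info(name="etcd-member-0", namespace="openshift-etcd")
#   if pod is not None:
#       logging.info("Pod %s runs on node %s" % (pod.name, pod.nodeName))
#
#   pvc = get_pvc_info(name="my-claim", namespace="default")
#   if pvc is not None:
#       logging.info("PVC %s is mounted by pods %s" % (pvc.name, pvc.podNames))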
# Find the node kraken is deployed on
|
||||
# Set global kraken node to not delete
|
||||
def find_kraken_node():
|
||||
@@ -415,16 +780,40 @@ def find_kraken_node():
|
||||
if kraken_pod_name:
|
||||
# get kraken-deployment pod, find node name
|
||||
try:
|
||||
node_name = runcommand.invoke(
|
||||
"kubectl get pods/"
|
||||
+ str(kraken_pod_name)
|
||||
+ ' -o jsonpath="{.spec.nodeName}"'
|
||||
+ " -n"
|
||||
+ str(kraken_project)
|
||||
)
|
||||
|
||||
node_name = get_pod_info(kraken_pod_name, kraken_project).nodeName
|
||||
global kraken_node_name
|
||||
kraken_node_name = node_name
|
||||
except Exception as e:
|
||||
logging.info("%s" % (e))
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
# Watch for a specific node status
|
||||
def watch_node_status(node, status, timeout, resource_version):
|
||||
count = timeout
|
||||
for event in watch_resource.stream(
|
||||
cli.list_node,
|
||||
field_selector=f"metadata.name={node}",
|
||||
timeout_seconds=timeout,
|
||||
resource_version=f"{resource_version}"
|
||||
):
|
||||
conditions = [
|
||||
status
|
||||
for status in event["object"].status.conditions
|
||||
if status.type == "Ready"
|
||||
]
|
||||
if conditions[0].status == status:
|
||||
watch_resource.stop()
|
||||
break
|
||||
else:
|
||||
count -= 1
|
||||
logging.info(
|
||||
"Status of node " + node + ": " + str(conditions[0].status)
|
||||
)
|
||||
if not count:
|
||||
watch_resource.stop()
|
||||
|
||||
|
||||
# Get the resource version for the specified node
|
||||
def get_node_resource_version(node):
|
||||
return cli.read_node(name=node).metadata.resource_version
|
||||
|
||||
@@ -1,125 +0,0 @@
|
||||
import unittest
|
||||
from dataclasses import dataclass
|
||||
from typing import Dict, List
|
||||
from kubernetes import config, client
|
||||
from kubernetes.client.models import V1Pod, V1PodSpec, V1ObjectMeta, V1Container
|
||||
from kubernetes.client.exceptions import ApiException
|
||||
|
||||
|
||||
@dataclass
|
||||
class Pod:
|
||||
"""
|
||||
A pod is a simplified representation of a Kubernetes pod. We only extract the data we need in krkn.
|
||||
"""
|
||||
name: str
|
||||
namespace: str
|
||||
labels: Dict[str, str]
|
||||
|
||||
|
||||
class Client:
|
||||
"""
|
||||
This is the implementation of all Kubernetes API calls used in Krkn.
|
||||
"""
|
||||
|
||||
def __init__(self, kubeconfig_path: str = None):
|
||||
# Note: this function replicates much of the functionality already represented in the Kubernetes Python client,
|
||||
# but in an object-oriented manner. This allows for creating multiple clients and accessing multiple clusters
|
||||
# with minimal effort if needed, which the procedural implementation doesn't allow.
|
||||
if kubeconfig_path is None:
|
||||
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
|
||||
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
|
||||
|
||||
if kubeconfig.config is None:
|
||||
raise config.ConfigException(
|
||||
'Invalid kube-config file: %s. '
|
||||
'No configuration found.' % kubeconfig_path)
|
||||
loader = config.kube_config.KubeConfigLoader(
|
||||
config_dict=kubeconfig.config,
|
||||
)
|
||||
client_config = client.Configuration()
|
||||
loader.load_and_set(client_config)
|
||||
self.client = client.ApiClient(configuration=client_config)
|
||||
self.core_v1 = client.CoreV1Api(self.client)
|
||||
|
||||
@staticmethod
|
||||
def _convert_pod(pod: V1Pod) -> Pod:
|
||||
return Pod(
|
||||
name=pod.metadata.name,
|
||||
namespace=pod.metadata.namespace,
|
||||
labels=pod.metadata.labels
|
||||
)
|
||||
|
||||
def create_test_pod(self) -> Pod:
|
||||
"""
|
||||
create_test_pod creates a test pod in the default namespace that can be safely killed.
|
||||
"""
|
||||
return self._convert_pod(self.core_v1.create_namespaced_pod(
|
||||
"default",
|
||||
V1Pod(
|
||||
metadata=V1ObjectMeta(
|
||||
generate_name="test-",
|
||||
),
|
||||
spec=V1PodSpec(
|
||||
containers=[
|
||||
V1Container(
|
||||
name="test",
|
||||
image="alpine",
|
||||
tty=True,
|
||||
)
|
||||
]
|
||||
),
|
||||
)
|
||||
))
|
||||
|
||||
def list_all_pods(self, label_selector: str = None) -> List[Pod]:
|
||||
"""
|
||||
list_all_pods lists all pods in all namespaces, possibly with a label selector applied.
|
||||
"""
|
||||
try:
|
||||
pod_response = self.core_v1.list_pod_for_all_namespaces(watch=False, label_selector=label_selector)
|
||||
pod_list: List[client.models.V1Pod] = pod_response.items
|
||||
result: List[Pod] = []
|
||||
for pod in pod_list:
|
||||
result.append(self._convert_pod(pod))
|
||||
return result
|
||||
except ApiException as e:
|
||||
if e.status == 404:
|
||||
raise NotFoundException(e)
|
||||
raise
|
||||
|
||||
def get_pod(self, name: str, namespace: str = "default") -> Pod:
|
||||
"""
|
||||
get_pod returns a pod based on the name and a namespace.
|
||||
"""
|
||||
try:
|
||||
return self._convert_pod(self.core_v1.read_namespaced_pod(name, namespace))
|
||||
except ApiException as e:
|
||||
if e.status == 404:
|
||||
raise NotFoundException(e)
|
||||
raise
|
||||
|
||||
def remove_pod(self, name: str, namespace: str = "default"):
|
||||
"""
|
||||
remove_pod removes a pod based on the name and namespace. A NotFoundException is raised if the pod doesn't
|
||||
exist.
|
||||
"""
|
||||
try:
|
||||
self.core_v1.delete_namespaced_pod(name, namespace)
|
||||
except ApiException as e:
|
||||
if e.status == 404:
|
||||
raise NotFoundException(e)
|
||||
raise
|
||||
|
||||
|
||||
class NotFoundException(Exception):
|
||||
"""
|
||||
NotFoundException is an exception specific to the scenario Kubernetes abstraction and is thrown when a specific
|
||||
resource (e.g. a pod) cannot be found.
|
||||
"""
|
||||
|
||||
def __init__(self, cause: Exception):
|
||||
self.__cause__ = cause
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
kraken/kubernetes/resources.py (new file, 74 lines)
@@ -0,0 +1,74 @@
|
||||
from dataclasses import dataclass
|
||||
from typing import List
|
||||
|
||||
|
||||
@dataclass(frozen=True, order=False)
|
||||
class Volume:
|
||||
"""Data class to hold information regarding volumes in a pod"""
|
||||
name: str
|
||||
pvcName: str
|
||||
|
||||
|
||||
@dataclass(order=False)
|
||||
class VolumeMount:
|
||||
"""Data class to hold information regarding volume mounts"""
|
||||
name: str
|
||||
mountPath: str
|
||||
|
||||
|
||||
@dataclass(frozen=True, order=False)
|
||||
class PVC:
|
||||
"""Data class to hold information regarding persistent volume claims"""
|
||||
name: str
|
||||
capacity: str
|
||||
volumeName: str
|
||||
podNames: List[str]
|
||||
namespace: str
|
||||
|
||||
|
||||
@dataclass(order=False)
|
||||
class Container:
|
||||
"""Data class to hold information regarding containers in a pod"""
|
||||
image: str
|
||||
name: str
|
||||
volumeMounts: List[VolumeMount]
|
||||
ready: bool = False
|
||||
|
||||
|
||||
@dataclass(frozen=True, order=False)
|
||||
class Pod:
|
||||
"""Data class to hold information regarding a pod"""
|
||||
name: str
|
||||
podIP: str
|
||||
namespace: str
|
||||
containers: List[Container]
|
||||
nodeName: str
|
||||
volumes: List[Volume]
|
||||
|
||||
|
||||
@dataclass(frozen=True, order=False)
|
||||
class LitmusChaosObject:
|
||||
"""Data class to hold information regarding a custom object of litmus project"""
|
||||
kind: str
|
||||
group: str
|
||||
namespace: str
|
||||
name: str
|
||||
plural: str
|
||||
version: str
|
||||
|
||||
|
||||
@dataclass(frozen=True, order=False)
|
||||
class ChaosEngine(LitmusChaosObject):
|
||||
"""Data class to hold information regarding a ChaosEngine object"""
|
||||
engineStatus: str
|
||||
expStatus: str
|
||||
|
||||
|
||||
@dataclass(frozen=True, order=False)
|
||||
class ChaosResult(LitmusChaosObject):
|
||||
"""Data class to hold information regarding a ChaosResult object"""
|
||||
verdict: str
|
||||
failStep: str
|
||||
|
||||
|
||||
|
||||
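As a quick illustration of how these data classes fit together, the sketch below builds a Pod by hand; all values are made up for the example, and in practice these objects are constructed from the cluster by kraken.kubernetes.client (for example get_pod_info and get_pvc_info above).

```python
from kraken.kubernetes.resources import Container, Pod, Volume, VolumeMount

# Hypothetical values, for illustration only.
pod = Pod(
    name="hello-openshift-1",
    podIP="10.128.0.12",
    namespace="default",
    nodeName="worker-0",
    volumes=[Volume(name="data", pvcName="hello-pvc")],
    containers=[
        Container(
            name="hello-openshift",
            image="openshift/hello-openshift",
            volumeMounts=[VolumeMount(name="data", mountPath="/data")],
        )
    ],
)
```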
@@ -1,42 +0,0 @@
|
||||
import unittest
|
||||
|
||||
from kraken.scenarios import kube
|
||||
|
||||
|
||||
class TestClient(unittest.TestCase):
|
||||
def test_list_all_pods(self):
|
||||
c = kube.Client()
|
||||
pod = c.create_test_pod()
|
||||
self.addCleanup(lambda: self._remove_pod(c, pod.name, pod.namespace))
|
||||
pods = c.list_all_pods()
|
||||
for pod in pods:
|
||||
if pod.name == pod.name and pod.namespace == pod.namespace:
|
||||
return
|
||||
self.fail("The created pod %s was not in the pod list." % pod.name)
|
||||
|
||||
def test_get_pod(self):
|
||||
c = kube.Client()
|
||||
pod = c.create_test_pod()
|
||||
self.addCleanup(lambda: c.remove_pod(pod.name, pod.namespace))
|
||||
pod2 = c.get_pod(pod.name, pod.namespace)
|
||||
assert pod2.name == pod.name
|
||||
assert pod2.namespace == pod.namespace
|
||||
|
||||
def test_get_pod_notfound(self):
|
||||
c = kube.Client()
|
||||
try:
|
||||
c.get_pod("non-existent-pod")
|
||||
self.fail("Fetching a non-existent pod did not result in a NotFoundException.")
|
||||
except kube.NotFoundException:
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
def _remove_pod(c: kube.Client, pod_name: str, pod_namespace: str):
|
||||
try:
|
||||
c.remove_pod(pod_name, pod_namespace)
|
||||
except kube.NotFoundException:
|
||||
pass
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
@@ -1,4 +1,5 @@
|
||||
import kraken.invoke.command as runcommand
|
||||
import kraken.kubernetes.client as kubecli
|
||||
import logging
|
||||
import time
|
||||
import sys
|
||||
@@ -86,18 +87,17 @@ def deploy_all_experiments(version_string, namespace):
|
||||
|
||||
|
||||
def wait_for_initialized(engine_name, experiment_name, namespace):
|
||||
chaos_engine = runcommand.invoke(
|
||||
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.engineStatus}'" % (engine_name, namespace)
|
||||
)
|
||||
|
||||
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
|
||||
namespace=namespace).engineStatus
|
||||
engine_status = chaos_engine.strip()
|
||||
max_tries = 30
|
||||
engine_counter = 0
|
||||
while engine_status.lower() != "initialized":
|
||||
time.sleep(10)
|
||||
logging.info("Waiting for " + experiment_name + " to be initialized")
|
||||
chaos_engine = runcommand.invoke(
|
||||
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.engineStatus}'" % (engine_name, namespace)
|
||||
)
|
||||
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
|
||||
namespace=namespace).engineStatus
|
||||
engine_status = chaos_engine.strip()
|
||||
if engine_counter >= max_tries:
|
||||
logging.error("Chaos engine " + experiment_name + " took longer than 5 minutes to be initialized")
|
||||
@@ -117,18 +117,16 @@ def wait_for_status(engine_name, expected_status, experiment_name, namespace):
|
||||
if not response:
|
||||
logging.info("Chaos engine never initialized, exiting")
|
||||
return False
|
||||
chaos_engine = runcommand.invoke(
|
||||
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.experiments[0].status}'" % (engine_name, namespace)
|
||||
)
|
||||
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
|
||||
namespace=namespace).expStatus
|
||||
engine_status = chaos_engine.strip()
|
||||
max_tries = 30
|
||||
engine_counter = 0
|
||||
while engine_status.lower() != expected_status:
|
||||
time.sleep(10)
|
||||
logging.info("Waiting for " + experiment_name + " to be " + expected_status)
|
||||
chaos_engine = runcommand.invoke(
|
||||
"kubectl get chaosengines/%s -n %s -o jsonpath='{.status.experiments[0].status}'" % (engine_name, namespace)
|
||||
)
|
||||
chaos_engine = kubecli.get_litmus_chaos_object(kind='chaosengine', name=engine_name,
|
||||
namespace=namespace).expStatus
|
||||
engine_status = chaos_engine.strip()
|
||||
if engine_counter >= max_tries:
|
||||
logging.error("Chaos engine " + experiment_name + " took longer than 5 minutes to be " + expected_status)
|
||||
@@ -151,20 +149,14 @@ def check_experiment(engine_name, experiment_name, namespace):
|
||||
else:
|
||||
sys.exit(1)
|
||||
|
||||
chaos_result = runcommand.invoke(
|
||||
"kubectl get chaosresult %s"
|
||||
"-%s -n %s -o "
|
||||
"jsonpath='{.status.experimentStatus.verdict}'" % (engine_name, experiment_name, namespace)
|
||||
)
|
||||
chaos_result = kubecli.get_litmus_chaos_object(kind='chaosresult', name=engine_name+'-'+experiment_name,
|
||||
namespace=namespace).verdict
|
||||
if chaos_result == "Pass":
|
||||
logging.info("Engine " + str(engine_name) + " finished with status " + str(chaos_result))
|
||||
return True
|
||||
else:
|
||||
chaos_result = runcommand.invoke(
|
||||
"kubectl get chaosresult %s"
|
||||
"-%s -n %s -o jsonpath="
|
||||
"'{.status.experimentStatus.failStep}'" % (engine_name, experiment_name, namespace)
|
||||
)
|
||||
chaos_result = kubecli.get_litmus_chaos_object(kind='chaosresult', name=engine_name+'-'+experiment_name,
|
||||
namespace=namespace).failStep
|
||||
logging.info("Chaos scenario:" + engine_name + " failed with error: " + str(chaos_result))
|
||||
logging.info(
|
||||
"See 'kubectl get chaosresult %s"
|
||||
@@ -176,8 +168,7 @@ def check_experiment(engine_name, experiment_name, namespace):
|
||||
# Delete all chaos engines in a given namespace
|
||||
def delete_chaos_experiments(namespace):
|
||||
|
||||
namespace_exists = runcommand.invoke("oc get project -o name | grep -c " + namespace + " | xargs")
|
||||
if namespace_exists.strip() != "0":
|
||||
if kubecli.check_if_namespace_exists(namespace):
|
||||
chaos_exp_exists = runcommand.invoke_no_exit("kubectl get chaosexperiment")
|
||||
if "returned non-zero exit status 1" not in chaos_exp_exists:
|
||||
logging.info("Deleting all litmus experiments")
|
||||
@@ -187,8 +178,7 @@ def delete_chaos_experiments(namespace):
|
||||
# Delete all chaos engines in a given namespace
|
||||
def delete_chaos(namespace):
|
||||
|
||||
namespace_exists = runcommand.invoke("oc get project -o name | grep -c " + namespace + " | xargs")
|
||||
if namespace_exists.strip() != "0":
|
||||
if kubecli.check_if_namespace_exists(namespace):
|
||||
logging.info("Deleting all litmus run objects")
|
||||
chaos_engine_exists = runcommand.invoke_no_exit("kubectl get chaosengine")
|
||||
if "returned non-zero exit status 1" not in chaos_engine_exists:
|
||||
@@ -201,8 +191,8 @@ def delete_chaos(namespace):
|
||||
|
||||
|
||||
def uninstall_litmus(version, litmus_namespace):
|
||||
namespace_exists = runcommand.invoke("oc get project -o name | grep -c " + litmus_namespace + " | xargs")
|
||||
if namespace_exists.strip() != "0":
|
||||
|
||||
if kubecli.check_if_namespace_exists(litmus_namespace):
|
||||
logging.info("Uninstalling Litmus operator")
|
||||
runcommand.invoke_no_exit(
|
||||
"kubectl delete -n %s -f "
|
||||
|
||||
@@ -107,10 +107,7 @@ def verify_interface(test_interface, nodelst, template):
|
||||
interface_lst = output[:-1].split(",")
|
||||
for interface in test_interface:
|
||||
if interface not in interface_lst:
|
||||
logging.error(
|
||||
"Interface %s not found in node %s interface list %s" % (interface, nodelst[pod_index]),
|
||||
interface_lst,
|
||||
)
|
||||
logging.error("Interface %s not found in node %s interface list %s" % (interface, nodelst[pod_index], interface_lst))
|
||||
sys.exit(1)
|
||||
return test_interface
|
||||
finally:
|
||||
|
||||
@@ -5,7 +5,6 @@ import paramiko
|
||||
import kraken.kubernetes.client as kubecli
|
||||
import kraken.invoke.command as runcommand
|
||||
|
||||
|
||||
node_general = False
|
||||
|
||||
|
||||
@@ -30,30 +29,22 @@ def get_node(node_name, label_selector, instance_kill_count):
|
||||
return nodes_to_return
|
||||
|
||||
|
||||
# Wait till node status becomes Ready
|
||||
# Wait until the node status becomes Ready
|
||||
def wait_for_ready_status(node, timeout):
|
||||
for _ in range(timeout):
|
||||
if kubecli.get_node_status(node) == "Ready":
|
||||
break
|
||||
time.sleep(3)
|
||||
if kubecli.get_node_status(node) != "Ready":
|
||||
raise Exception("Node condition status isn't Ready")
|
||||
resource_version = kubecli.get_node_resource_version(node)
|
||||
kubecli.watch_node_status(node, "True", timeout, resource_version)
|
||||
|
||||
|
||||
# Wait till node status becomes NotReady
|
||||
# Wait until the node status becomes Not Ready
|
||||
def wait_for_not_ready_status(node, timeout):
|
||||
resource_version = kubecli.get_node_resource_version(node)
|
||||
kubecli.watch_node_status(node, "False", timeout, resource_version)
|
||||
|
||||
|
||||
# Wait until the node status becomes Unknown
|
||||
def wait_for_unknown_status(node, timeout):
|
||||
for _ in range(timeout):
|
||||
try:
|
||||
node_status = kubecli.get_node_status(node, timeout)
|
||||
if node_status is None or node_status == "Unknown":
|
||||
break
|
||||
except Exception:
|
||||
logging.error("Encountered error while getting node status, waiting 3 seconds and retrying")
|
||||
time.sleep(3)
|
||||
node_status = kubecli.get_node_status(node, timeout)
|
||||
logging.info("node status " + str(node_status))
|
||||
if node_status is not None and node_status != "Unknown":
|
||||
raise Exception("Node condition status isn't Unknown after %s seconds" % str(timeout))
|
||||
resource_version = kubecli.get_node_resource_version(node)
|
||||
kubecli.watch_node_status(node, "Unknown", timeout, resource_version)
|
||||
|
||||
|
||||
# Get the ip of the cluster node
|
||||
@@ -74,7 +65,11 @@ def check_service_status(node, service, ssh_private_key, timeout):
|
||||
i += sleeper
|
||||
logging.info("Trying to ssh to instance: %s" % (node))
|
||||
connection = ssh.connect(
|
||||
node, username="root", key_filename=ssh_private_key, timeout=800, banner_timeout=400
|
||||
node,
|
||||
username="root",
|
||||
key_filename=ssh_private_key,
|
||||
timeout=800,
|
||||
banner_timeout=400,
|
||||
)
|
||||
if connection is None:
|
||||
break
|
||||
|
||||
kraken/plugins/__init__.py (new file, 200 lines)
@@ -0,0 +1,200 @@
|
||||
import dataclasses
|
||||
import json
|
||||
import logging
|
||||
from os.path import abspath
|
||||
from typing import List, Dict
|
||||
|
||||
from arcaflow_plugin_sdk import schema, serialization, jsonschema
|
||||
import kraken.plugins.vmware.vmware_plugin as vmware_plugin
|
||||
from kraken.plugins.pod_plugin import kill_pods, wait_for_pods
|
||||
from kraken.plugins.run_python_plugin import run_python_file
|
||||
from kraken.plugins.network.ingress_shaping import network_chaos
|
||||
|
||||
|
||||
@dataclasses.dataclass
|
||||
class PluginStep:
|
||||
schema: schema.StepSchema
|
||||
error_output_ids: List[str]
|
||||
|
||||
def render_output(self, output_id: str, output_data) -> str:
|
||||
return json.dumps({
|
||||
"output_id": output_id,
|
||||
"output_data": self.schema.outputs[output_id].serialize(output_data),
|
||||
}, indent='\t')
|
||||
|
||||
|
||||
class Plugins:
|
||||
"""
|
||||
Plugins is a class that can run plugins sequentially. The output is rendered to the standard output and the process
|
||||
is aborted if a step fails.
|
||||
"""
|
||||
steps_by_id: Dict[str, PluginStep]
|
||||
|
||||
def __init__(self, steps: List[PluginStep]):
|
||||
self.steps_by_id = dict()
|
||||
for step in steps:
|
||||
if step.schema.id in self.steps_by_id:
|
||||
raise Exception(
|
||||
"Duplicate step ID: {}".format(step.schema.id)
|
||||
)
|
||||
self.steps_by_id[step.schema.id] = step
|
||||
|
||||
def run(self, file: str, kubeconfig_path: str):
|
||||
"""
|
||||
Run executes a series of steps
|
||||
"""
|
||||
data = serialization.load_from_file(abspath(file))
|
||||
if not isinstance(data, list):
|
||||
raise Exception(
|
||||
"Invalid scenario configuration file: {} expected list, found {}".format(file, type(data).__name__)
|
||||
)
|
||||
i = 0
|
||||
for entry in data:
|
||||
if not isinstance(entry, dict):
|
||||
raise Exception(
|
||||
"Invalid scenario configuration file: {} expected a list of dict's, found {} on step {}".format(
|
||||
file,
|
||||
type(entry).__name__,
|
||||
i
|
||||
)
|
||||
)
|
||||
if "id" not in entry:
|
||||
raise Exception(
|
||||
"Invalid scenario configuration file: {} missing 'id' field on step {}".format(
|
||||
file,
|
||||
i,
|
||||
)
|
||||
)
|
||||
if "config" not in entry:
|
||||
raise Exception(
|
||||
"Invalid scenario configuration file: {} missing 'config' field on step {}".format(
|
||||
file,
|
||||
i,
|
||||
)
|
||||
)
|
||||
|
||||
if entry["id"] not in self.steps_by_id:
|
||||
raise Exception(
|
||||
"Invalid step {} in {} ID: {} expected one of: {}".format(
|
||||
i,
|
||||
file,
|
||||
entry["id"],
|
||||
', '.join(self.steps_by_id.keys())
|
||||
)
|
||||
)
|
||||
step = self.steps_by_id[entry["id"]]
|
||||
unserialized_input = step.schema.input.unserialize(entry["config"])
|
||||
if "kubeconfig_path" in step.schema.input.properties:
|
||||
unserialized_input.kubeconfig_path = kubeconfig_path
|
||||
output_id, output_data = step.schema(unserialized_input)
|
||||
logging.info(step.render_output(output_id, output_data) + "\n")
|
||||
if output_id in step.error_output_ids:
|
||||
raise Exception(
|
||||
"Step {} in {} ({}) failed".format(i, file, step.schema.id)
|
||||
)
|
||||
i = i + 1
|
||||
|
||||
def json_schema(self):
|
||||
"""
|
||||
This function generates a JSON schema document and renders it from the steps passed.
|
||||
"""
|
||||
result = {
|
||||
"$id": "https://github.com/redhat-chaos/krkn/",
|
||||
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
||||
"title": "Kraken Arcaflow scenarios",
|
||||
"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
|
||||
"type": "array",
|
||||
"minContains": 1,
|
||||
"items": {
|
||||
"oneOf": [
|
||||
|
||||
]
|
||||
}
|
||||
}
|
||||
for step_id in self.steps_by_id.keys():
|
||||
step = self.steps_by_id[step_id]
|
||||
step_input = jsonschema.step_input(step.schema)
|
||||
del step_input["$id"]
|
||||
del step_input["$schema"]
|
||||
del step_input["title"]
|
||||
del step_input["description"]
|
||||
result["items"]["oneOf"].append({
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"id": {
|
||||
"type": "string",
|
||||
"const": step_id,
|
||||
},
|
||||
"config": step_input,
|
||||
},
|
||||
"required": [
|
||||
"id",
|
||||
"config",
|
||||
]
|
||||
})
|
||||
return json.dumps(result, indent="\t")
|
||||
|
||||
|
||||
PLUGINS = Plugins(
|
||||
[
|
||||
PluginStep(
|
||||
kill_pods,
|
||||
[
|
||||
"error",
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
wait_for_pods,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
run_python_file,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
vmware_plugin.node_start,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
vmware_plugin.node_stop,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
vmware_plugin.node_reboot,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
vmware_plugin.node_terminate,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
),
|
||||
PluginStep(
|
||||
network_chaos,
|
||||
[
|
||||
"error"
|
||||
]
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def run(scenarios: List[str], kubeconfig_path: str, failed_post_scenarios: List[str]) -> List[str]:
|
||||
for scenario in scenarios:
|
||||
try:
|
||||
PLUGINS.run(scenario, kubeconfig_path)
|
||||
except Exception as e:
|
||||
failed_post_scenarios.append(scenario)
|
||||
logging.error("Error while running {}: {}".format(scenario, e))
|
||||
return failed_post_scenarios
|
||||
return failed_post_scenarios
|
||||
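A brief usage sketch for this module: each scenario file is a list of `id`/`config` entries loaded through the Arcaflow SDK, and `run()` returns the scenarios that failed. Paths below are placeholders.

```python
from kraken.plugins import run

failed = run(
    scenarios=["scenarios/kube/pod.yml"],   # placeholder scenario file
    kubeconfig_path="/root/.kube/config",   # placeholder kubeconfig
    failed_post_scenarios=[],
)
if failed:
    print("Failed plugin scenarios: {}".format(failed))
```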
kraken/plugins/__main__.py (new file, 4 lines)
@@ -0,0 +1,4 @@
|
||||
from kraken.plugins import PLUGINS
|
||||
|
||||
if __name__ == "__main__":
|
||||
print(PLUGINS.json_schema())
|
||||
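Because the package now ships a `__main__` module, the JSON schema for these plugin scenarios can be regenerated with, for example, `python -m kraken.plugins`, redirecting the output to a file.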
@@ -4,8 +4,24 @@ import sys
|
||||
import json
|
||||
|
||||
|
||||
# Get cerberus status
|
||||
def get_status(config, start_time, end_time):
|
||||
"""
|
||||
Function to get Cerberus status
|
||||
|
||||
Args:
|
||||
config
|
||||
- Kraken config dictionary
|
||||
|
||||
start_time
|
||||
- The time when chaos is injected
|
||||
|
||||
end_time
|
||||
- The time when chaos is removed
|
||||
|
||||
Returns:
|
||||
Cerberus status
|
||||
"""
|
||||
|
||||
cerberus_status = True
|
||||
check_application_routes = False
|
||||
application_routes_status = True
|
||||
@@ -43,8 +59,24 @@ def get_status(config, start_time, end_time):
|
||||
return cerberus_status
|
||||
|
||||
|
||||
# Function to publish kraken status to cerberus
|
||||
def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
|
||||
"""
|
||||
Function to publish Kraken status to Cerberus
|
||||
|
||||
Args:
|
||||
config
|
||||
- Kraken config dictionary
|
||||
|
||||
failed_post_scenarios
|
||||
- String containing the failed post scenarios
|
||||
|
||||
start_time
|
||||
- The time when chaos is injected
|
||||
|
||||
end_time
|
||||
- The time when chaos is removed
|
||||
"""
|
||||
|
||||
cerberus_status = get_status(config, start_time, end_time)
|
||||
if not cerberus_status:
|
||||
if failed_post_scenarios:
|
||||
@@ -66,8 +98,24 @@ def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
|
||||
logging.info("Cerberus status is healthy but post action scenarios " "are still failing")
|
||||
|
||||
|
||||
# Check application availability
|
||||
def application_status(cerberus_url, start_time, end_time):
|
||||
"""
|
||||
Function to check application availability
|
||||
|
||||
Args:
|
||||
cerberus_url
|
||||
- url where Cerberus publishes True/False signal
|
||||
|
||||
start_time
|
||||
- The time when chaos is injected
|
||||
|
||||
end_time
|
||||
- The time when chaos is removed
|
||||
|
||||
Returns:
|
||||
Application status and failed routes
|
||||
"""
|
||||
|
||||
if not cerberus_url:
|
||||
logging.error("url where Cerberus publishes True/False signal is not provided.")
|
||||
sys.exit(1)
|
||||
kraken/plugins/network/ingress_shaping.py (new file, 937 lines)
@@ -0,0 +1,937 @@
|
||||
from dataclasses import dataclass, field
|
||||
import yaml
|
||||
import logging
|
||||
import time
|
||||
import sys
|
||||
import os
|
||||
import re
|
||||
from traceback import format_exc
|
||||
from jinja2 import Environment, FileSystemLoader
|
||||
from . import kubernetes_functions as kube_helper
|
||||
from . import cerberus
|
||||
import typing
|
||||
from arcaflow_plugin_sdk import validation, plugin
|
||||
from kubernetes.client.api.core_v1_api import CoreV1Api as CoreV1Api
|
||||
from kubernetes.client.api.batch_v1_api import BatchV1Api as BatchV1Api
|
||||
|
||||
|
||||
@dataclass
|
||||
class NetworkScenarioConfig:
|
||||
|
||||
node_interface_name: typing.Dict[
|
||||
str, typing.List[str]
|
||||
] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Node Interface Name",
|
||||
"description":
|
||||
"Dictionary with node names as key and values as a list of "
|
||||
"their test interfaces. "
|
||||
"Required if label_selector is not set.",
|
||||
}
|
||||
)
|
||||
|
||||
label_selector: typing.Annotated[
|
||||
typing.Optional[str], validation.required_if_not("node_interface_name")
|
||||
] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Label selector",
|
||||
"description":
|
||||
"Kubernetes label selector for the target nodes. "
|
||||
"Required if node_interface_name is not set.\n"
|
||||
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
|
||||
"for details.",
|
||||
}
|
||||
)
|
||||
|
||||
test_duration: typing.Annotated[
|
||||
typing.Optional[int],
|
||||
validation.min(1)
|
||||
] = field(
|
||||
default=120,
|
||||
metadata={
|
||||
"name": "Test duration",
|
||||
"description":
|
||||
"Duration for which each step of the ingress chaos testing "
|
||||
"is to be performed.",
|
||||
},
|
||||
)
|
||||
|
||||
wait_duration: typing.Annotated[
|
||||
typing.Optional[int],
|
||||
validation.min(1)
|
||||
] = field(
|
||||
default=300,
|
||||
metadata={
|
||||
"name": "Wait Duration",
|
||||
"description":
|
||||
"Wait duration for finishing a test and its cleanup."
|
||||
"Ensure that it is significantly greater than wait_duration"
|
||||
}
|
||||
)
|
||||
|
||||
instance_count: typing.Annotated[
|
||||
typing.Optional[int],
|
||||
validation.min(1)
|
||||
] = field(
|
||||
default=1,
|
||||
metadata={
|
||||
"name": "Instance Count",
|
||||
"description":
|
||||
"Number of nodes to perform action/select that match "
|
||||
"the label selector.",
|
||||
}
|
||||
)
|
||||
|
||||
kubeconfig_path: typing.Optional[str] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Kubeconfig path",
|
||||
"description":
|
||||
"Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
|
||||
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
|
||||
"for details.",
|
||||
}
|
||||
)
|
||||
|
||||
execution_type: typing.Optional[str] = field(
|
||||
default='parallel',
|
||||
metadata={
|
||||
"name": "Execution Type",
|
||||
"description":
|
||||
"The order in which the ingress filters are applied. "
|
||||
"Execution type can be 'serial' or 'parallel'"
|
||||
}
|
||||
)
|
||||
|
||||
network_params: typing.Dict[str, str] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Network Parameters",
|
||||
"description":
|
||||
"The network filters that are applied on the interface. "
|
||||
"The currently supported filters are latency, "
|
||||
"loss and bandwidth"
|
||||
}
|
||||
)
|
||||
|
||||
kraken_config: typing.Optional[str] = field(
|
||||
default='',
|
||||
metadata={
|
||||
"name": "Kraken Config",
|
||||
"description":
|
||||
"Path to the config file of Kraken. "
|
||||
"Set this field if you wish to publish status onto Cerberus"
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
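# Illustrative example (values are assumptions, not shipped defaults): a config
# equivalent to shaping ingress traffic on the default interface of one
# labelled node for two minutes.
#
#   NetworkScenarioConfig(
#       label_selector="node-role.kubernetes.io/worker",
#       instance_count=1,
#       test_duration=120,
#       wait_duration=300,
#       execution_type="parallel",
#       network_params={"latency": "50ms", "loss": "0.02"},
#   )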
@dataclass
|
||||
class NetworkScenarioSuccessOutput:
|
||||
filter_direction: str = field(
|
||||
metadata={
|
||||
"name": "Filter Direction",
|
||||
"description":
|
||||
"Direction in which the traffic control filters are applied "
|
||||
"on the test interfaces"
|
||||
}
|
||||
)
|
||||
|
||||
test_interfaces: typing.Dict[str, typing.List[str]] = field(
|
||||
metadata={
|
||||
"name": "Test Interfaces",
|
||||
"description":
|
||||
"Dictionary of nodes and their interfaces on which "
|
||||
"the chaos experiment was performed"
|
||||
}
|
||||
)
|
||||
|
||||
network_parameters: typing.Dict[str, str] = field(
|
||||
metadata={
|
||||
"name": "Network Parameters",
|
||||
"description":
|
||||
"The network filters that are applied on the interfaces"
|
||||
}
|
||||
)
|
||||
|
||||
execution_type: str = field(
|
||||
metadata={
|
||||
"name": "Execution Type",
|
||||
"description": "The order in which the filters are applied"
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class NetworkScenarioErrorOutput:
|
||||
error: str = field(
|
||||
metadata={
|
||||
"name": "Error",
|
||||
"description":
|
||||
"Error message when there is a run-time error during "
|
||||
"the execution of the scenario"
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
def get_default_interface(
|
||||
node: str,
|
||||
pod_template,
|
||||
cli: CoreV1Api
|
||||
) -> str:
|
||||
"""
|
||||
Function that returns the interface used by a node's default route
|
||||
|
||||
Args:
|
||||
node (string)
|
||||
- Node from which the interface is to be returned
|
||||
|
||||
pod_template (jinja2.environment.Template)
|
||||
- The YAML template used to instantiate a pod to query
|
||||
the node's interface
|
||||
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
Returns:
|
||||
A single-element list containing the node's default interface
|
||||
"""
|
||||
|
||||
pod_body = yaml.safe_load(pod_template.render(nodename=node))
|
||||
logging.info("Creating pod to query interface on node %s" % node)
|
||||
kube_helper.create_pod(cli, pod_body, "default", 300)
|
||||
|
||||
try:
|
||||
cmd = ["ip", "r"]
|
||||
output = kube_helper.exec_cmd_in_pod(cli, cmd, "fedtools", "default")
|
||||
|
||||
if not output:
|
||||
logging.error("Exception occurred while executing command in pod")
|
||||
sys.exit(1)
|
||||
|
||||
routes = output.split('\n')
|
||||
for route in routes:
|
||||
if 'default' in route:
|
||||
default_route = route
|
||||
break
|
||||
|
||||
interfaces = [default_route.split()[4]]
|
||||
|
||||
finally:
|
||||
logging.info("Deleting pod to query interface on node")
|
||||
kube_helper.delete_pod(cli, "fedtools", "default")
|
||||
|
||||
return interfaces
|
||||
|
||||
|
||||
def verify_interface(
|
||||
input_interface_list: typing.List[str],
|
||||
node: str,
|
||||
pod_template,
|
||||
cli: CoreV1Api
|
||||
) -> typing.List[str]:
|
||||
"""
|
||||
Function that verifies whether a list of interfaces is present on the node.
|
||||
If the list is empty, it fetches the interface of the default route
|
||||
|
||||
Args:
|
||||
input_interface_list (List of strings)
|
||||
- The interfaces to be checked on the node
|
||||
|
||||
node (string):
|
||||
- Node on which input_interface_list is to be verified
|
||||
|
||||
pod_template (jinja2.environment.Template)
|
||||
- The YAML template used to instantiate a pod to query
|
||||
the node's interfaces
|
||||
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
Returns:
|
||||
The interface list for the node
|
||||
"""
|
||||
pod_body = yaml.safe_load(pod_template.render(nodename=node))
|
||||
logging.info("Creating pod to query interface on node %s" % node)
|
||||
kube_helper.create_pod(cli, pod_body, "default", 300)
|
||||
try:
|
||||
if input_interface_list == []:
|
||||
cmd = ["ip", "r"]
|
||||
output = kube_helper.exec_cmd_in_pod(
|
||||
cli,
|
||||
cmd,
|
||||
"fedtools",
|
||||
"default"
|
||||
)
|
||||
|
||||
if not output:
|
||||
logging.error(
|
||||
"Exception occurred while executing command in pod"
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
routes = output.split('\n')
|
||||
for route in routes:
|
||||
if 'default' in route:
|
||||
default_route = route
|
||||
break
|
||||
|
||||
input_interface_list = [default_route.split()[4]]
|
||||
|
||||
else:
|
||||
cmd = ["ip", "-br", "addr", "show"]
|
||||
output = kube_helper.exec_cmd_in_pod(
|
||||
cli,
|
||||
cmd,
|
||||
"fedtools",
|
||||
"default"
|
||||
)
|
||||
|
||||
if not output:
|
||||
logging.error(
|
||||
"Exception occurred while executing command in pod"
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
interface_ip = output.split('\n')
|
||||
node_interface_list = [
|
||||
interface.split()[0] for interface in interface_ip[:-1]
|
||||
]
|
||||
|
||||
for interface in input_interface_list:
|
||||
if interface not in node_interface_list:
|
||||
logging.error(
|
||||
"Interface %s not found in node %s interface list %s" %
|
||||
(interface, node, node_interface_list)
|
||||
)
|
||||
raise Exception(
|
||||
"Interface %s not found in node %s interface list %s" %
|
||||
(interface, node, node_interface_list)
|
||||
)
|
||||
finally:
|
||||
logging.info("Deleteing pod to query interface on node")
|
||||
kube_helper.delete_pod(cli, "fedtools", "default")
|
||||
|
||||
return input_interface_list
|
||||
|
||||
|
||||
def get_node_interfaces(
|
||||
node_interface_dict: typing.Dict[str, typing.List[str]],
|
||||
label_selector: str,
|
||||
instance_count: int,
|
||||
pod_template,
|
||||
cli: CoreV1Api
|
||||
) -> typing.Dict[str, typing.List[str]]:
|
||||
|
||||
"""
|
||||
Function that is used to process the input dictionary with the nodes and
|
||||
their test interfaces.
|
||||
|
||||
If the dictionary is empty, the label selector is used to select the nodes,
|
||||
and then a random interface on each node is chosen as a test interface.
|
||||
|
||||
If the dictionary is not empty, it is filtered to include the nodes which
|
||||
are active and then their interfaces are verified to be present
|
||||
|
||||
Args:
|
||||
node_interface_dict (Dictionary with keys as node name and value as
|
||||
a list of interface names)
|
||||
- Nodes and their interfaces for the scenario
|
||||
|
||||
label_selector (string):
|
||||
- Label selector to get nodes if node_interface_dict is empty
|
||||
|
||||
instance_count (int):
|
||||
- Number of nodes to fetch in case node_interface_dict is empty
|
||||
|
||||
pod_template (jinja2.environment.Template)
|
||||
- The YAML template used to instantiate a pod to query
|
||||
the node's interfaces
|
||||
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
Returns:
|
||||
Filtered dictionary containing the test nodes and their test interfaces
|
||||
"""
|
||||
if not node_interface_dict:
|
||||
if not label_selector:
|
||||
raise Exception(
|
||||
"If node names and interfaces aren't provided, "
|
||||
"then the label selector must be provided"
|
||||
)
|
||||
nodes = kube_helper.get_node(None, label_selector, instance_count, cli)
|
||||
node_interface_dict = {}
|
||||
for node in nodes:
|
||||
node_interface_dict[node] = get_default_interface(
|
||||
node,
|
||||
pod_template,
|
||||
cli
|
||||
)
|
||||
else:
|
||||
node_name_list = node_interface_dict.keys()
|
||||
filtered_node_list = []
|
||||
|
||||
for node in node_name_list:
|
||||
filtered_node_list.extend(
|
||||
kube_helper.get_node(node, label_selector, instance_count, cli)
|
||||
)
|
||||
|
||||
for node in filtered_node_list:
|
||||
node_interface_dict[node] = verify_interface(
|
||||
node_interface_dict[node], node, pod_template, cli
|
||||
)
|
||||
|
||||
return node_interface_dict
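# The returned mapping has node names as keys and the interface lists to be
# shaped as values, e.g. (node and interface names assumed):
#   {"worker-0": ["ens3"], "worker-1": ["ens3", "ens4"]}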
|
||||
|
||||
|
||||
def apply_ingress_filter(
|
||||
cfg: NetworkScenarioConfig,
|
||||
interface_list: typing.List[str],
|
||||
node: str,
|
||||
pod_template,
|
||||
job_template,
|
||||
batch_cli: BatchV1Api,
|
||||
cli: CoreV1Api,
|
||||
create_interfaces: bool = True,
|
||||
param_selector: str = 'all'
|
||||
) -> str:
|
||||
|
||||
"""
|
||||
Function that applies the filters to shape incoming traffic to
|
||||
the provided node's interfaces.
|
||||
This is done by adding a virtual interface before each physical interface
|
||||
and then performing egress traffic control on the virtual interface
|
||||
|
||||
Args:
|
||||
cfg (NetworkScenarioConfig)
|
||||
- Configurations used in this scenario
|
||||
|
||||
interface_list (List of strings)
|
||||
- The interfaces on the node on which the filter is applied
|
||||
|
||||
node (string):
|
||||
- Node on which the interfaces in interface_list are present
|
||||
|
||||
pod_template (jinja2.environment.Template))
|
||||
- The YAML template used to instantiate a pod to create
|
||||
virtual interfaces on the node
|
||||
|
||||
job_template (jinja2.environment.Template))
|
||||
- The YAML template used to instantiate a job to apply and remove
|
||||
the filters on the interfaces
|
||||
|
||||
batch_cli
|
||||
- Object to interact with Kubernetes Python client's BatchV1 API
|
||||
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
param_selector (string)
|
||||
- Used to specify what kind of filter to apply. Useful during
|
||||
serial execution mode. Default value is 'all'
|
||||
|
||||
Returns:
|
||||
The name of the job created that executes the commands on a node
|
||||
for ingress chaos scenario
|
||||
"""
|
||||
|
||||
network_params = cfg.network_params
|
||||
if param_selector != 'all':
|
||||
network_params = {param_selector: cfg.network_params[param_selector]}
|
||||
|
||||
if create_interfaces:
|
||||
create_virtual_interfaces(cli, interface_list, node, pod_template)
|
||||
|
||||
exec_cmd = get_ingress_cmd(
|
||||
interface_list, network_params, duration=cfg.test_duration
|
||||
)
|
||||
logging.info("Executing %s on node %s" % (exec_cmd, node))
|
||||
job_body = yaml.safe_load(
|
||||
job_template.render(
|
||||
jobname=str(hash(node))[:5],
|
||||
nodename=node,
|
||||
cmd=exec_cmd
|
||||
)
|
||||
)
|
||||
api_response = kube_helper.create_job(batch_cli, job_body)
|
||||
|
||||
if api_response is None:
|
||||
raise Exception("Error creating job")
|
||||
|
||||
return job_body["metadata"]["name"]
|
||||
|
||||
|
||||
def create_virtual_interfaces(
|
||||
cli: CoreV1Api,
|
||||
interface_list: typing.List[str],
|
||||
node: str,
|
||||
pod_template
|
||||
) -> None:
|
||||
"""
|
||||
Function that creates a privileged pod and uses it to create
|
||||
virtual interfaces on the node
|
||||
|
||||
Args:
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
interface_list (List of strings)
|
||||
- The list of interfaces on the node for which virtual interfaces
|
||||
are to be created
|
||||
|
||||
node (string)
|
||||
- The node on which the virtual interfaces are created
|
||||
|
||||
pod_template (jinja2.environment.Template))
|
||||
- The YAML template used to instantiate a pod to create
|
||||
virtual interfaces on the node
|
||||
"""
|
||||
pod_body = yaml.safe_load(
|
||||
pod_template.render(nodename=node)
|
||||
)
|
||||
kube_helper.create_pod(cli, pod_body, "default", 300)
|
||||
logging.info(
|
||||
"Creating {0} virtual interfaces on node {1} using a pod".format(
|
||||
len(interface_list),
|
||||
node
|
||||
)
|
||||
)
|
||||
create_ifb(cli, len(interface_list), 'modtools')
|
||||
logging.info("Deleting pod used to create virtual interfaces")
|
||||
kube_helper.delete_pod(cli, "modtools", "default")
|
||||
|
||||
|
||||
def delete_virtual_interfaces(
|
||||
cli: CoreV1Api,
|
||||
node_list: typing.List[str],
|
||||
pod_template
|
||||
):
|
||||
"""
|
||||
Function that creates a privileged pod and uses it to delete all
|
||||
virtual interfaces on the specified nodes
|
||||
|
||||
Args:
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
node_list (List of strings)
|
||||
- The list of nodes on which the list of virtual interfaces are
|
||||
to be deleted
|
||||
|
||||
node (string)
|
||||
- The node on which the virtual interfaces are created
|
||||
|
||||
pod_template (jinja2.environment.Template))
|
||||
- The YAML template used to instantiate a pod to delete
|
||||
virtual interfaces on the node
|
||||
"""
|
||||
|
||||
for node in node_list:
|
||||
pod_body = yaml.safe_load(
|
||||
pod_template.render(nodename=node)
|
||||
)
|
||||
kube_helper.create_pod(cli, pod_body, "default", 300)
|
||||
logging.info(
|
||||
"Deleting all virtual interfaces on node {0}".format(node)
|
||||
)
|
||||
delete_ifb(cli, 'modtools')
|
||||
kube_helper.delete_pod(cli, "modtools", "default")
|
||||
|
||||
|
||||
def create_ifb(cli: CoreV1Api, number: int, pod_name: str):
|
||||
"""
|
||||
Function that creates virtual interfaces in a pod.
|
||||
Makes use of modprobe commands
|
||||
"""
|
||||
|
||||
exec_command = [
|
||||
'chroot', '/host',
|
||||
'modprobe', 'ifb', 'numifbs=' + str(number)
|
||||
]
|
||||
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
|
||||
|
||||
for i in range(0, number):
|
||||
exec_command = ['chroot', '/host', 'ip', 'link', 'set', 'dev']
|
||||
exec_command += ['ifb' + str(i), 'up']
|
||||
kube_helper.exec_cmd_in_pod(
|
||||
cli,
|
||||
exec_command,
|
||||
pod_name,
|
||||
'default'
|
||||
)
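# On the target node, the two exec calls above amount to running (N assumed):
#   chroot /host modprobe ifb numifbs=<N>
#   chroot /host ip link set dev ifb0 up   # ...repeated for ifb1..ifb<N-1>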
|
||||
|
||||
|
||||
def delete_ifb(cli: CoreV1Api, pod_name: str):
|
||||
"""
|
||||
Function that deletes all virtual interfaces in a pod.
|
||||
Makes use of modprobe command
|
||||
"""
|
||||
|
||||
exec_command = ['chroot', '/host', 'modprobe', '-r', 'ifb']
|
||||
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
|
||||
|
||||
|
||||
def get_job_pods(cli: CoreV1Api, api_response):
|
||||
"""
|
||||
Function that gets the pod corresponding to the job
|
||||
|
||||
Args:
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
api_response
|
||||
- The API response for the job status
|
||||
|
||||
Returns
|
||||
Pod corresponding to the job
|
||||
"""
|
||||
|
||||
controllerUid = api_response.metadata.labels["controller-uid"]
|
||||
pod_label_selector = "controller-uid=" + controllerUid
|
||||
pods_list = kube_helper.list_pods(
|
||||
cli,
|
||||
label_selector=pod_label_selector,
|
||||
namespace="default"
|
||||
)
|
||||
|
||||
return pods_list[0]
|
||||
|
||||
|
||||
def wait_for_job(
|
||||
batch_cli: BatchV1Api,
|
||||
job_list: typing.List[str],
|
||||
timeout: int = 300
|
||||
) -> None:
|
||||
"""
|
||||
Function that waits for a list of jobs to finish within a time period
|
||||
|
||||
Args:
|
||||
batch_cli (BatchV1Api)
|
||||
- Object to interact with Kubernetes Python client's BatchV1 API
|
||||
|
||||
job_list (List of strings)
|
||||
- The list of jobs to check for completion
|
||||
|
||||
timeout (int)
|
||||
- Max duration to wait for checking whether the jobs are completed
|
||||
"""
|
||||
|
||||
wait_time = time.time() + timeout
|
||||
count = 0
|
||||
job_len = len(job_list)
|
||||
while count != job_len:
|
||||
for job_name in job_list:
|
||||
try:
|
||||
api_response = kube_helper.get_job_status(
|
||||
batch_cli,
|
||||
job_name,
|
||||
namespace="default"
|
||||
)
|
||||
if (
|
||||
api_response.status.succeeded is not None or
|
||||
api_response.status.failed is not None
|
||||
):
|
||||
count += 1
|
||||
job_list.remove(job_name)
|
||||
except Exception:
|
||||
logging.warn("Exception in getting job status")
|
||||
if time.time() > wait_time:
|
||||
raise Exception(
|
||||
"Jobs did not complete within "
|
||||
"the {0}s timeout period".format(timeout)
|
||||
)
|
||||
time.sleep(5)
|
||||
|
||||
|
||||
def delete_jobs(
|
||||
cli: CoreV1Api,
|
||||
batch_cli: BatchV1Api,
|
||||
job_list: typing.List[str]
|
||||
):
|
||||
"""
|
||||
Function that deletes jobs
|
||||
|
||||
Args:
|
||||
cli (CoreV1Api)
|
||||
- Object to interact with Kubernetes Python client's CoreV1 API
|
||||
|
||||
batch_cli (BatchV1Api)
|
||||
- Object to interact with Kubernetes Python client's BatchV1 API
|
||||
|
||||
job_list (List of strings)
|
||||
- The list of jobs to delete
|
||||
"""
|
||||
|
||||
for job_name in job_list:
|
||||
try:
|
||||
api_response = kube_helper.get_job_status(
|
||||
batch_cli,
|
||||
job_name,
|
||||
namespace="default"
|
||||
)
|
||||
if api_response.status.failed is not None:
|
||||
pod_name = get_job_pods(cli, api_response)
|
||||
pod_stat = kube_helper.read_pod(
|
||||
cli,
|
||||
name=pod_name,
|
||||
namespace="default"
|
||||
)
|
||||
logging.error(pod_stat.status.container_statuses)
|
||||
pod_log_response = kube_helper.get_pod_log(
|
||||
cli,
|
||||
name=pod_name,
|
||||
namespace="default"
|
||||
)
|
||||
pod_log = pod_log_response.data.decode("utf-8")
|
||||
logging.error(pod_log)
|
||||
except Exception as e:
|
||||
logging.warn("Exception in getting job status: %s" % str(e))
|
||||
api_response = kube_helper.delete_job(
|
||||
batch_cli,
|
||||
name=job_name,
|
||||
namespace="default"
|
||||
)
|
||||
|
||||
|
||||
def get_ingress_cmd(
|
||||
interface_list: typing.List[str],
|
||||
network_parameters: typing.Dict[str, str],
|
||||
duration: int = 300
|
||||
):
|
||||
"""
|
||||
Function that returns the commands to the ingress traffic shaping on
|
||||
the node.
|
||||
First, the virtual interfaces created are linked to the test interfaces
|
||||
such that there is a one-to-one mapping between a virtual interface and
|
||||
a test interface.
|
||||
Then, incoming traffic to each test interface is forced to first pass
|
||||
through the corresponding virtual interface.
|
||||
Linux's tc commands are then used to perform egress traffic control
|
||||
on the virtual interface. Since the outbound traffic from
|
||||
the virtual interface passes through the test interface, this is
|
||||
effectively ingress traffic control.
|
||||
After a certain time interval, the traffic is restored to normal
|
||||
|
||||
Args:
|
||||
interface_list (List of strings)
|
||||
- Test interface list
|
||||
|
||||
network_parameters (Dictionary with key and value as string)
|
||||
- Loss/Delay/Bandwidth and their corresponding values
|
||||
|
||||
duration (int)
|
||||
- Duration for which the traffic control is to be done
|
||||
|
||||
Returns:
|
||||
The traffic shaping commands as a string
|
||||
"""
|
||||
|
||||
tc_set = tc_unset = tc_ls = ""
|
||||
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
|
||||
|
||||
interface_pattern = re.compile(r"^[a-z0-9\-\@\_]+$")
|
||||
ifb_pattern = re.compile(r"^ifb[0-9]+$")
|
||||
|
||||
for i, interface in enumerate(interface_list):
|
||||
if not interface_pattern.match(interface):
|
||||
logging.error(
|
||||
"Interface name can only consist of alphanumeric characters"
|
||||
)
|
||||
raise Exception(
|
||||
"Interface '{0}' does not match the required regex pattern :"
|
||||
r" ^[a-z0-9\-\@\_]+$".format(interface)
|
||||
)
|
||||
|
||||
ifb_name = "ifb{0}".format(i)
|
||||
if not ifb_pattern.match(ifb_name):
|
||||
logging.error("Invalid IFB name")
|
||||
raise Exception(
|
||||
"Interface '{0}' is an invalid IFB name. IFB name should "
|
||||
"follow the regex pattern ^ifb[0-9]+$".format(ifb_name)
|
||||
)
|
||||
|
||||
tc_set += "tc qdisc add dev {0} handle ffff: ingress;".format(
|
||||
interface
|
||||
)
|
||||
tc_set += "tc filter add dev {0} parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev {1};".format( # noqa
|
||||
interface,
|
||||
ifb_name
|
||||
)
|
||||
tc_set = "{0} tc qdisc add dev {1} root netem".format(tc_set, ifb_name)
|
||||
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, ifb_name)
|
||||
tc_unset += "tc qdisc del dev {0} handle ffff: ingress;".format(
|
||||
interface
|
||||
)
|
||||
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, ifb_name)
|
||||
|
||||
for parameter in network_parameters.keys():
|
||||
tc_set += " {0} {1} ".format(
|
||||
param_map[parameter],
|
||||
network_parameters[parameter]
|
||||
)
|
||||
tc_set += ";"
|
||||
|
||||
exec_cmd = "{0} {1} sleep {2};{3} sleep 20;{4}".format(
|
||||
tc_set,
|
||||
tc_ls,
|
||||
duration,
|
||||
tc_unset,
|
||||
tc_ls
|
||||
)
|
||||
|
||||
return exec_cmd
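# Illustrative sketch (not part of the module): what get_ingress_cmd()
# produces for a single interface with a 50ms latency filter. The interface
# name "ens3" and the parameter values are assumptions for the example.
#
#   get_ingress_cmd(["ens3"], {"latency": "50ms"}, duration=120)
#
# returns, roughly (whitespace tidied):
#
#   tc qdisc add dev ens3 handle ffff: ingress;
#   tc filter add dev ens3 parent ffff: protocol ip u32 match u32 0 0 \
#       action mirred egress redirect dev ifb0;
#   tc qdisc add dev ifb0 root netem delay 50ms;
#   tc qdisc ls dev ifb0; sleep 120;
#   tc qdisc del dev ifb0 root; tc qdisc del dev ens3 handle ffff: ingress;
#   sleep 20; tc qdisc ls dev ifb0;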
|
||||
|
||||
|
||||
@plugin.step(
|
||||
id="network_chaos",
|
||||
name="Network Ingress",
|
||||
description="Applies filters to ihe ingress side of node(s) interfaces",
|
||||
outputs={
|
||||
"success": NetworkScenarioSuccessOutput,
|
||||
"error": NetworkScenarioErrorOutput
|
||||
},
|
||||
)
|
||||
def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
|
||||
str,
|
||||
typing.Union[
|
||||
NetworkScenarioSuccessOutput,
|
||||
NetworkScenarioErrorOutput
|
||||
]
|
||||
]:
|
||||
|
||||
"""
|
||||
Function that performs the ingress network chaos scenario based
|
||||
on the provided configuration
|
||||
|
||||
Args:
|
||||
cfg (NetworkScenarioConfig)
|
||||
- The object containing the configuration for the scenario
|
||||
|
||||
Returns
|
||||
A 'success' or 'error' message along with their details
|
||||
"""
|
||||
|
||||
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
|
||||
env = Environment(loader=file_loader)
|
||||
job_template = env.get_template("job.j2")
|
||||
pod_interface_template = env.get_template("pod_interface.j2")
|
||||
pod_module_template = env.get_template("pod_module.j2")
|
||||
cli, batch_cli = kube_helper.setup_kubernetes(cfg.kubeconfig_path)
|
||||
|
||||
try:
|
||||
node_interface_dict = get_node_interfaces(
|
||||
cfg.node_interface_name,
|
||||
cfg.label_selector,
|
||||
cfg.instance_count,
|
||||
pod_interface_template,
|
||||
cli
|
||||
)
|
||||
except Exception:
|
||||
return "error", NetworkScenarioErrorOutput(
|
||||
format_exc()
|
||||
)
|
||||
job_list = []
|
||||
publish = False
|
||||
if cfg.kraken_config:
|
||||
failed_post_scenarios = ""
|
||||
try:
|
||||
with open(cfg.kraken_config, "r") as f:
|
||||
config = yaml.full_load(f)
|
||||
except Exception:
|
||||
logging.error(
|
||||
"Error reading Kraken config from %s" % cfg.kraken_config
|
||||
)
|
||||
return "error", NetworkScenarioErrorOutput(
|
||||
format_exc()
|
||||
)
|
||||
publish = True
|
||||
|
||||
try:
|
||||
if cfg.execution_type == 'parallel':
|
||||
for node in node_interface_dict:
|
||||
job_list.append(
|
||||
apply_ingress_filter(
|
||||
cfg,
|
||||
node_interface_dict[node],
|
||||
node,
|
||||
pod_module_template,
|
||||
job_template,
|
||||
batch_cli,
|
||||
cli
|
||||
)
|
||||
)
|
||||
logging.info("Waiting for parallel job to finish")
|
||||
start_time = int(time.time())
|
||||
wait_for_job(batch_cli, job_list[:], cfg.wait_duration)
|
||||
end_time = int(time.time())
|
||||
if publish:
|
||||
cerberus.publish_kraken_status(
|
||||
config,
|
||||
failed_post_scenarios,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
|
||||
elif cfg.execution_type == 'serial':
|
||||
create_interfaces = True
|
||||
for param in cfg.network_params:
|
||||
for node in node_interface_dict:
|
||||
job_list.append(
|
||||
apply_ingress_filter(
|
||||
cfg,
|
||||
node_interface_dict[node],
|
||||
node,
|
||||
pod_module_template,
|
||||
job_template,
|
||||
batch_cli,
|
||||
cli,
|
||||
create_interfaces=create_interfaces,
|
||||
param_selector=param
|
||||
)
|
||||
)
|
||||
logging.info("Waiting for serial job to finish")
|
||||
start_time = int(time.time())
|
||||
wait_for_job(batch_cli, job_list[:], cfg.wait_duration)
|
||||
logging.info("Deleting jobs")
|
||||
delete_jobs(cli, batch_cli, job_list[:])
|
||||
job_list = []
|
||||
logging.info(
|
||||
"Waiting for wait_duration : %ss" % cfg.wait_duration
|
||||
)
|
||||
time.sleep(cfg.wait_duration)
|
||||
end_time = int(time.time())
|
||||
if publish:
|
||||
cerberus.publish_kraken_status(
|
||||
config,
|
||||
failed_post_scenarios,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
create_interfaces = False
|
||||
else:
|
||||
|
||||
return "error", NetworkScenarioErrorOutput(
|
||||
"Invalid execution type - serial and parallel are "
|
||||
"the only accepted types"
|
||||
)
|
||||
return "success", NetworkScenarioSuccessOutput(
|
||||
filter_direction="ingress",
|
||||
test_interfaces=node_interface_dict,
|
||||
network_parameters=cfg.network_params,
|
||||
execution_type=cfg.execution_type
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error("Network Chaos exiting due to Exception - %s" % e)
|
||||
return "error", NetworkScenarioErrorOutput(
|
||||
format_exc()
|
||||
)
|
||||
finally:
|
||||
delete_virtual_interfaces(
|
||||
cli,
|
||||
node_interface_dict.keys(),
|
||||
pod_module_template
|
||||
)
|
||||
logging.info("Deleting jobs(if any)")
|
||||
delete_jobs(cli, batch_cli, job_list[:])
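# In short: with execution_type="parallel" each node gets a single job that
# applies all filters in network_params at once; with "serial" one filter at
# a time is applied across all nodes, and the plugin sleeps for wait_duration
# between filters before moving on to the next one.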
|
||||
25
kraken/plugins/network/job.j2
Normal file
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: chaos-{{jobname}}
spec:
  template:
    spec:
      nodeName: {{nodename}}
      hostNetwork: true
      containers:
      - name: networkchaos
        image: docker.io/fedora/tools
        command: ["/bin/sh", "-c", "{{cmd}}"]
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
      volumes:
      - name: lib-modules
        hostPath:
          path: /lib/modules
      restartPolicy: Never
  backoffLimit: 0
284
kraken/plugins/network/kubernetes_functions.py
Normal file
@@ -0,0 +1,284 @@
|
||||
from kubernetes import config, client
|
||||
from kubernetes.client.rest import ApiException
|
||||
from kubernetes.stream import stream
|
||||
import sys
|
||||
import time
|
||||
import logging
|
||||
import random
|
||||
|
||||
def setup_kubernetes(kubeconfig_path):
|
||||
"""
|
||||
Sets up the Kubernetes client
|
||||
"""
|
||||
|
||||
if kubeconfig_path is None:
|
||||
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
|
||||
config.load_kube_config(kubeconfig_path)
|
||||
cli = client.CoreV1Api()
|
||||
batch_cli = client.BatchV1Api()
|
||||
|
||||
return cli, batch_cli
|
||||
|
||||
|
||||
def create_job(batch_cli, body, namespace="default"):
|
||||
"""
|
||||
Function used to create a job from a YAML config
|
||||
"""
|
||||
|
||||
try:
|
||||
api_response = batch_cli.create_namespaced_job(body=body, namespace=namespace)
|
||||
return api_response
|
||||
except ApiException as api:
|
||||
logging.warn(
|
||||
"Exception when calling \
|
||||
BatchV1Api->create_job: %s"
|
||||
% api
|
||||
)
|
||||
if api.status == 409:
|
||||
logging.warn("Job already present")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
BatchV1Api->create_namespaced_job: %s"
|
||||
% e
|
||||
)
|
||||
raise
|
||||
|
||||
|
||||
def delete_pod(cli, name, namespace):
|
||||
"""
|
||||
Function that deletes a pod and waits until deletion is complete
|
||||
"""
|
||||
|
||||
try:
|
||||
cli.delete_namespaced_pod(name=name, namespace=namespace)
|
||||
while cli.read_namespaced_pod(name=name, namespace=namespace):
|
||||
time.sleep(1)
|
||||
except ApiException as e:
|
||||
if e.status == 404:
|
||||
logging.info("Pod deleted")
|
||||
else:
|
||||
logging.error("Failed to delete pod %s" % e)
|
||||
raise e
|
||||
|
||||
|
||||
def create_pod(cli, body, namespace, timeout=120):
|
||||
"""
|
||||
Function used to create a pod from a YAML config
|
||||
"""
|
||||
|
||||
try:
|
||||
pod_stat = None
|
||||
pod_stat = cli.create_namespaced_pod(body=body, namespace=namespace)
|
||||
end_time = time.time() + timeout
|
||||
while True:
|
||||
pod_stat = cli.read_namespaced_pod(name=body["metadata"]["name"], namespace=namespace)
|
||||
if pod_stat.status.phase == "Running":
|
||||
break
|
||||
if time.time() > end_time:
|
||||
raise Exception("Starting pod failed")
|
||||
time.sleep(1)
|
||||
except Exception as e:
|
||||
logging.error("Pod creation failed %s" % e)
|
||||
if pod_stat:
|
||||
logging.error(pod_stat.status.container_statuses)
|
||||
delete_pod(cli, body["metadata"]["name"], namespace)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def exec_cmd_in_pod(cli, command, pod_name, namespace, container=None):
|
||||
"""
|
||||
Function used to execute a command in a running pod
|
||||
"""
|
||||
|
||||
exec_command = command
|
||||
try:
|
||||
if container:
|
||||
ret = stream(
|
||||
cli.connect_get_namespaced_pod_exec,
|
||||
pod_name,
|
||||
namespace,
|
||||
container=container,
|
||||
command=exec_command,
|
||||
stderr=True,
|
||||
stdin=False,
|
||||
stdout=True,
|
||||
tty=False,
|
||||
)
|
||||
else:
|
||||
ret = stream(
|
||||
cli.connect_get_namespaced_pod_exec,
|
||||
pod_name,
|
||||
namespace,
|
||||
command=exec_command,
|
||||
stderr=True,
|
||||
stdin=False,
|
||||
stdout=True,
|
||||
tty=False,
|
||||
)
|
||||
except Exception as e:
|
||||
return False
|
||||
|
||||
return ret
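# Example (assumed pod name): run "ip -br addr show" inside the helper pod
# and get its combined output back as a string.
#
#   output = exec_cmd_in_pod(cli, ["ip", "-br", "addr", "show"],
#                            "fedtools", "default")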
|
||||
|
||||
|
||||
def create_ifb(cli, number, pod_name):
|
||||
"""
|
||||
Function that creates virtual interfaces in a pod. Makes use of modprobe commands
|
||||
"""
|
||||
|
||||
exec_command = ['chroot', '/host', 'modprobe', 'ifb','numifbs=' + str(number)]
|
||||
resp = exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
|
||||
|
||||
for i in range(0, number):
|
||||
exec_command = ['chroot', '/host','ip','link','set','dev']
|
||||
exec_command+= ['ifb' + str(i), 'up']
|
||||
resp = exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
|
||||
|
||||
|
||||
def delete_ifb(cli, pod_name):
|
||||
"""
|
||||
Function that deletes all virtual interfaces in a pod. Makes use of modprobe command
|
||||
"""
|
||||
|
||||
exec_command = ['chroot', '/host', 'modprobe', '-r', 'ifb']
|
||||
resp = exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
|
||||
|
||||
|
||||
def list_pods(cli, namespace, label_selector=None):
|
||||
"""
|
||||
Function used to list pods in a given namespace and having a certain label
|
||||
"""
|
||||
|
||||
pods = []
|
||||
try:
|
||||
if label_selector:
|
||||
ret = cli.list_namespaced_pod(namespace, pretty=True, label_selector=label_selector)
|
||||
else:
|
||||
ret = cli.list_namespaced_pod(namespace, pretty=True)
|
||||
except ApiException as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
CoreV1Api->list_namespaced_pod: %s\n"
|
||||
% e
|
||||
)
|
||||
raise e
|
||||
for pod in ret.items:
|
||||
pods.append(pod.metadata.name)
|
||||
|
||||
return pods
|
||||
|
||||
|
||||
def get_job_status(batch_cli, name, namespace="default"):
|
||||
"""
|
||||
Function that retrieves the status of a running job in a given namespace
|
||||
"""
|
||||
|
||||
try:
|
||||
return batch_cli.read_namespaced_job_status(name=name, namespace=namespace)
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
BatchV1Api->read_namespaced_job_status: %s"
|
||||
% e
|
||||
)
|
||||
raise
|
||||
|
||||
|
||||
def get_pod_log(cli, name, namespace="default"):
|
||||
"""
|
||||
Function that retrieves the logs of a running pod in a given namespace
|
||||
"""
|
||||
|
||||
return cli.read_namespaced_pod_log(
|
||||
name=name, namespace=namespace, _return_http_data_only=True, _preload_content=False
|
||||
)
|
||||
|
||||
|
||||
def read_pod(cli, name, namespace="default"):
|
||||
"""
|
||||
Function that retrieves the info of a running pod in a given namespace
|
||||
"""
|
||||
|
||||
return cli.read_namespaced_pod(name=name, namespace=namespace)
|
||||
|
||||
|
||||
|
||||
def delete_job(batch_cli, name, namespace="default"):
|
||||
"""
|
||||
Deletes a job with the input name and namespace
|
||||
"""
|
||||
|
||||
try:
|
||||
api_response = batch_cli.delete_namespaced_job(
|
||||
name=name,
|
||||
namespace=namespace,
|
||||
body=client.V1DeleteOptions(propagation_policy="Foreground", grace_period_seconds=0),
|
||||
)
|
||||
logging.debug("Job deleted. status='%s'" % str(api_response.status))
|
||||
return api_response
|
||||
except ApiException as api:
|
||||
logging.warn(
|
||||
"Exception when calling \
|
||||
BatchV1Api->create_namespaced_job: %s"
|
||||
% api
|
||||
)
|
||||
logging.warn("Job already deleted\n")
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
BatchV1Api->delete_namespaced_job: %s\n"
|
||||
% e
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def list_ready_nodes(cli, label_selector=None):
|
||||
"""
|
||||
Returns a list of ready nodes
|
||||
"""
|
||||
|
||||
nodes = []
|
||||
try:
|
||||
if label_selector:
|
||||
ret = cli.list_node(pretty=True, label_selector=label_selector)
|
||||
else:
|
||||
ret = cli.list_node(pretty=True)
|
||||
except ApiException as e:
|
||||
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
|
||||
raise e
|
||||
for node in ret.items:
|
||||
for cond in node.status.conditions:
|
||||
if str(cond.type) == "Ready" and str(cond.status) == "True":
|
||||
nodes.append(node.metadata.name)
|
||||
|
||||
return nodes
|
||||
|
||||
|
||||
def get_node(node_name, label_selector, instance_kill_count, cli):
|
||||
"""
|
||||
Returns active node(s) on which the scenario can be performed
|
||||
"""
|
||||
|
||||
if node_name in list_ready_nodes(cli):
|
||||
return [node_name]
|
||||
elif node_name:
|
||||
logging.info(
|
||||
"Node with provided node_name does not exist or the node might "
|
||||
"be in NotReady state."
|
||||
)
|
||||
nodes = list_ready_nodes(cli, label_selector)
|
||||
if not nodes:
|
||||
raise Exception("Ready nodes with the provided label selector do not exist")
|
||||
logging.info(
|
||||
"Ready nodes with the label selector %s: %s" % (label_selector, nodes)
|
||||
)
|
||||
number_of_nodes = len(nodes)
|
||||
if instance_kill_count == number_of_nodes:
|
||||
return nodes
|
||||
nodes_to_return = []
|
||||
for i in range(instance_kill_count):
|
||||
node_to_add = nodes[random.randint(0, len(nodes) - 1)]
|
||||
nodes_to_return.append(node_to_add)
|
||||
nodes.remove(node_to_add)
|
||||
return nodes_to_return
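# Note: the selection loop above is equivalent to
# random.sample(nodes, instance_kill_count).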
|
||||
16
kraken/plugins/network/pod_interface.j2
Normal file
@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
  name: fedtools
spec:
  hostNetwork: true
  nodeName: {{nodename}}
  containers:
  - name: fedtools
    image: docker.io/fedora/tools
    command:
    - /bin/sh
    - -c
    - "trap : TERM INT; sleep infinity & wait"
    securityContext:
      privileged: true
30
kraken/plugins/network/pod_module.j2
Normal file
@@ -0,0 +1,30 @@
apiVersion: v1
kind: Pod
metadata:
  name: modtools
spec:
  nodeName: {{nodename}}
  containers:
  - name: modtools
    image: docker.io/fedora/tools
    imagePullPolicy: IfNotPresent
    command:
    - /bin/sh
    - -c
    - "trap : TERM INT; sleep infinity & wait"
    tty: true
    stdin: true
    stdinOnce: true
    securityContext:
      privileged: true
    volumeMounts:
    - name: host
      mountPath: /host
  volumes:
  - name: host
    hostPath:
      path: /
  hostNetwork: true
  hostIPC: true
  hostPID: true
  restartPolicy: Never
269
kraken/plugins/pod_plugin.py
Executable file
@@ -0,0 +1,269 @@
|
||||
#!/usr/bin/env python
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import typing
|
||||
from dataclasses import dataclass, field
|
||||
import random
|
||||
from datetime import datetime
|
||||
from traceback import format_exc
|
||||
|
||||
from kubernetes import config, client
|
||||
from kubernetes.client import V1PodList, V1Pod, ApiException, V1DeleteOptions
|
||||
from arcaflow_plugin_sdk import validation, plugin, schema
|
||||
|
||||
|
||||
def setup_kubernetes(kubeconfig_path):
|
||||
if kubeconfig_path is None:
|
||||
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
|
||||
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
|
||||
|
||||
if kubeconfig.config is None:
|
||||
raise Exception(
|
||||
'Invalid kube-config file: %s. '
|
||||
'No configuration found.' % kubeconfig_path
|
||||
)
|
||||
loader = config.kube_config.KubeConfigLoader(
|
||||
config_dict=kubeconfig.config,
|
||||
)
|
||||
client_config = client.Configuration()
|
||||
loader.load_and_set(client_config)
|
||||
return client.ApiClient(configuration=client_config)
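# Example: build a CoreV1Api client from the default kubeconfig, as done in
# the steps below.
#
#   core_v1 = client.CoreV1Api(setup_kubernetes(None))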
|
||||
|
||||
|
||||
def _find_pods(core_v1, label_selector, name_pattern, namespace_pattern):
|
||||
pods: typing.List[V1Pod] = []
|
||||
_continue = None
|
||||
finished = False
|
||||
while not finished:
|
||||
pod_response: V1PodList = core_v1.list_pod_for_all_namespaces(
|
||||
watch=False,
|
||||
label_selector=label_selector
|
||||
)
|
||||
for pod in pod_response.items:
|
||||
pod: V1Pod
|
||||
if (name_pattern is None or name_pattern.match(pod.metadata.name)) and \
|
||||
namespace_pattern.match(pod.metadata.namespace):
|
||||
pods.append(pod)
|
||||
_continue = pod_response.metadata._continue
|
||||
if _continue is None:
|
||||
finished = True
|
||||
return pods
|
||||
|
||||
|
||||
@dataclass
|
||||
class Pod:
|
||||
namespace: str
|
||||
name: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class PodKillSuccessOutput:
|
||||
pods: typing.Dict[int, Pod] = field(metadata={
|
||||
"name": "Pods removed",
|
||||
"description": "Map between timestamps and the pods removed. The timestamp is provided in nanoseconds."
|
||||
})
|
||||
|
||||
|
||||
@dataclass
|
||||
class PodWaitSuccessOutput:
|
||||
pods: typing.List[Pod] = field(metadata={
|
||||
"name": "Pods",
|
||||
"description": "List of pods that have been found to run."
|
||||
})
|
||||
|
||||
|
||||
@dataclass
|
||||
class PodErrorOutput:
|
||||
error: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class KillPodConfig:
|
||||
"""
|
||||
This is a configuration structure specific to pod kill scenario. It describes which pod from which
|
||||
namespace(s) to select for killing and how many pods to kill.
|
||||
"""
|
||||
|
||||
namespace_pattern: re.Pattern = field(metadata={
|
||||
"name": "Namespace pattern",
|
||||
"description": "Regular expression for target pod namespaces."
|
||||
})
|
||||
|
||||
name_pattern: typing.Annotated[
|
||||
typing.Optional[re.Pattern],
|
||||
validation.required_if_not("label_selector")
|
||||
] = field(default=None, metadata={
|
||||
"name": "Name pattern",
|
||||
"description": "Regular expression for target pods. Required if label_selector is not set."
|
||||
})
|
||||
|
||||
kill: typing.Annotated[int, validation.min(1)] = field(
|
||||
default=1,
|
||||
metadata={"name": "Number of pods to kill", "description": "How many pods should we attempt to kill?"}
|
||||
)
|
||||
|
||||
label_selector: typing.Annotated[
|
||||
typing.Optional[str],
|
||||
validation.min(1),
|
||||
validation.required_if_not("name_pattern")
|
||||
] = field(default=None, metadata={
|
||||
"name": "Label selector",
|
||||
"description": "Kubernetes label selector for the target pods. Required if name_pattern is not set.\n"
|
||||
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details."
|
||||
})
|
||||
|
||||
kubeconfig_path: typing.Optional[str] = field(default=None, metadata={
|
||||
"name": "Kubeconfig path",
|
||||
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
|
||||
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for "
|
||||
"details."
|
||||
})
|
||||
|
||||
timeout: int = field(default=180, metadata={
|
||||
"name": "Timeout",
|
||||
"description": "Timeout to wait for the target pod(s) to be removed in seconds."
|
||||
})
|
||||
|
||||
backoff: int = field(default=1, metadata={
|
||||
"name": "Backoff",
|
||||
"description": "How many seconds to wait between checks for the target pod status."
|
||||
})
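# Illustrative sketch: a KillPodConfig that kills up to two pods labelled
# app=nginx (label value assumed) in any namespace starting with "test-".
#
#   cfg = KillPodConfig(
#       namespace_pattern=re.compile(r"^test-.*$"),
#       label_selector="app=nginx",
#       kill=2,
#   )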
|
||||
|
||||
|
||||
@plugin.step(
|
||||
"kill-pods",
|
||||
"Kill pods",
|
||||
"Kill pods as specified by parameters",
|
||||
{"success": PodKillSuccessOutput, "error": PodErrorOutput}
|
||||
)
|
||||
def kill_pods(cfg: KillPodConfig) -> typing.Tuple[str, typing.Union[PodKillSuccessOutput, PodErrorOutput]]:
|
||||
try:
|
||||
with setup_kubernetes(None) as cli:
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
|
||||
# region Select target pods
|
||||
pods = _find_pods(core_v1, cfg.label_selector, cfg.name_pattern, cfg.namespace_pattern)
|
||||
if len(pods) < cfg.kill:
|
||||
return "error", PodErrorOutput(
|
||||
"Not enough pods match the criteria, expected {} but found only {} pods".format(cfg.kill, len(pods))
|
||||
)
|
||||
random.shuffle(pods)
|
||||
# endregion
|
||||
|
||||
# region Remove pods
|
||||
killed_pods: typing.Dict[int, Pod] = {}
|
||||
watch_pods: typing.List[Pod] = []
|
||||
for i in range(cfg.kill):
|
||||
pod = pods[i]
|
||||
core_v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace, body=V1DeleteOptions(
|
||||
grace_period_seconds=0,
|
||||
))
|
||||
p = Pod(
|
||||
pod.metadata.namespace,
|
||||
pod.metadata.name
|
||||
)
|
||||
killed_pods[int(time.time_ns())] = p
|
||||
watch_pods.append(p)
|
||||
# endregion
|
||||
|
||||
# region Wait for pods to be removed
|
||||
start_time = time.time()
|
||||
while len(watch_pods) > 0:
|
||||
time.sleep(cfg.backoff)
|
||||
new_watch_pods: typing.List[Pod] = []
|
||||
for p in watch_pods:
|
||||
try:
|
||||
read_pod = core_v1.read_namespaced_pod(p.name, p.namespace)
|
||||
new_watch_pods.append(p)
|
||||
except ApiException as e:
|
||||
if e.status != 404:
|
||||
raise
|
||||
watch_pods = new_watch_pods
|
||||
current_time = time.time()
|
||||
if current_time - start_time > cfg.timeout:
|
||||
return "error", PodErrorOutput("Timeout while waiting for pods to be removed.")
|
||||
return "success", PodKillSuccessOutput(killed_pods)
|
||||
# endregion
|
||||
except Exception:
|
||||
return "error", PodErrorOutput(
|
||||
format_exc()
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class WaitForPodsConfig:
|
||||
"""
|
||||
WaitForPodsConfig is a configuration structure for wait-for-pod steps.
|
||||
"""
|
||||
|
||||
namespace_pattern: re.Pattern
|
||||
|
||||
name_pattern: typing.Annotated[
|
||||
typing.Optional[re.Pattern],
|
||||
validation.required_if_not("label_selector")
|
||||
] = None
|
||||
|
||||
label_selector: typing.Annotated[
|
||||
typing.Optional[str],
|
||||
validation.min(1),
|
||||
validation.required_if_not("name_pattern")
|
||||
] = None
|
||||
|
||||
count: typing.Annotated[int, validation.min(1)] = field(
|
||||
default=1,
|
||||
metadata={"name": "Pod count", "description": "Wait for at least this many pods to exist"}
|
||||
)
|
||||
|
||||
timeout: typing.Annotated[int, validation.min(1)] = field(
|
||||
default=180,
|
||||
metadata={"name": "Timeout", "description": "How many seconds to wait for?"}
|
||||
)
|
||||
|
||||
backoff: int = field(default=1, metadata={
|
||||
"name": "Backoff",
|
||||
"description": "How many seconds to wait between checks for the target pod status."
|
||||
})
|
||||
|
||||
kubeconfig_path: typing.Optional[str] = None
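# Illustrative sketch: wait up to two minutes for three pods labelled
# app=nginx (label value assumed) to exist in the "default" namespace.
#
#   cfg = WaitForPodsConfig(
#       namespace_pattern=re.compile(r"^default$"),
#       label_selector="app=nginx",
#       count=3,
#       timeout=120,
#   )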
|
||||
|
||||
|
||||
@plugin.step(
|
||||
"wait-for-pods",
|
||||
"Wait for pods",
|
||||
"Wait for the specified number of pods to be present",
|
||||
{"success": PodWaitSuccessOutput, "error": PodErrorOutput}
|
||||
)
|
||||
def wait_for_pods(cfg: WaitForPodsConfig) -> typing.Tuple[str, typing.Union[PodWaitSuccessOutput, PodErrorOutput]]:
|
||||
try:
|
||||
with setup_kubernetes(None) as cli:
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
|
||||
timeout = False
|
||||
start_time = datetime.now()
|
||||
while not timeout:
|
||||
pods = _find_pods(core_v1, cfg.label_selector, cfg.name_pattern, cfg.namespace_pattern)
|
||||
if len(pods) >= cfg.count:
|
||||
return "success", \
|
||||
PodWaitSuccessOutput(list(map(lambda p: Pod(p.metadata.namespace, p.metadata.name), pods)))
|
||||
|
||||
time.sleep(cfg.backoff)
|
||||
|
||||
now_time = datetime.now()
|
||||
|
||||
time_diff = now_time - start_time
|
||||
if time_diff.seconds > cfg.timeout:
|
||||
return "error", PodErrorOutput(
|
||||
"timeout while waiting for pods to come up"
|
||||
)
|
||||
except Exception:
|
||||
return "error", PodErrorOutput(
|
||||
format_exc()
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(plugin.run(plugin.build_schema(
|
||||
kill_pods,
|
||||
wait_for_pods,
|
||||
)))
|
||||
50
kraken/plugins/run_python_plugin.py
Normal file
@@ -0,0 +1,50 @@
import dataclasses
import subprocess
import sys
import typing

from arcaflow_plugin_sdk import plugin


@dataclasses.dataclass
class RunPythonFileInput:
    filename: str


@dataclasses.dataclass
class RunPythonFileOutput:
    stdout: str
    stderr: str


@dataclasses.dataclass
class RunPythonFileError:
    exit_code: int
    stdout: str
    stderr: str


@plugin.step(
    id="run_python",
    name="Run a Python script",
    description="Run a specified Python script",
    outputs={"success": RunPythonFileOutput, "error": RunPythonFileError}
)
def run_python_file(params: RunPythonFileInput) -> typing.Tuple[
    str,
    typing.Union[RunPythonFileOutput, RunPythonFileError]
]:
    run_results = subprocess.run(
        [sys.executable, params.filename],
        capture_output=True
    )
    if run_results.returncode == 0:
        return "success", RunPythonFileOutput(
            str(run_results.stdout, 'utf-8'),
            str(run_results.stderr, 'utf-8')
        )
    return "error", RunPythonFileError(
        run_results.returncode,
        str(run_results.stdout, 'utf-8'),
        str(run_results.stderr, 'utf-8')
    )
179
kraken/plugins/vmware/kubernetes_functions.py
Normal file
@@ -0,0 +1,179 @@
|
||||
from kubernetes import config, client
|
||||
from kubernetes.client.rest import ApiException
|
||||
import logging
|
||||
import random
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class Actions(Enum):
|
||||
"""
|
||||
This enumeration indicates different kinds of node operations
|
||||
"""
|
||||
|
||||
START = "Start"
|
||||
STOP = "Stop"
|
||||
TERMINATE = "Terminate"
|
||||
REBOOT = "Reboot"
|
||||
|
||||
|
||||
def setup_kubernetes(kubeconfig_path):
|
||||
"""
|
||||
Sets up the Kubernetes client
|
||||
"""
|
||||
|
||||
if kubeconfig_path is None:
|
||||
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
|
||||
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
|
||||
|
||||
if kubeconfig.config is None:
|
||||
raise Exception(
|
||||
"Invalid kube-config file: %s. " "No configuration found." % kubeconfig_path
|
||||
)
|
||||
loader = config.kube_config.KubeConfigLoader(
|
||||
config_dict=kubeconfig.config,
|
||||
)
|
||||
client_config = client.Configuration()
|
||||
loader.load_and_set(client_config)
|
||||
return client.ApiClient(configuration=client_config)
|
||||
|
||||
|
||||
def list_killable_nodes(core_v1, label_selector=None):
|
||||
"""
|
||||
Returns a list of nodes that can be stopped/reset/released
|
||||
"""
|
||||
|
||||
nodes = []
|
||||
try:
|
||||
if label_selector:
|
||||
ret = core_v1.list_node(pretty=True, label_selector=label_selector)
|
||||
else:
|
||||
ret = core_v1.list_node(pretty=True)
|
||||
except ApiException as e:
|
||||
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
|
||||
raise e
|
||||
for node in ret.items:
|
||||
for cond in node.status.conditions:
|
||||
if str(cond.type) == "Ready" and str(cond.status) == "True":
|
||||
nodes.append(node.metadata.name)
|
||||
return nodes
|
||||
|
||||
|
||||
def list_startable_nodes(core_v1, label_selector=None):
|
||||
"""
|
||||
Returns a list of nodes that can be started
|
||||
"""
|
||||
|
||||
nodes = []
|
||||
try:
|
||||
if label_selector:
|
||||
ret = core_v1.list_node(pretty=True, label_selector=label_selector)
|
||||
else:
|
||||
ret = core_v1.list_node(pretty=True)
|
||||
except ApiException as e:
|
||||
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
|
||||
raise e
|
||||
for node in ret.items:
|
||||
for cond in node.status.conditions:
|
||||
if str(cond.type) == "Ready" and str(cond.status) != "True":
|
||||
nodes.append(node.metadata.name)
|
||||
return nodes
|
||||
|
||||
|
||||
def get_node_list(cfg, action, core_v1):
|
||||
"""
|
||||
Returns a list of nodes to be used in the node scenarios. The list returned is constructed as follows:
|
||||
- If the key 'name' is present in the node scenario config, the value is extracted and split into
|
||||
a list
|
||||
- Each node in the list is fed to the get_node function which checks if the node is killable or
|
||||
fetches the node using the label selector
|
||||
"""
|
||||
|
||||
def get_node(node_name, label_selector, instance_kill_count, action, core_v1):
|
||||
list_nodes_func = (
|
||||
list_startable_nodes if action == Actions.START else list_killable_nodes
|
||||
)
|
||||
if node_name in list_nodes_func(core_v1):
|
||||
return [node_name]
|
||||
elif node_name:
|
||||
logging.info(
|
||||
"Node with provided node_name does not exist or the node might "
|
||||
"be in NotReady state."
|
||||
)
|
||||
nodes = list_nodes_func(core_v1, label_selector)
|
||||
if not nodes:
|
||||
raise Exception("Ready nodes with the provided label selector do not exist")
|
||||
logging.info(
|
||||
"Ready nodes with the label selector %s: %s" % (label_selector, nodes)
|
||||
)
|
||||
number_of_nodes = len(nodes)
|
||||
if instance_kill_count == number_of_nodes:
|
||||
return nodes
|
||||
nodes_to_return = []
|
||||
for i in range(instance_kill_count):
|
||||
node_to_add = nodes[random.randint(0, len(nodes) - 1)]
|
||||
nodes_to_return.append(node_to_add)
|
||||
nodes.remove(node_to_add)
|
||||
return nodes_to_return
|
||||
|
||||
if cfg.name:
|
||||
input_nodes = cfg.name.split(",")
|
||||
else:
|
||||
input_nodes = [""]
|
||||
scenario_nodes = set()
|
||||
|
||||
if cfg.skip_openshift_checks:
|
||||
scenario_nodes = input_nodes
|
||||
else:
|
||||
for node in input_nodes:
|
||||
nodes = get_node(
|
||||
node, cfg.label_selector, cfg.instance_count, action, core_v1
|
||||
)
|
||||
scenario_nodes.update(nodes)
|
||||
|
||||
return list(scenario_nodes)
|
||||
|
||||
|
||||
def watch_node_status(node, status, timeout, watch_resource, core_v1):
|
||||
"""
|
||||
Monitor the status of a node for change
|
||||
"""
|
||||
count = timeout
|
||||
for event in watch_resource.stream(
|
||||
core_v1.list_node,
|
||||
field_selector=f"metadata.name={node}",
|
||||
timeout_seconds=timeout,
|
||||
):
|
||||
conditions = [
|
||||
status
|
||||
for status in event["object"].status.conditions
|
||||
if status.type == "Ready"
|
||||
]
|
||||
if conditions[0].status == status:
|
||||
watch_resource.stop()
|
||||
break
|
||||
else:
|
||||
count -= 1
|
||||
logging.info("Status of node " + node + ": " + str(conditions[0].status))
|
||||
if not count:
|
||||
watch_resource.stop()
|
||||
|
||||
|
||||
def wait_for_ready_status(node, timeout, watch_resource, core_v1):
|
||||
"""
|
||||
Wait until the node status becomes Ready
|
||||
"""
|
||||
watch_node_status(node, "True", timeout, watch_resource, core_v1)
|
||||
|
||||
|
||||
def wait_for_not_ready_status(node, timeout, watch_resource, core_v1):
|
||||
"""
|
||||
Wait until the node status becomes Not Ready
|
||||
"""
|
||||
watch_node_status(node, "False", timeout, watch_resource, core_v1)
|
||||
|
||||
|
||||
def wait_for_unknown_status(node, timeout, watch_resource, core_v1):
|
||||
"""
|
||||
Wait until the node status becomes Unknown
|
||||
"""
|
||||
watch_node_status(node, "Unknown", timeout, watch_resource, core_v1)
|
||||
770
kraken/plugins/vmware/vmware_plugin.py
Normal file
@@ -0,0 +1,770 @@
|
||||
#!/usr/bin/env python
|
||||
import logging
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
import typing
|
||||
from dataclasses import dataclass, field
|
||||
from os import environ
|
||||
from traceback import format_exc
|
||||
|
||||
import requests
|
||||
from arcaflow_plugin_sdk import plugin, validation
|
||||
from com.vmware.vapi.std.errors_client import (AlreadyInDesiredState,
|
||||
NotAllowedInCurrentState)
|
||||
from com.vmware.vcenter.vm_client import Power
|
||||
from com.vmware.vcenter_client import VM, ResourcePool
|
||||
from kubernetes import client, watch
|
||||
from vmware.vapi.vsphere.client import create_vsphere_client
|
||||
|
||||
from kraken.plugins.vmware import kubernetes_functions as kube_helper
|
||||
|
||||
|
||||
class vSphere:
|
||||
def __init__(self, verify=True):
|
||||
"""
|
||||
Initialize the vSphere client by using the env variables:
|
||||
'VSPHERE_IP', 'VSPHERE_USERNAME', 'VSPHERE_PASSWORD'
|
||||
"""
|
||||
self.server = environ.get("VSPHERE_IP")
|
||||
self.username = environ.get("VSPHERE_USERNAME")
|
||||
self.password = environ.get("VSPHERE_PASSWORD")
|
||||
session = self.get_unverified_session() if not verify else None
|
||||
self.credentials_present = (
|
||||
True if self.server and self.username and self.password else False
|
||||
)
|
||||
if not self.credentials_present:
|
||||
raise Exception(
|
||||
"Environmental variables "
|
||||
"'VSPHERE_IP', 'VSPHERE_USERNAME', "
|
||||
"'VSPHERE_PASSWORD' are not set"
|
||||
)
|
||||
self.client = create_vsphere_client(
|
||||
server=self.server,
|
||||
username=self.username,
|
||||
password=self.password,
|
||||
session=session,
|
||||
)
|
||||
|
||||
def get_unverified_session(self):
|
||||
"""
|
||||
Returns an unverified session object
|
||||
"""
|
||||
|
||||
session = requests.session()
|
||||
session.verify = False
|
||||
requests.packages.urllib3.disable_warnings()
|
||||
return session
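    # Illustrative use (assumes the VSPHERE_IP/VSPHERE_USERNAME/
    # VSPHERE_PASSWORD environment variables are set and that the named VM
    # exists):
    #
    #   vsphere = vSphere(verify=False)
    #   vsphere.reboot_instances("kraken-worker-0")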
|
||||
|
||||
def get_vm(self, instance_id):
|
||||
"""
|
||||
Returns the VM ID corresponding to the VM Name (instance_id)
|
||||
If there are multiple matches, this only returns the first one
|
||||
"""
|
||||
|
||||
names = set([instance_id])
|
||||
vms = self.client.vcenter.VM.list(VM.FilterSpec(names=names))
|
||||
|
||||
if len(vms) == 0:
|
||||
logging.info("VM with name ({}) not found", instance_id)
|
||||
return None
|
||||
vm = vms[0].vm
|
||||
|
||||
return vm
|
||||
|
||||
def release_instances(self, instance_id):
|
||||
"""
|
||||
Deletes the VM whose name is given by 'instance_id'
|
||||
"""
|
||||
|
||||
vm = self.get_vm(instance_id)
|
||||
if not vm:
|
||||
raise Exception(
|
||||
"VM with the name ({}) does not exist."
|
||||
"Please create the vm first.".format(instance_id)
|
||||
)
|
||||
state = self.client.vcenter.vm.Power.get(vm)
|
||||
if state == Power.Info(state=Power.State.POWERED_ON):
|
||||
self.client.vcenter.vm.Power.stop(vm)
|
||||
elif state == Power.Info(state=Power.State.SUSPENDED):
|
||||
self.client.vcenter.vm.Power.start(vm)
|
||||
self.client.vcenter.vm.Power.stop(vm)
|
||||
self.client.vcenter.VM.delete(vm)
|
||||
logging.info("Deleted VM -- '{}-({})'", instance_id, vm)
|
||||
|
||||
def reboot_instances(self, instance_id):
|
||||
"""
|
||||
Reboots the VM whose name is given by 'instance_id'.
|
||||
@Returns: True if successful, or False if the VM is not powered on
|
||||
"""
|
||||
|
||||
vm = self.get_vm(instance_id)
|
||||
try:
|
||||
self.client.vcenter.vm.Power.reset(vm)
|
||||
logging.info("Reset VM -- '{}-({})'", instance_id, vm)
|
||||
return True
|
||||
except NotAllowedInCurrentState:
|
||||
logging.info(
|
||||
"VM '{}'-'({})' is not Powered On. Cannot reset it",
|
||||
instance_id,
|
||||
vm
|
||||
)
|
||||
return False
|
||||
|
||||
def stop_instances(self, instance_id):
|
||||
"""
|
||||
Stops the VM whose name is given by 'instance_id'.
|
||||
@Returns: True if successful, or False if the VM is already powered off
|
||||
"""
|
||||
|
||||
vm = self.get_vm(instance_id)
|
||||
try:
|
||||
self.client.vcenter.vm.Power.stop(vm)
|
||||
logging.info("Stopped VM -- '{}-({})'", instance_id, vm)
|
||||
return True
|
||||
except AlreadyInDesiredState:
|
||||
logging.info(
|
||||
"VM '{}'-'({})' is already Powered Off", instance_id, vm
|
||||
)
|
||||
return False
|
||||
|
||||
def start_instances(self, instance_id):
|
||||
"""
|
||||
Starts the VM whose name is given by 'instance_id'.
|
||||
@Returns: True if successful, or False if the VM is already powered on
|
||||
"""
|
||||
|
||||
vm = self.get_vm(instance_id)
|
||||
try:
|
||||
self.client.vcenter.vm.Power.start(vm)
|
||||
logging.info("Started VM -- '{}-({})'", instance_id, vm)
|
||||
return True
|
||||
except AlreadyInDesiredState:
|
||||
logging.info(
|
||||
"VM '{}'-'({})' is already Powered On", instance_id, vm
|
||||
)
|
||||
return False
|
||||
|
||||
def list_instances(self, datacenter):
|
||||
"""
|
||||
@Returns: a list of VMs present in the datacenter
|
||||
"""
|
||||
|
||||
datacenter_filter = self.client.vcenter.Datacenter.FilterSpec(
|
||||
names=set([datacenter])
|
||||
)
|
||||
datacenter_summaries = self.client.vcenter.Datacenter.list(
|
||||
datacenter_filter
|
||||
)
|
||||
try:
|
||||
datacenter_id = datacenter_summaries[0].datacenter
|
||||
except IndexError:
|
||||
logging.error("Datacenter '{}' doesn't exist", datacenter)
|
||||
sys.exit(1)
|
||||
|
||||
vm_filter = self.client.vcenter.VM.FilterSpec(
|
||||
datacenters={datacenter_id}
|
||||
)
|
||||
vm_summaries = self.client.vcenter.VM.list(vm_filter)
|
||||
vm_names = []
|
||||
for vm in vm_summaries:
|
||||
vm_names.append({"vm_name": vm.name, "vm_id": vm.vm})
|
||||
return vm_names
|
||||
|
||||
def get_datacenter_list(self):
|
||||
"""
|
||||
Returns a dictionary containing all the datacenter names and IDs
|
||||
"""
|
||||
|
||||
datacenter_summaries = self.client.vcenter.Datacenter.list()
|
||||
datacenter_names = [
|
||||
{
|
||||
"datacenter_id": datacenter.datacenter,
|
||||
"datacenter_name": datacenter.name
|
||||
}
|
||||
for datacenter in datacenter_summaries
|
||||
]
|
||||
return datacenter_names
|
||||
|
||||
def get_datastore_list(self, datacenter=None):
|
||||
"""
|
||||
@Returns: a dictionary containing all the datastore names and
|
||||
IDs belonging to a specific datacenter
|
||||
"""
|
||||
|
||||
datastore_filter = self.client.vcenter.Datastore.FilterSpec(
|
||||
datacenters={datacenter}
|
||||
)
|
||||
datastore_summaries = self.client.vcenter.Datastore.list(
|
||||
datastore_filter
|
||||
)
|
||||
datastore_names = []
|
||||
for datastore in datastore_summaries:
|
||||
datastore_names.append(
|
||||
{
|
||||
"datastore_name": datastore.name,
|
||||
"datastore_id": datastore.datastore
|
||||
}
|
||||
)
|
||||
return datastore_names
|
||||
|
||||
def get_folder_list(self, datacenter=None):
|
||||
"""
|
||||
@Returns: a dictionary containing all the folder names and
|
||||
IDs belonging to a specific datacenter
|
||||
"""
|
||||
|
||||
folder_filter = self.client.vcenter.Folder.FilterSpec(
|
||||
datacenters={datacenter}
|
||||
)
|
||||
folder_summaries = self.client.vcenter.Folder.list(folder_filter)
|
||||
folder_names = []
|
||||
for folder in folder_summaries:
|
||||
folder_names.append(
|
||||
{"folder_name": folder.name, "folder_id": folder.folder}
|
||||
)
|
||||
return folder_names
|
||||
|
||||
def get_resource_pool(self, datacenter, resource_pool_name=None):
|
||||
"""
|
||||
Returns the identifier of the resource pool with the given name or the
|
||||
first resource pool in the datacenter if the name is not provided.
|
||||
"""
|
||||
|
||||
names = set([resource_pool_name]) if resource_pool_name else None
|
||||
filter_spec = ResourcePool.FilterSpec(
|
||||
datacenters=set([datacenter]), names=names
|
||||
)
|
||||
resource_pool_summaries = self.client.vcenter.ResourcePool.list(
|
||||
filter_spec
|
||||
)
|
||||
if len(resource_pool_summaries) > 0:
|
||||
resource_pool = resource_pool_summaries[0].resource_pool
|
||||
return resource_pool
|
||||
else:
|
||||
logging.error(
"ResourcePool not found in Datacenter '%s'",
datacenter
)
|
||||
return None
|
||||
|
||||
def create_default_vm(self, guest_os="RHEL_7_64", max_attempts=10):
|
||||
"""
|
||||
Creates a default VM with 2 GB memory, 1 CPU and 16 GB disk space in a
|
||||
random datacenter. Accepts the guest OS as a parameter. Since the VM
|
||||
placement is random, it might fail due to resource constraints.
|
||||
So this function makes up to 'max_attempts' attempts to create the VM.
|
||||
"""
|
||||
|
||||
def create_vm(vm_name, resource_pool, folder, datastore, guest_os):
|
||||
"""
|
||||
Creates a VM and returns its ID and name. Requires the VM name,
|
||||
resource pool name, folder name, datastore and the guest OS
|
||||
"""
|
||||
|
||||
placement_spec = VM.PlacementSpec(
|
||||
folder=folder, resource_pool=resource_pool, datastore=datastore
|
||||
)
|
||||
vm_create_spec = VM.CreateSpec(
|
||||
name=vm_name, guest_os=guest_os, placement=placement_spec
|
||||
)
|
||||
|
||||
vm_id = self.client.vcenter.VM.create(vm_create_spec)
|
||||
return vm_id
|
||||
|
||||
for _ in range(max_attempts):
|
||||
try:
|
||||
datacenter_list = self.get_datacenter_list()
|
||||
# random generator not used for
|
||||
# security/cryptographic purposes in this loop
|
||||
datacenter = random.choice(datacenter_list) # nosec
|
||||
resource_pool = self.get_resource_pool(
|
||||
datacenter["datacenter_id"]
|
||||
)
|
||||
folder = random.choice( # nosec
|
||||
self.get_folder_list(datacenter["datacenter_id"])
|
||||
)["folder_id"]
|
||||
datastore = random.choice( # nosec
|
||||
self.get_datastore_list(datacenter["datacenter_id"])
|
||||
)["datastore_id"]
|
||||
vm_name = "Test-" + str(time.time_ns())
|
||||
return (
|
||||
create_vm(
|
||||
vm_name,
|
||||
resource_pool,
|
||||
folder,
|
||||
datastore,
|
||||
guest_os
|
||||
),
|
||||
vm_name,
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Default VM could not be created, retrying. "
|
||||
"Error was: %s",
|
||||
str(e)
|
||||
)
|
||||
logging.error(
|
||||
"Default VM could not be created in %s attempts. "
|
||||
"Check your VMware resources",
|
||||
max_attempts
|
||||
)
|
||||
return None, None
|
||||
|
||||
def get_vm_status(self, instance_id):
|
||||
"""
|
||||
Returns the status of the VM whose name is given by 'instance_id'
|
||||
"""
|
||||
|
||||
try:
|
||||
vm = self.get_vm(instance_id)
|
||||
state = self.client.vcenter.vm.Power.get(vm).state
|
||||
logging.info("Check instance %s status", instance_id)
|
||||
return state
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to get node instance status %s. Encountered following "
|
||||
"exception: %s.", instance_id, e
|
||||
)
|
||||
return None
|
||||
|
||||
def wait_until_released(self, instance_id, timeout):
|
||||
"""
|
||||
Waits until the VM is deleted or until the timeout. Returns True if
|
||||
the VM is successfully deleted, else returns False
|
||||
"""
|
||||
|
||||
time_counter = 0
|
||||
vm = self.get_vm(instance_id)
|
||||
while vm is not None:
|
||||
vm = self.get_vm(instance_id)
|
||||
logging.info(
|
||||
"VM %s is still being deleted, "
|
||||
"sleeping for 5 seconds",
|
||||
instance_id
|
||||
)
|
||||
time.sleep(5)
|
||||
time_counter += 5
|
||||
if time_counter >= timeout:
|
||||
logging.info(
|
||||
"VM %s is still not deleted in allotted time",
|
||||
instance_id
|
||||
)
|
||||
return False
|
||||
return True
|
||||
|
||||
def wait_until_running(self, instance_id, timeout):
|
||||
"""
|
||||
Waits until the VM switches to POWERED_ON state or until the timeout.
|
||||
Returns True if the VM switches to POWERED_ON, else returns False
|
||||
"""
|
||||
|
||||
time_counter = 0
|
||||
status = self.get_vm_status(instance_id)
|
||||
while status != Power.State.POWERED_ON:
|
||||
status = self.get_vm_status(instance_id)
|
||||
logging.info(
|
||||
"VM %s is still not running, "
|
||||
"sleeping for 5 seconds",
|
||||
instance_id
|
||||
)
|
||||
time.sleep(5)
|
||||
time_counter += 5
|
||||
if time_counter >= timeout:
|
||||
logging.info(
|
||||
"VM %s is still not ready in allotted time",
|
||||
instance_id
|
||||
)
|
||||
return False
|
||||
return True
|
||||
|
||||
def wait_until_stopped(self, instance_id, timeout):
|
||||
"""
|
||||
Waits until the VM switches to POWERED_OFF state or until the timeout.
|
||||
Returns True if the VM switches to POWERED_OFF, else returns False
|
||||
"""
|
||||
|
||||
time_counter = 0
|
||||
status = self.get_vm_status(instance_id)
|
||||
while status != Power.State.POWERED_OFF:
|
||||
status = self.get_vm_status(instance_id)
|
||||
logging.info(
|
||||
"VM %s is still not running, "
|
||||
"sleeping for 5 seconds",
|
||||
instance_id
|
||||
)
|
||||
time.sleep(5)
|
||||
time_counter += 5
|
||||
if time_counter >= timeout:
|
||||
logging.info(
|
||||
"VM %s is still not ready in allotted time",
|
||||
instance_id
|
||||
)
|
||||
return False
|
||||
return True
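# A minimal usage sketch of the helper above (illustration only: the VM
# name "worker-0" and the 180s timeouts are assumptions, and session
# setup is handled by the vSphere constructor, which is outside this
# hunk):
#
#     vsphere = vSphere(verify=False)
#     if vsphere.get_vm_status("worker-0") == Power.State.POWERED_ON:
#         if vsphere.stop_instances("worker-0"):
#             vsphere.wait_until_stopped("worker-0", 180)
#         vsphere.start_instances("worker-0")
#         vsphere.wait_until_running("worker-0", 180)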
|
||||
|
||||
|
||||
@dataclass
|
||||
class Node:
|
||||
name: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class NodeScenarioSuccessOutput:
|
||||
|
||||
nodes: typing.Dict[int, Node] = field(
|
||||
metadata={
|
||||
"name": "Nodes started/stopped/terminated/rebooted",
|
||||
"description": "Map between timestamps and the pods "
|
||||
"started/stopped/terminated/rebooted. "
|
||||
"The timestamp is provided in nanoseconds",
|
||||
}
|
||||
)
|
||||
action: kube_helper.Actions = field(
|
||||
metadata={
|
||||
"name": "The action performed on the node",
|
||||
"description": "The action performed or attempted to be "
|
||||
"performed on the node. Possible values"
|
||||
"are : Start, Stop, Terminate, Reboot",
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class NodeScenarioErrorOutput:
|
||||
|
||||
error: str
|
||||
action: kube_helper.Actions = field(
|
||||
metadata={
|
||||
"name": "The action performed on the node",
|
||||
"description": "The action attempted to be performed on the node. "
|
||||
"Possible values are : Start Stop, Terminate, Reboot",
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class NodeScenarioConfig:
|
||||
|
||||
name: typing.Annotated[
|
||||
typing.Optional[str],
|
||||
validation.required_if_not("label_selector"),
|
||||
validation.required_if("skip_openshift_checks"),
|
||||
] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Name",
|
||||
"description": "Name(s) for target nodes. "
|
||||
"Required if label_selector is not set.",
|
||||
},
|
||||
)
|
||||
|
||||
runs: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
|
||||
default=1,
|
||||
metadata={
|
||||
"name": "Number of runs per node",
|
||||
"description": "Number of times to inject each scenario under "
|
||||
"actions (will perform on same node each time)",
|
||||
},
|
||||
)
|
||||
|
||||
label_selector: typing.Annotated[
|
||||
typing.Optional[str],
|
||||
validation.min(1),
|
||||
validation.required_if_not("name")
|
||||
] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Label selector",
|
||||
"description": "Kubernetes label selector for the target nodes. "
|
||||
"Required if name is not set.\n"
|
||||
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
|
||||
"for details.",
|
||||
},
|
||||
)
|
||||
|
||||
timeout: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
|
||||
default=180,
|
||||
metadata={
|
||||
"name": "Timeout",
|
||||
"description": "Timeout to wait for the target pod(s) "
|
||||
"to be removed in seconds.",
|
||||
},
|
||||
)
|
||||
|
||||
instance_count: typing.Annotated[
|
||||
typing.Optional[int],
|
||||
validation.min(1)
|
||||
] = field(
|
||||
default=1,
|
||||
metadata={
|
||||
"name": "Instance Count",
|
||||
"description": "Number of nodes to perform action/select "
|
||||
"that match the label selector.",
|
||||
},
|
||||
)
|
||||
|
||||
skip_openshift_checks: typing.Optional[bool] = field(
|
||||
default=False,
|
||||
metadata={
|
||||
"name": "Skip Openshift Checks",
|
||||
"description": "Skip checking the status of the openshift nodes.",
|
||||
},
|
||||
)
|
||||
|
||||
verify_session: bool = field(
|
||||
default=True,
|
||||
metadata={
|
||||
"name": "Verify API Session",
|
||||
"description": "Verifies the vSphere client session. "
|
||||
"It is enabled by default",
|
||||
},
|
||||
)
|
||||
|
||||
kubeconfig_path: typing.Optional[str] = field(
|
||||
default=None,
|
||||
metadata={
|
||||
"name": "Kubeconfig path",
|
||||
"description": "Path to your Kubeconfig file. "
|
||||
"Defaults to ~/.kube/config.\n"
|
||||
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
|
||||
"for details.",
|
||||
},
|
||||
)
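# An illustrative scenario config for the steps below (a sketch only; the
# label selector value is an assumption, field names come from
# NodeScenarioConfig):
#
#     - id: node_stop_scenario
#       config:
#         label_selector: node-role.kubernetes.io/worker
#         instance_count: 1
#         runs: 1
#         timeout: 180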
|
||||
|
||||
|
||||
@plugin.step(
|
||||
id="node_start_scenario",
|
||||
name="Start the node",
|
||||
description="Start the node(s) by starting the VMware VM "
|
||||
"on which the node is configured",
|
||||
outputs={
|
||||
"success": NodeScenarioSuccessOutput,
|
||||
"error": NodeScenarioErrorOutput
|
||||
},
|
||||
)
|
||||
def node_start(
|
||||
cfg: NodeScenarioConfig,
|
||||
) -> typing.Tuple[
|
||||
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
|
||||
]:
|
||||
with kube_helper.setup_kubernetes(None) as cli:
|
||||
vsphere = vSphere(verify=cfg.verify_session)
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
watch_resource = watch.Watch()
|
||||
node_list = kube_helper.get_node_list(
|
||||
cfg,
|
||||
kube_helper.Actions.START,
|
||||
core_v1
|
||||
)
|
||||
nodes_started = {}
|
||||
for name in node_list:
|
||||
try:
|
||||
for _ in range(cfg.runs):
|
||||
logging.info("Starting node_start_scenario injection")
|
||||
logging.info("Starting the node %s ", name)
|
||||
vm_started = vsphere.start_instances(name)
|
||||
if vm_started:
|
||||
vsphere.wait_until_running(name, cfg.timeout)
|
||||
if not cfg.skip_openshift_checks:
|
||||
kube_helper.wait_for_ready_status(
|
||||
name, cfg.timeout, watch_resource, core_v1
|
||||
)
|
||||
nodes_started[int(time.time_ns())] = Node(name=name)
|
||||
logging.info(
|
||||
"Node with instance ID: %s is in running state", name
|
||||
)
|
||||
logging.info(
|
||||
"node_start_scenario has been successfully injected!"
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error("Failed to start node instance. Test Failed")
|
||||
logging.error(
|
||||
"node_start_scenario injection failed! "
|
||||
"Error was: %s", str(e)
|
||||
)
|
||||
return "error", NodeScenarioErrorOutput(
|
||||
format_exc(), kube_helper.Actions.START
|
||||
)
|
||||
|
||||
return "success", NodeScenarioSuccessOutput(
|
||||
nodes_started, kube_helper.Actions.START
|
||||
)
|
||||
|
||||
|
||||
@plugin.step(
|
||||
id="node_stop_scenario",
|
||||
name="Stop the node",
|
||||
description="Stop the node(s) by starting the VMware VM "
|
||||
"on which the node is configured",
|
||||
outputs={
|
||||
"success": NodeScenarioSuccessOutput,
|
||||
"error": NodeScenarioErrorOutput
|
||||
},
|
||||
)
|
||||
def node_stop(
|
||||
cfg: NodeScenarioConfig,
|
||||
) -> typing.Tuple[
|
||||
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
|
||||
]:
|
||||
with kube_helper.setup_kubernetes(None) as cli:
|
||||
vsphere = vSphere(verify=cfg.verify_session)
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
watch_resource = watch.Watch()
|
||||
node_list = kube_helper.get_node_list(
|
||||
cfg,
|
||||
kube_helper.Actions.STOP,
|
||||
core_v1
|
||||
)
|
||||
nodes_stopped = {}
|
||||
for name in node_list:
|
||||
try:
|
||||
for _ in range(cfg.runs):
|
||||
logging.info("Starting node_stop_scenario injection")
|
||||
logging.info("Stopping the node %s ", name)
|
||||
vm_stopped = vsphere.stop_instances(name)
|
||||
if vm_stopped:
|
||||
vsphere.wait_until_stopped(name, cfg.timeout)
|
||||
if not cfg.skip_openshift_checks:
|
||||
kube_helper.wait_for_unknown_status(
|
||||
name, cfg.timeout, watch_resource, core_v1
|
||||
)
|
||||
nodes_stopped[int(time.time_ns())] = Node(name=name)
|
||||
logging.info(
|
||||
"Node with instance ID: %s is in stopped state", name
|
||||
)
|
||||
logging.info(
|
||||
"node_stop_scenario has been successfully injected!"
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error("Failed to stop node instance. Test Failed")
|
||||
logging.error(
|
||||
"node_stop_scenario injection failed! "
|
||||
"Error was: %s", str(e)
|
||||
)
|
||||
return "error", NodeScenarioErrorOutput(
|
||||
format_exc(), kube_helper.Actions.STOP
|
||||
)
|
||||
|
||||
return "success", NodeScenarioSuccessOutput(
|
||||
nodes_stopped, kube_helper.Actions.STOP
|
||||
)
|
||||
|
||||
|
||||
@plugin.step(
|
||||
id="node_reboot_scenario",
|
||||
name="Reboot VMware VM",
|
||||
description="Reboot the node(s) by starting the VMware VM "
|
||||
"on which the node is configured",
|
||||
outputs={
|
||||
"success": NodeScenarioSuccessOutput,
|
||||
"error": NodeScenarioErrorOutput
|
||||
},
|
||||
)
|
||||
def node_reboot(
|
||||
cfg: NodeScenarioConfig,
|
||||
) -> typing.Tuple[
|
||||
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
|
||||
]:
|
||||
with kube_helper.setup_kubernetes(None) as cli:
|
||||
vsphere = vSphere(verify=cfg.verify_session)
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
watch_resource = watch.Watch()
|
||||
node_list = kube_helper.get_node_list(
|
||||
cfg,
|
||||
kube_helper.Actions.REBOOT,
|
||||
core_v1
|
||||
)
|
||||
nodes_rebooted = {}
|
||||
for name in node_list:
|
||||
try:
|
||||
for _ in range(cfg.runs):
|
||||
logging.info("Starting node_reboot_scenario injection")
|
||||
logging.info("Rebooting the node %s ", name)
|
||||
vsphere.reboot_instances(name)
|
||||
if not cfg.skip_openshift_checks:
|
||||
kube_helper.wait_for_unknown_status(
|
||||
name, cfg.timeout, watch_resource, core_v1
|
||||
)
|
||||
kube_helper.wait_for_ready_status(
|
||||
name, cfg.timeout, watch_resource, core_v1
|
||||
)
|
||||
nodes_rebooted[int(time.time_ns())] = Node(name=name)
|
||||
logging.info(
|
||||
"Node with instance ID: %s has rebooted "
|
||||
"successfully", name
|
||||
)
|
||||
logging.info(
|
||||
"node_reboot_scenario has been successfully injected!"
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error("Failed to reboot node instance. Test Failed")
|
||||
logging.error(
|
||||
"node_reboot_scenario injection failed! "
|
||||
"Error was: %s", str(e)
|
||||
)
|
||||
return "error", NodeScenarioErrorOutput(
|
||||
format_exc(), kube_helper.Actions.REBOOT
|
||||
)
|
||||
|
||||
return "success", NodeScenarioSuccessOutput(
|
||||
nodes_rebooted, kube_helper.Actions.REBOOT
|
||||
)
|
||||
|
||||
|
||||
@plugin.step(
|
||||
id="node_terminate_scenario",
|
||||
name="Reboot VMware VM",
|
||||
description="Wait for the specified number of pods to be present",
|
||||
outputs={
|
||||
"success": NodeScenarioSuccessOutput,
|
||||
"error": NodeScenarioErrorOutput
|
||||
},
|
||||
)
|
||||
def node_terminate(
|
||||
cfg: NodeScenarioConfig,
|
||||
) -> typing.Tuple[
|
||||
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
|
||||
]:
|
||||
with kube_helper.setup_kubernetes(None) as cli:
|
||||
vsphere = vSphere(verify=cfg.verify_session)
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
node_list = kube_helper.get_node_list(
|
||||
cfg, kube_helper.Actions.TERMINATE, core_v1
|
||||
)
|
||||
nodes_terminated = {}
|
||||
for name in node_list:
|
||||
try:
|
||||
for _ in range(cfg.runs):
|
||||
logging.info(
|
||||
"Starting node_termination_scenario injection "
|
||||
"by first stopping the node"
|
||||
)
|
||||
vsphere.stop_instances(name)
|
||||
vsphere.wait_until_stopped(name, cfg.timeout)
|
||||
logging.info(
|
||||
"Releasing the node with instance ID: %s ", name
|
||||
)
|
||||
vsphere.release_instances(name)
|
||||
vsphere.wait_until_released(name, cfg.timeout)
|
||||
nodes_terminated[int(time.time_ns())] = Node(name=name)
|
||||
logging.info(
|
||||
"Node with instance ID: %s has been released", name
|
||||
)
|
||||
logging.info(
|
||||
"node_terminate_scenario has been "
|
||||
"successfully injected!"
|
||||
)
|
||||
except Exception as e:
|
||||
logging.error("Failed to terminate node instance. Test Failed")
|
||||
logging.error(
|
||||
"node_terminate_scenario injection failed! "
|
||||
"Error was: %s", str(e)
|
||||
)
|
||||
return "error", NodeScenarioErrorOutput(
|
||||
format_exc(), kube_helper.Actions.TERMINATE
|
||||
)
|
||||
|
||||
return "success", NodeScenarioSuccessOutput(
|
||||
nodes_terminated, kube_helper.Actions.TERMINATE
|
||||
)
@@ -1,5 +1,8 @@
|
||||
import logging
|
||||
import kraken.invoke.command as runcommand
|
||||
|
||||
from arcaflow_plugin_sdk import serialization
|
||||
from kraken.plugins import pod_plugin
|
||||
|
||||
import kraken.cerberus.setup as cerberus
|
||||
import kraken.post_actions.actions as post_actions
|
||||
import kraken.kubernetes.client as kubecli
@@ -20,20 +23,30 @@ def run(kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_dur
|
||||
try:
|
||||
# capture start time
|
||||
start_time = int(time.time())
|
||||
scenario_logs = runcommand.invoke(
|
||||
"powerfulseal autonomous --use-pod-delete-instead-"
|
||||
"of-ssh-kill --policy-file %s --kubeconfig %s "
|
||||
"--no-cloud --inventory-kubernetes --headless" % (pod_scenario[0], kubeconfig_path)
|
||||
)
|
||||
|
||||
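# The plugin-based flow below replaces the PowerfulSeal invocation above:
# the scenario file is loaded, unserialized against the pod plugin's
# schema, and the "pod" step is called directly.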
input = serialization.load_from_file(pod_scenario)
|
||||
|
||||
s = pod_plugin.get_schema()
|
||||
input_data: pod_plugin.KillPodConfig = s.unserialize_input("pod", input)
|
||||
|
||||
if kubeconfig_path is not None:
|
||||
input_data.kubeconfig_path = kubeconfig_path
|
||||
|
||||
output_id, output_data = s.call_step("pod", input_data)
|
||||
|
||||
if output_id == "error":
|
||||
data: pod_plugin.PodErrorOutput = output_data
|
||||
logging.error("Failed to run pod scenario: {}".format(data.error))
|
||||
else:
|
||||
data: pod_plugin.PodSuccessOutput = output_data
|
||||
for pod in data.pods:
|
||||
print("Deleted pod {} in namespace {}\n".format(pod.pod_name, pod.pod_namespace))
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to run scenario: %s. Encountered the following " "exception: %s" % (pod_scenario[0], e)
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
# Display pod scenario logs/actions
|
||||
print(scenario_logs)
|
||||
|
||||
logging.info("Scenario: %s has been successfully injected!" % (pod_scenario[0]))
|
||||
logging.info("Waiting for the specified duration: %s" % (wait_duration))
|
||||
time.sleep(wait_duration)
@@ -119,14 +132,13 @@ def container_killing_in_pod(cont_scenario):
|
||||
container_pod_list = []
|
||||
for pod in pods:
|
||||
if type(pod) == list:
|
||||
container_names = runcommand.invoke(
|
||||
'kubectl get pods %s -n %s -o jsonpath="{.spec.containers[*].name}"' % (pod[0], pod[1])
|
||||
).split(" ")
|
||||
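# Container names are now read through the Kubernetes client wrapper
# instead of shelling out to kubectl.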
pod_output = kubecli.get_pod_info(pod[0], pod[1])
|
||||
container_names = [container.name for container in pod_output.containers]
|
||||
|
||||
container_pod_list.append([pod[0], pod[1], container_names])
|
||||
else:
|
||||
container_names = runcommand.invoke(
|
||||
'oc get pods %s -n %s -o jsonpath="{.spec.containers[*].name}"' % (pod, namespace)
|
||||
).split(" ")
|
||||
pod_output = kubecli.get_pod_info(pod, namespace)
|
||||
container_names = [container.name for container in pod_output.containers]
|
||||
container_pod_list.append([pod, namespace, container_names])
|
||||
|
||||
killed_count = 0
@@ -176,13 +188,11 @@ def check_failed_containers(killed_container_list, wait_time):
|
||||
while timer <= wait_time:
|
||||
for killed_container in killed_container_list:
|
||||
# each killed_container entry is [pod name, namespace, container name]
|
||||
pod_output = runcommand.invoke(
|
||||
"kubectl get pods %s -n %s -o yaml" % (killed_container[0], killed_container[1])
|
||||
)
|
||||
pod_output_yaml = yaml.full_load(pod_output)
|
||||
for statuses in pod_output_yaml["status"]["containerStatuses"]:
|
||||
if statuses["name"] == killed_container[2]:
|
||||
if str(statuses["ready"]).lower() == "true":
|
||||
pod_output = kubecli.get_pod_info(killed_container[0], killed_container[1])
|
||||
|
||||
for container in pod_output.containers:
|
||||
if container.name == killed_container[2]:
|
||||
if container.ready:
|
||||
container_ready.append(killed_container)
|
||||
if len(container_ready) != 0:
|
||||
for item in container_ready:
|
||||
|
||||
@@ -5,21 +5,7 @@ import kraken.invoke.command as runcommand
|
||||
def run(kubeconfig_path, scenario, pre_action_output=""):
|
||||
|
||||
if scenario.endswith(".yaml") or scenario.endswith(".yml"):
|
||||
action_output = runcommand.invoke(
|
||||
"powerfulseal autonomous "
|
||||
"--use-pod-delete-instead-of-ssh-kill"
|
||||
" --policy-file %s --kubeconfig %s --no-cloud"
|
||||
" --inventory-kubernetes --headless" % (scenario, kubeconfig_path)
|
||||
)
|
||||
# read output to make sure no error
|
||||
if "ERROR" in action_output:
|
||||
action_output.split("ERROR")[1].split("\n")[0]
|
||||
if not pre_action_output:
|
||||
logging.info("Powerful seal pre action check failed for " + str(scenario))
|
||||
return False
|
||||
else:
|
||||
logging.info(scenario + " post action checks passed")
|
||||
|
||||
logging.error("Powerfulseal support has recently been removed. Please switch to using plugins instead.")
|
||||
elif scenario.endswith(".py"):
|
||||
action_output = runcommand.invoke("python3 " + scenario).strip()
|
||||
if pre_action_output:
|
||||
@@ -9,5 +9,8 @@ def instance(distribution, prometheus_url, prometheus_bearer_token):
|
||||
)
|
||||
prometheus_url = "https://" + url
|
||||
if distribution == "openshift" and not prometheus_bearer_token:
|
||||
prometheus_bearer_token = runcommand.invoke("oc -n openshift-monitoring " "sa get-token prometheus-k8s")
|
||||
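# 'oc sa get-token' is unavailable on newer clusters/CLI versions, so fall
# back to 'oc create token' when the first command fails.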
prometheus_bearer_token = runcommand.invoke(
|
||||
"oc -n openshift-monitoring sa get-token prometheus-k8s "
|
||||
"|| oc create token -n openshift-monitoring prometheus-k8s"
|
||||
)
|
||||
return prometheus_url, prometheus_bearer_token
@@ -1,17 +1,19 @@
|
||||
import sys
|
||||
import yaml
|
||||
import re
|
||||
import json
|
||||
import logging
|
||||
import random
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import kraken.cerberus.setup as cerberus
|
||||
import kraken.kubernetes.client as kubecli
|
||||
import kraken.invoke.command as runcommand
|
||||
|
||||
# Reads the scenario config and creates a temp file to fill up the PVC
|
||||
import yaml
|
||||
|
||||
from ..cerberus import setup as cerberus
|
||||
from ..kubernetes import client as kubecli
|
||||
|
||||
|
||||
def run(scenarios_list, config):
|
||||
"""
|
||||
Reads the scenario config and creates a temp file to fill up the PVC
|
||||
"""
|
||||
failed_post_scenarios = ""
|
||||
for app_config in scenarios_list:
|
||||
if len(app_config) > 1:
|
||||
@@ -21,169 +23,265 @@ def run(scenarios_list, config):
|
||||
pvc_name = scenario_config.get("pvc_name", "")
|
||||
pod_name = scenario_config.get("pod_name", "")
|
||||
namespace = scenario_config.get("namespace", "")
|
||||
target_fill_percentage = scenario_config.get("fill_percentage", "50")
|
||||
target_fill_percentage = scenario_config.get(
|
||||
"fill_percentage", "50"
|
||||
)
|
||||
duration = scenario_config.get("duration", 60)
|
||||
|
||||
logging.info(
|
||||
"""Input params:
|
||||
pvc_name: '%s'\npod_name: '%s'\nnamespace: '%s'\ntarget_fill_percentage: '%s%%'\nduration: '%ss'"""
|
||||
% (str(pvc_name), str(pod_name), str(namespace), str(target_fill_percentage), str(duration))
|
||||
"Input params:\n"
|
||||
"pvc_name: '%s'\n"
|
||||
"pod_name: '%s'\n"
|
||||
"namespace: '%s'\n"
|
||||
"target_fill_percentage: '%s%%'\nduration: '%ss'"
|
||||
% (
|
||||
str(pvc_name),
|
||||
str(pod_name),
|
||||
str(namespace),
|
||||
str(target_fill_percentage),
|
||||
str(duration)
|
||||
)
|
||||
)
|
||||
|
||||
# Check input params
|
||||
if namespace is None:
|
||||
logging.error("You must specify the namespace where the PVC is")
|
||||
logging.error(
|
||||
"You must specify the namespace where the PVC is"
|
||||
)
|
||||
sys.exit(1)
|
||||
if pvc_name is None and pod_name is None:
|
||||
logging.error("You must specify the pvc_name or the pod_name")
|
||||
logging.error(
|
||||
"You must specify the pvc_name or the pod_name"
|
||||
)
|
||||
sys.exit(1)
|
||||
if pvc_name and pod_name:
|
||||
logging.info(
|
||||
"pod_name will be ignored, pod_name used will be a retrieved from the pod used in the pvc_name"
|
||||
"pod_name will be ignored, pod_name used will be "
|
||||
"a retrieved from the pod used in the pvc_name"
|
||||
)
|
||||
|
||||
# Get pod name
|
||||
if pvc_name:
|
||||
if pod_name:
|
||||
logging.info(
|
||||
"pod_name '%s' will be overridden from the pod mounted in the PVC" % (str(pod_name))
|
||||
"pod_name '%s' will be overridden with one of "
|
||||
"the pods mounted in the PVC" % (str(pod_name))
|
||||
)
|
||||
command = "kubectl describe pvc %s -n %s | grep -E 'Mounted By:|Used By:' | grep -Eo '[^: ]*$'" % (
|
||||
str(pvc_name),
|
||||
str(namespace),
|
||||
)
|
||||
logging.debug("Get pod name command:\n %s" % command)
|
||||
pod_name = runcommand.invoke(command, 60).rstrip()
|
||||
logging.info("Pod name: %s" % pod_name)
|
||||
if pod_name == "<none>":
|
||||
pvc = kubecli.get_pvc_info(pvc_name, namespace)
|
||||
try:
|
||||
# random generator not used for
|
||||
# security/cryptographic purposes.
|
||||
pod_name = random.choice(pvc.podNames) # nosec
|
||||
logging.info("Pod name: %s" % pod_name)
|
||||
except Exception:
|
||||
logging.error(
|
||||
"Pod associated with %s PVC, on namespace %s, not found" % (str(pvc_name), str(namespace))
|
||||
"Pod associated with %s PVC, on namespace %s, "
|
||||
"not found" % (str(pvc_name), str(namespace))
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
# Get volume name
|
||||
command = 'kubectl get pods %s -n %s -o json | jq -r ".spec.volumes"' % (
|
||||
str(pod_name),
|
||||
str(namespace),
|
||||
)
|
||||
logging.debug("Get mount path command:\n %s" % command)
|
||||
volumes_list = runcommand.invoke(command, 60).rstrip()
|
||||
volumes_list_json = json.loads(volumes_list)
|
||||
for entry in volumes_list_json:
|
||||
if len(entry["persistentVolumeClaim"]["claimName"]) > 0:
|
||||
volume_name = entry["name"]
|
||||
pvc_name = entry["persistentVolumeClaim"]["claimName"]
|
||||
pod = kubecli.get_pod_info(name=pod_name, namespace=namespace)
|
||||
|
||||
if pod is None:
|
||||
logging.error(
|
||||
"Exiting as pod '%s' doesn't exist "
|
||||
"in namespace '%s'" % (
|
||||
str(pod_name),
|
||||
str(namespace)
|
||||
)
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
for volume in pod.volumes:
|
||||
if volume.pvcName is not None:
|
||||
volume_name = volume.name
|
||||
pvc_name = volume.pvcName
|
||||
pvc = kubecli.get_pvc_info(pvc_name, namespace)
|
||||
break
|
||||
if 'pvc' not in locals():
|
||||
logging.error(
|
||||
"Pod '%s' in namespace '%s' does not use a pvc" % (
|
||||
str(pod_name),
|
||||
str(namespace)
|
||||
)
|
||||
)
|
||||
sys.exit(1)
|
||||
logging.info("Volume name: %s" % volume_name)
|
||||
logging.info("PVC name: %s" % pvc_name)
|
||||
|
||||
# Get container name and mount path
|
||||
command = 'kubectl get pods %s -n %s -o json | jq -r ".spec.containers"' % (
|
||||
str(pod_name),
|
||||
str(namespace),
|
||||
)
|
||||
logging.debug("Get mount path command:\n %s" % command)
|
||||
volume_mounts_list = runcommand.invoke(command, 60).rstrip().replace("\n]\n[\n", ",\n")
|
||||
volume_mounts_list_json = json.loads(volume_mounts_list)
|
||||
for entry in volume_mounts_list_json:
|
||||
for vol in entry["volumeMounts"]:
|
||||
if vol["name"] == volume_name:
|
||||
mount_path = vol["mountPath"]
|
||||
container_name = entry["name"]
|
||||
for container in pod.containers:
|
||||
for vol in container.volumeMounts:
|
||||
if vol.name == volume_name:
|
||||
mount_path = vol.mountPath
|
||||
container_name = container.name
|
||||
break
|
||||
logging.info("Container path: %s" % container_name)
|
||||
logging.info("Mount path: %s" % mount_path)
|
||||
|
||||
# Get PVC capacity
|
||||
command = "kubectl describe pvc %s -n %s | grep \"Capacity:\" | grep -Eo '[^: ]*$'" % (
|
||||
str(pvc_name),
|
||||
str(namespace),
|
||||
)
|
||||
pvc_capacity = runcommand.invoke(
|
||||
command,
|
||||
60,
|
||||
).rstrip()
|
||||
logging.debug("Get PVC capacity command:\n %s" % command)
|
||||
pvc_capacity_bytes = toKbytes(pvc_capacity)
|
||||
logging.info("PVC capacity: %s KB" % pvc_capacity_bytes)
|
||||
|
||||
# Get used bytes in PVC
|
||||
command = "du -sk %s | grep -Eo '^[0-9]*'" % (str(mount_path))
|
||||
logging.debug("Get used bytes in PVC command:\n %s" % command)
|
||||
pvc_used = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
|
||||
logging.info("PVC used: %s KB" % pvc_used)
|
||||
# Get PVC capacity and used bytes
|
||||
command = "df %s -B 1024 | sed 1d" % (str(mount_path))
|
||||
command_output = (
|
||||
kubecli.exec_cmd_in_pod(
|
||||
command,
|
||||
pod_name,
|
||||
namespace,
|
||||
container_name,
|
||||
"sh"
|
||||
)
|
||||
).split()
|
||||
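# df output columns (after the header is stripped by 'sed 1d') are:
# filesystem, 1K-blocks, used, available, use%, mount point -- so index 2
# is the used space and index 3 the available space, both in KB.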
pvc_used_kb = int(command_output[2])
|
||||
pvc_capacity_kb = pvc_used_kb + int(command_output[3])
|
||||
logging.info("PVC used: %s KB" % pvc_used_kb)
|
||||
logging.info("PVC capacity: %s KB" % pvc_capacity_kb)
|
||||
|
||||
# Check valid fill percentage
|
||||
current_fill_percentage = float(pvc_used) / float(pvc_capacity_bytes)
|
||||
if not (current_fill_percentage * 100 < float(target_fill_percentage) <= 99):
|
||||
current_fill_percentage = pvc_used_kb / pvc_capacity_kb
|
||||
if not (
|
||||
current_fill_percentage * 100
|
||||
< float(target_fill_percentage)
|
||||
<= 99
|
||||
):
|
||||
logging.error(
|
||||
"""
|
||||
Target fill percentage (%.2f%%) is lower than current fill percentage (%.2f%%)
|
||||
or higher than 99%%
|
||||
"""
|
||||
% (target_fill_percentage, current_fill_percentage * 100)
|
||||
"Target fill percentage (%.2f%%) is lower than "
|
||||
"current fill percentage (%.2f%%) "
|
||||
"or higher than 99%%" % (
|
||||
target_fill_percentage,
|
||||
current_fill_percentage * 100
|
||||
)
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
# Calculate file size
|
||||
file_size = int((float(target_fill_percentage / 100) * float(pvc_capacity_bytes)) - float(pvc_used))
|
||||
logging.debug("File size: %s KB" % file_size)
|
||||
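# Size of the temp file needed to reach the target utilisation:
# (target percentage of total capacity) minus what is already used, in KB.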
file_size_kb = int(
|
||||
(
|
||||
float(
|
||||
target_fill_percentage / 100
|
||||
) * float(pvc_capacity_kb)
|
||||
) - float(pvc_used_kb)
|
||||
)
|
||||
logging.debug("File size: %s KB" % file_size_kb)
|
||||
|
||||
file_name = "kraken.tmp"
|
||||
logging.info(
|
||||
"Creating %s file, %s KB size, in pod %s at %s (ns %s)"
|
||||
% (str(file_name), str(file_size), str(pod_name), str(mount_path), str(namespace))
|
||||
% (
|
||||
str(file_name),
|
||||
str(file_size_kb),
|
||||
str(pod_name),
|
||||
str(mount_path),
|
||||
str(namespace)
|
||||
)
|
||||
)
|
||||
|
||||
start_time = int(time.time())
|
||||
# Create temp file in the PVC
|
||||
full_path = "%s/%s" % (str(mount_path), str(file_name))
|
||||
command = "dd bs=1024 count=%s </dev/urandom >%s" % (str(file_size), str(full_path))
|
||||
logging.debug("Create temp file in the PVC command:\n %s" % command)
|
||||
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
|
||||
logging.info("\n" + str(response))
|
||||
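# fallocate expects a size in bytes, so the kilobyte figure is converted
# with shell arithmetic ($((file_size_kb*1024))).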
command = "fallocate -l $((%s*1024)) %s" % (
|
||||
str(file_size_kb),
|
||||
str(full_path)
|
||||
)
|
||||
logging.debug(
|
||||
"Create temp file in the PVC command:\n %s" % command
|
||||
)
|
||||
kubecli.exec_cmd_in_pod(
|
||||
command, pod_name, namespace, container_name, "sh"
|
||||
)
|
||||
|
||||
# Check if file is created
|
||||
command = "ls %s" % (str(mount_path))
|
||||
command = "ls -lh %s" % (str(mount_path))
|
||||
logging.debug("Check file is created command:\n %s" % command)
|
||||
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
|
||||
response = kubecli.exec_cmd_in_pod(
|
||||
command, pod_name, namespace, container_name, "sh"
|
||||
)
|
||||
logging.info("\n" + str(response))
|
||||
if str(file_name).lower() in str(response).lower():
|
||||
logging.info("%s file successfully created" % (str(full_path)))
|
||||
logging.info(
|
||||
"%s file successfully created" % (str(full_path))
|
||||
)
|
||||
else:
|
||||
logging.error("Failed to create tmp file with %s size" % (str(file_size)))
|
||||
remove_temp_file(file_name, full_path, pod_name, namespace, container_name, mount_path, file_size)
|
||||
logging.error(
|
||||
"Failed to create tmp file with %s size" % (
|
||||
str(file_size_kb)
|
||||
)
|
||||
)
|
||||
remove_temp_file(
|
||||
file_name,
|
||||
full_path,
|
||||
pod_name,
|
||||
namespace,
|
||||
container_name,
|
||||
mount_path,
|
||||
file_size_kb
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
# Wait for the specified duration
|
||||
logging.info("Waiting for the specified duration in the config: %ss" % (duration))
|
||||
logging.info(
|
||||
"Waiting for the specified duration in the config: %ss" % (
|
||||
duration
|
||||
)
|
||||
)
|
||||
time.sleep(duration)
|
||||
logging.info("Finish waiting")
|
||||
|
||||
remove_temp_file(file_name, full_path, pod_name, namespace, container_name, mount_path, file_size)
|
||||
remove_temp_file(
|
||||
file_name,
|
||||
full_path,
|
||||
pod_name,
|
||||
namespace,
|
||||
container_name,
|
||||
mount_path,
|
||||
file_size_kb
|
||||
)
|
||||
|
||||
end_time = int(time.time())
|
||||
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
|
||||
cerberus.publish_kraken_status(
|
||||
config,
|
||||
failed_post_scenarios,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
|
||||
|
||||
def remove_temp_file(file_name, full_path, pod_name, namespace, container_name, mount_path, file_size):
|
||||
command = "rm %s" % (str(full_path))
|
||||
def remove_temp_file(
|
||||
file_name,
|
||||
full_path,
|
||||
pod_name,
|
||||
namespace,
|
||||
container_name,
|
||||
mount_path,
|
||||
file_size_kb
|
||||
):
|
||||
command = "rm -f %s" % (str(full_path))
|
||||
logging.debug("Remove temp file from the PVC command:\n %s" % command)
|
||||
kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
|
||||
command = "ls %s" % (str(mount_path))
|
||||
command = "ls -lh %s" % (str(mount_path))
|
||||
logging.debug("Check temp file is removed command:\n %s" % command)
|
||||
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name, "sh")
|
||||
response = kubecli.exec_cmd_in_pod(
|
||||
command,
|
||||
pod_name,
|
||||
namespace,
|
||||
container_name,
|
||||
"sh"
|
||||
)
|
||||
logging.info("\n" + str(response))
|
||||
if not (str(file_name).lower() in str(response).lower()):
|
||||
logging.info("Temp file successfully removed")
|
||||
else:
|
||||
logging.error("Failed to delete tmp file with %s size" % (str(file_size)))
|
||||
logging.error(
|
||||
"Failed to delete tmp file with %s size" % (str(file_size_kb))
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def toKbytes(value):
|
||||
if not re.match("^[0-9]+[K|M|G|T]i$", value):
|
||||
logging.error("PVC capacity %s does not match expression regexp '^[0-9]+[K|M|G|T]i$'")
|
||||
logging.error(
|
||||
"PVC capacity %s does not match expression "
|
||||
"regexp '^[0-9]+[K|M|G|T]i$'"
|
||||
)
|
||||
sys.exit(1)
|
||||
unit = {"K": 0, "M": 1, "G": 2, "T": 3}
|
||||
base = 1024 if ("i" in value) else 1000
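# The hunk ends here; a plausible completion of the conversion (an
# assumption, not shown in this diff) would be:
#     return int(value[:-2]) * base ** unit[value[-2]]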
|
||||
|
||||
@@ -1,16 +0,0 @@
|
||||
from typing import List, Dict
|
||||
|
||||
from kraken.scenarios.base import Scenario
|
||||
from kraken.scenarios.runner import ScenarioRunnerConfig
|
||||
|
||||
|
||||
class Loader:
|
||||
def __init__(self, scenarios: List[Scenario]):
|
||||
self.scenarios = scenarios
|
||||
|
||||
def load(self, data: Dict) -> ScenarioRunnerConfig:
|
||||
"""
|
||||
This function loads data from a dictionary and produces a scenario runner config. It uses the scenarios provided
|
||||
when instantiating the loader.
|
||||
"""
|
||||
|
||||
@@ -1,28 +0,0 @@
|
||||
from dataclasses import dataclass
|
||||
from typing import List
|
||||
|
||||
from kraken.scenarios import base
|
||||
|
||||
from kraken.scenarios.health import HealthChecker
|
||||
|
||||
|
||||
@dataclass
|
||||
class ScenarioRunnerConfig:
|
||||
iterations: int
|
||||
steps: List[base.ScenarioConfig]
|
||||
|
||||
|
||||
class ScenarioRunner:
|
||||
"""
|
||||
This class provides the services to load a scenario configuration and iterate over the scenarios, while
|
||||
observing the health checks.
|
||||
"""
|
||||
|
||||
def __init__(self, scenarios: List[base.Scenario], health_checker: HealthChecker):
|
||||
self._scenarios = scenarios
|
||||
self._health_checker = health_checker
|
||||
|
||||
def run(self, config: ScenarioRunnerConfig):
|
||||
"""
|
||||
This function runs a list of scenarios described in the configuration.
|
||||
"""
|
||||
@@ -1,61 +0,0 @@
|
||||
from typing import TypeVar, Generic, Dict
|
||||
|
||||
from kraken.scenarios.kube import Client
|
||||
from abc import ABC, abstractmethod
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class ScenarioConfig(ABC):
|
||||
"""
|
||||
ScenarioConfig is a generic base class for configurations for individual scenarios. Each scenario should define
|
||||
its own configuration classes.
|
||||
"""
|
||||
|
||||
@abstractmethod
|
||||
def from_dict(self, data: Dict) -> None:
|
||||
"""
|
||||
from_dict loads the configuration from a dict. It is mainly used to load JSON data into the scenario
|
||||
configuration.
|
||||
"""
|
||||
|
||||
@abstractmethod
|
||||
def validate(self) -> None:
|
||||
"""
|
||||
validate is a function that validates all data on the scenario configuration. If the scenario configuration
|
||||
is invalid an Exception should be thrown.
|
||||
"""
|
||||
pass
|
||||
|
||||
|
||||
T = TypeVar('T', bound=ScenarioConfig)
|
||||
|
||||
|
||||
class Scenario(Generic[T]):
|
||||
"""
|
||||
Scenario is a generic base class that provides a uniform run function to call in a loop. Scenario implementations
|
||||
should extend this class and accept their configuration via their initializer.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def create_config(self) -> T:
|
||||
"""
|
||||
create_config creates a new copy of the configuration structure that allows loading data from a dictionary
|
||||
and validating it.
|
||||
"""
|
||||
pass
|
||||
|
||||
def run(self, kube: Client, config: T) -> None:
|
||||
"""
|
||||
run is a function that is called when the scenario should be run. A Kubernetes client implementation will be
|
||||
passed. The scenario should execute and return immediately. If the scenario fails, an Exception should be
|
||||
thrown.
|
||||
"""
|
||||
pass
|
||||
|
||||
|
||||
class TimeoutException(Exception):
|
||||
"""
|
||||
TimeoutException is an exception thrown when a scenario has a timeout waiting for a condition to happen.
|
||||
"""
|
||||
pass
|
||||
@@ -1,96 +0,0 @@
|
||||
import logging
|
||||
import random
|
||||
import re
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Dict, List
|
||||
|
||||
from kraken.scenarios import base
|
||||
from kraken.scenarios.base import ScenarioConfig, Scenario
|
||||
from kraken.scenarios.kube import Client, Pod, NotFoundException
|
||||
|
||||
|
||||
@dataclass
|
||||
class PodScenarioConfig(ScenarioConfig):
|
||||
"""
|
||||
PodScenarioConfig is a configuration structure specific to pod scenarios. It describes which pod from which
|
||||
namespace(s) to select for killing and how many pods to kill.
|
||||
"""
|
||||
|
||||
name_pattern: str
|
||||
namespace_pattern: str
|
||||
label_selector: str
|
||||
kill: int
|
||||
|
||||
def from_dict(self, data: Dict) -> None:
|
||||
self.name_pattern = data.get("name_pattern")
|
||||
self.namespace_pattern = data.get("namespace_pattern")
|
||||
self.label_selector = data.get("label_selector")
|
||||
self.kill = data.get("kill")
|
||||
|
||||
def validate(self) -> None:
|
||||
re.compile(self.name_pattern)
|
||||
re.compile(self.namespace_pattern)
|
||||
if self.kill < 1:
|
||||
raise Exception("Invalid value for 'kill': %d" % self.kill)
|
||||
|
||||
def namespace_regexp(self) -> re.Pattern:
|
||||
return re.compile(self.namespace_pattern)
|
||||
|
||||
def name_regexp(self) -> re.Pattern:
|
||||
return re.compile(self.name_pattern)
|
||||
|
||||
|
||||
class PodScenario(Scenario[PodScenarioConfig]):
|
||||
"""
|
||||
PodScenario is a scenario that tests the stability of a Kubernetes cluster by killing one or more pods based on the
|
||||
PodScenarioConfig.
|
||||
"""
|
||||
|
||||
def __init__(self, logger: logging.Logger):
|
||||
self.logger = logger
|
||||
|
||||
def create_config(self) -> PodScenarioConfig:
|
||||
return PodScenarioConfig(
|
||||
name_pattern=".*",
|
||||
namespace_pattern=".*",
|
||||
label_selector="",
|
||||
kill=1,
|
||||
)
|
||||
|
||||
def run(self, kube: Client, config: PodScenarioConfig):
|
||||
pod_candidates: List[Pod] = []
|
||||
namespace_re = config.namespace_regexp()
|
||||
name_re = config.name_regexp()
|
||||
|
||||
self.logger.info("Listing all pods to determine viable pods to kill...")
|
||||
for pod in kube.list_all_pods(label_selector=config.label_selector):
|
||||
if namespace_re.match(pod.namespace) and name_re.match(pod.name):
|
||||
pod_candidates.append(pod)
|
||||
random.shuffle(pod_candidates)
|
||||
removed_pod: List[Pod] = []
|
||||
pods_to_kill = min(config.kill, len(pod_candidates))
|
||||
|
||||
self.logger.info("Killing %d pods...", pods_to_kill)
|
||||
for i in range(pods_to_kill):
|
||||
pod = pod_candidates[i]
|
||||
self.logger.info("Killing pod %s...", pod.name)
|
||||
removed_pod.append(pod)
|
||||
kube.remove_pod(pod.name, pod.namespace)
|
||||
|
||||
self.logger.info("Waiting for pods to be removed...")
|
||||
for i in range(60):
|
||||
time.sleep(1)
|
||||
for pod in removed_pod:
|
||||
try:
|
||||
kube.get_pod(pod.name, pod.namespace)
|
||||
self.logger.info("Pod %s still exists...", pod.name)
|
||||
except NotFoundException:
|
||||
self.logger.info("Pod %s is now removed.", pod.name)
|
||||
removed_pod.remove(pod)
|
||||
if len(removed_pod) == 0:
|
||||
self.logger.info("All pods removed, pod scenario complete.")
|
||||
return
|
||||
|
||||
self.logger.warning("Timeout waiting for pods to be removed.")
|
||||
raise base.TimeoutException("Timeout while waiting for pods to be removed.")
@@ -1,43 +0,0 @@
|
||||
import logging
|
||||
import sys
|
||||
import unittest
|
||||
|
||||
from kraken.scenarios import kube
|
||||
from kraken.scenarios.kube import Client, NotFoundException
|
||||
from kraken.scenarios.pod import PodScenario
|
||||
|
||||
|
||||
class TestPodScenario(unittest.TestCase):
|
||||
def test_run(self):
|
||||
"""
|
||||
This test creates a test pod and then runs the pod scenario restricting the run to that specific pod.
|
||||
"""
|
||||
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
|
||||
|
||||
c = Client()
|
||||
test_pod = c.create_test_pod()
|
||||
self.addCleanup(lambda: self._remove_test_pod(c, test_pod.name, test_pod.namespace))
|
||||
|
||||
scenario = PodScenario(logging.getLogger(__name__))
|
||||
config = scenario.create_config()
|
||||
config.kill = 1
|
||||
config.name_pattern = test_pod.name
|
||||
config.namespace_pattern = test_pod.namespace
|
||||
scenario.run(c, config)
|
||||
|
||||
try:
|
||||
c.get_pod(test_pod.name)
|
||||
self.fail("Getting the pod after a pod scenario run should result in a NotFoundException.")
|
||||
except NotFoundException:
|
||||
return
|
||||
|
||||
@staticmethod
|
||||
def _remove_test_pod(c: kube.Client, pod_name: str, pod_namespace: str):
|
||||
try:
|
||||
c.remove_pod(pod_name, pod_namespace)
|
||||
except NotFoundException:
|
||||
pass
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
@@ -5,13 +5,14 @@ import yaml
|
||||
import logging
|
||||
import time
|
||||
from multiprocessing.pool import ThreadPool
|
||||
import kraken.cerberus.setup as cerberus
|
||||
import kraken.kubernetes.client as kubecli
|
||||
import kraken.post_actions.actions as post_actions
|
||||
from kraken.node_actions.aws_node_scenarios import AWS
|
||||
from kraken.node_actions.openstack_node_scenarios import OPENSTACKCLOUD
|
||||
from kraken.node_actions.az_node_scenarios import Azure
|
||||
from kraken.node_actions.gcp_node_scenarios import GCP
|
||||
|
||||
from ..cerberus import setup as cerberus
|
||||
from ..kubernetes import client as kubecli
|
||||
from ..post_actions import actions as post_actions
|
||||
from ..node_actions.aws_node_scenarios import AWS
|
||||
from ..node_actions.openstack_node_scenarios import OPENSTACKCLOUD
|
||||
from ..node_actions.az_node_scenarios import Azure
|
||||
from ..node_actions.gcp_node_scenarios import GCP
|
||||
|
||||
|
||||
def multiprocess_nodes(cloud_object_function, nodes):
|
||||
@@ -53,7 +54,10 @@ def cluster_shut_down(shut_down_config):
|
||||
elif cloud_type.lower() in ["azure", "az"]:
|
||||
cloud_object = Azure()
|
||||
else:
|
||||
logging.error("Cloud type " + cloud_type + " is not currently supported for cluster shut down")
|
||||
logging.error(
|
||||
"Cloud type %s is not currently supported for cluster shut down" %
|
||||
cloud_type
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
nodes = kubecli.list_nodes()
|
||||
@@ -70,17 +74,28 @@ def cluster_shut_down(shut_down_config):
|
||||
while len(stopping_nodes) > 0:
|
||||
for node in stopping_nodes:
|
||||
if type(node) is tuple:
|
||||
node_status = cloud_object.wait_until_stopped(node[1], node[0], timeout)
|
||||
node_status = cloud_object.wait_until_stopped(
|
||||
node[1],
|
||||
node[0],
|
||||
timeout
|
||||
)
|
||||
else:
|
||||
node_status = cloud_object.wait_until_stopped(node, timeout)
|
||||
node_status = cloud_object.wait_until_stopped(
|
||||
node,
|
||||
timeout
|
||||
)
|
||||
|
||||
# Only want to remove node from stopping list when fully stopped/no error
|
||||
# Only want to remove node from stopping list
|
||||
# when fully stopped/no error
|
||||
if node_status:
|
||||
stopped_nodes.remove(node)
|
||||
|
||||
stopping_nodes = stopped_nodes.copy()
|
||||
|
||||
logging.info("Shutting down the cluster for the specified duration: %s" % (shut_down_duration))
|
||||
logging.info(
|
||||
"Shutting down the cluster for the specified duration: %s" %
|
||||
(shut_down_duration)
|
||||
)
|
||||
time.sleep(shut_down_duration)
|
||||
logging.info("Restarting the nodes")
|
||||
restarted_nodes = set(node_id)
|
||||
@@ -90,13 +105,22 @@ def cluster_shut_down(shut_down_config):
|
||||
while len(not_running_nodes) > 0:
|
||||
for node in not_running_nodes:
|
||||
if type(node) is tuple:
|
||||
node_status = cloud_object.wait_until_running(node[1], node[0], timeout)
|
||||
node_status = cloud_object.wait_until_running(
|
||||
node[1],
|
||||
node[0],
|
||||
timeout
|
||||
)
|
||||
else:
|
||||
node_status = cloud_object.wait_until_running(node, timeout)
|
||||
node_status = cloud_object.wait_until_running(
|
||||
node,
|
||||
timeout
|
||||
)
|
||||
if node_status:
|
||||
restarted_nodes.remove(node)
|
||||
not_running_nodes = restarted_nodes.copy()
|
||||
logging.info("Waiting for 150s to allow cluster component initialization")
|
||||
logging.info(
|
||||
"Waiting for 150s to allow cluster component initialization"
|
||||
)
|
||||
time.sleep(150)
|
||||
|
||||
logging.info("Successfully injected cluster_shut_down scenario!")
|
||||
@@ -111,13 +135,21 @@ def run(scenarios_list, config, wait_duration):
|
||||
pre_action_output = ""
|
||||
with open(shut_down_config[0], "r") as f:
|
||||
shut_down_config_yaml = yaml.full_load(f)
|
||||
shut_down_config_scenario = shut_down_config_yaml["cluster_shut_down_scenario"]
|
||||
shut_down_config_scenario = \
|
||||
shut_down_config_yaml["cluster_shut_down_scenario"]
|
||||
start_time = int(time.time())
|
||||
cluster_shut_down(shut_down_config_scenario)
|
||||
logging.info("Waiting for the specified duration: %s" % (wait_duration))
|
||||
logging.info(
|
||||
"Waiting for the specified duration: %s" % (wait_duration)
|
||||
)
|
||||
time.sleep(wait_duration)
|
||||
failed_post_scenarios = post_actions.check_recovery(
|
||||
"", shut_down_config, failed_post_scenarios, pre_action_output
|
||||
)
|
||||
end_time = int(time.time())
|
||||
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
|
||||
cerberus.publish_kraken_status(
|
||||
config,
|
||||
failed_post_scenarios,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
|
||||
@@ -1,23 +1,32 @@
|
||||
import datetime
|
||||
import time
|
||||
import logging
|
||||
import kraken.invoke.command as runcommand
|
||||
import kraken.kubernetes.client as kubecli
|
||||
import re
|
||||
import sys
|
||||
import kraken.cerberus.setup as cerberus
|
||||
import yaml
|
||||
import random
|
||||
|
||||
from ..cerberus import setup as cerberus
|
||||
from ..kubernetes import client as kubecli
|
||||
from ..invoke import command as runcommand
|
||||
|
||||
|
||||
def pod_exec(pod_name, command, namespace, container_name):
|
||||
i = 0
|
||||
for i in range(5):
|
||||
response = kubecli.exec_cmd_in_pod(command, pod_name, namespace, container_name)
|
||||
response = kubecli.exec_cmd_in_pod(
|
||||
command,
|
||||
pod_name,
|
||||
namespace,
|
||||
container_name
|
||||
)
|
||||
if not response:
|
||||
time.sleep(2)
|
||||
continue
|
||||
elif "unauthorized" in response.lower() or "authorization" in response.lower():
|
||||
elif (
|
||||
"unauthorized" in response.lower() or
|
||||
"authorization" in response.lower()
|
||||
):
|
||||
time.sleep(2)
|
||||
continue
|
||||
else:
|
||||
@@ -26,7 +35,9 @@ def pod_exec(pod_name, command, namespace, container_name):
|
||||
|
||||
|
||||
def node_debug(node_name, command):
|
||||
response = runcommand.invoke("oc debug node/" + node_name + " -- chroot /host " + command)
|
||||
response = runcommand.invoke(
|
||||
"oc debug node/" + node_name + " -- chroot /host " + command
|
||||
)
|
||||
return response
|
||||
|
||||
|
||||
@@ -37,9 +48,18 @@ def get_container_name(pod_name, namespace, container_name=""):
|
||||
if container_name in container_names:
|
||||
return container_name
|
||||
else:
|
||||
logging.error("Container name %s not an existing container in pod %s" % (container_name, pod_name))
|
||||
logging.error(
|
||||
"Container name %s not an existing container in pod %s" % (
|
||||
container_name,
|
||||
pod_name
|
||||
)
|
||||
)
|
||||
else:
|
||||
container_name = container_names[random.randint(0, len(container_names) - 1)]
|
||||
container_name = container_names[
|
||||
# random module here is not used for security/cryptographic
|
||||
# purposes
|
||||
random.randint(0, len(container_names) - 1) # nosec
|
||||
]
|
||||
return container_name
|
||||
|
||||
|
||||
@@ -55,7 +75,10 @@ def skew_time(scenario):
|
||||
node_names = []
|
||||
if "object_name" in scenario.keys() and scenario["object_name"]:
|
||||
node_names = scenario["object_name"]
|
||||
elif "label_selector" in scenario.keys() and scenario["label_selector"]:
|
||||
elif (
|
||||
"label_selector" in scenario.keys() and
|
||||
scenario["label_selector"]
|
||||
):
|
||||
node_names = kubecli.list_nodes(scenario["label_selector"])
|
||||
|
||||
for node in node_names:
|
||||
@@ -75,44 +98,79 @@ def skew_time(scenario):
|
||||
elif "namespace" in scenario.keys() and scenario["namespace"]:
|
||||
if "label_selector" not in scenario.keys():
|
||||
logging.info(
|
||||
"label_selector key not found, querying for all the pods in namespace: %s" % (scenario["namespace"])
|
||||
"label_selector key not found, querying for all the pods "
|
||||
"in namespace: %s" % (scenario["namespace"])
|
||||
)
|
||||
pod_names = kubecli.list_pods(scenario["namespace"])
|
||||
else:
|
||||
logging.info(
|
||||
"Querying for the pods matching the %s label_selector in namespace %s"
|
||||
"Querying for the pods matching the %s label_selector "
|
||||
"in namespace %s"
|
||||
% (scenario["label_selector"], scenario["namespace"])
|
||||
)
|
||||
pod_names = kubecli.list_pods(scenario["namespace"], scenario["label_selector"])
|
||||
pod_names = kubecli.list_pods(
|
||||
scenario["namespace"],
|
||||
scenario["label_selector"]
|
||||
)
|
||||
counter = 0
|
||||
for pod_name in pod_names:
|
||||
pod_names[counter] = [pod_name, scenario["namespace"]]
|
||||
counter += 1
|
||||
elif "label_selector" in scenario.keys() and scenario["label_selector"]:
|
||||
elif (
|
||||
"label_selector" in scenario.keys() and
|
||||
scenario["label_selector"]
|
||||
):
|
||||
pod_names = kubecli.get_all_pods(scenario["label_selector"])
|
||||
|
||||
if len(pod_names) == 0:
|
||||
logging.info("Cannot find pods matching the namespace/label_selector, please check")
|
||||
logging.info(
|
||||
"Cannot find pods matching the namespace/label_selector, "
|
||||
"please check"
|
||||
)
|
||||
sys.exit(1)
|
||||
pod_counter = 0
|
||||
for pod in pod_names:
|
||||
if len(pod) > 1:
|
||||
selected_container_name = get_container_name(pod[0], pod[1], container_name)
|
||||
pod_exec_response = pod_exec(pod[0], skew_command, pod[1], selected_container_name)
|
||||
selected_container_name = get_container_name(
|
||||
pod[0],
|
||||
pod[1],
|
||||
container_name
|
||||
)
|
||||
pod_exec_response = pod_exec(
|
||||
pod[0],
|
||||
skew_command,
|
||||
pod[1],
|
||||
selected_container_name
|
||||
)
|
||||
if pod_exec_response is False:
|
||||
logging.error(
|
||||
"Couldn't reset time on container %s in pod %s in namespace %s"
|
||||
"Couldn't reset time on container %s "
|
||||
"in pod %s in namespace %s"
|
||||
% (selected_container_name, pod[0], pod[1])
|
||||
)
|
||||
sys.exit(1)
|
||||
pod_names[pod_counter].append(selected_container_name)
|
||||
else:
|
||||
selected_container_name = get_container_name(pod, scenario["namespace"], container_name)
|
||||
pod_exec_response = pod_exec(pod, skew_command, scenario["namespace"], selected_container_name)
|
||||
selected_container_name = get_container_name(
|
||||
pod,
|
||||
scenario["namespace"],
|
||||
container_name
|
||||
)
|
||||
pod_exec_response = pod_exec(
|
||||
pod,
|
||||
skew_command,
|
||||
scenario["namespace"],
|
||||
selected_container_name
|
||||
)
|
||||
if pod_exec_response is False:
|
||||
logging.error(
|
||||
"Couldn't reset time on container %s in pod %s in namespace %s"
|
||||
% (selected_container_name, pod, scenario["namespace"])
|
||||
"Couldn't reset time on container "
|
||||
"%s in pod %s in namespace %s"
|
||||
% (
|
||||
selected_container_name,
|
||||
pod,
|
||||
scenario["namespace"]
|
||||
)
|
||||
)
|
||||
sys.exit(1)
|
||||
pod_names[pod_counter].append(selected_container_name)
|
||||
@@ -128,8 +186,9 @@ def parse_string_date(obj_datetime):
|
||||
obj_datetime = re.sub(r"\s\s+", " ", obj_datetime).strip()
|
||||
logging.info("Obj_date sub time " + str(obj_datetime))
|
||||
date_line = re.match(
|
||||
r"[\s\S\n]*\w{3} \w{3} \d{1,} \d{2}:\d{2}:\d{2} \w{3} \d{4}[\s\S\n]*", obj_datetime
|
||||
) # noqa
|
||||
r"[\s\S\n]*\w{3} \w{3} \d{1,} \d{2}:\d{2}:\d{2} \w{3} \d{4}[\s\S\n]*", # noqa
|
||||
obj_datetime
|
||||
)
|
||||
if date_line is not None:
|
||||
search_response = date_line.group().strip()
|
||||
logging.info("Search response: " + str(search_response))
|
||||
@@ -137,7 +196,9 @@ def parse_string_date(obj_datetime):
|
||||
else:
|
||||
return ""
|
||||
except Exception as e:
|
||||
logging.info("Exception %s when trying to parse string to date" % str(e))
|
||||
logging.info(
|
||||
"Exception %s when trying to parse string to date" % str(e)
|
||||
)
|
||||
return ""
|
||||
|
||||
|
||||
@@ -145,7 +206,10 @@ def parse_string_date(obj_datetime):
|
||||
def string_to_date(obj_datetime):
|
||||
obj_datetime = parse_string_date(obj_datetime)
|
||||
try:
|
||||
date_time_obj = datetime.datetime.strptime(obj_datetime, "%a %b %d %H:%M:%S %Z %Y")
|
||||
date_time_obj = datetime.datetime.strptime(
|
||||
obj_datetime,
|
||||
"%a %b %d %H:%M:%S %Z %Y"
|
||||
)
|
||||
return date_time_obj
|
||||
except Exception:
|
||||
logging.info("Couldn't parse string to datetime object")
|
||||
@@ -162,36 +226,66 @@ def check_date_time(object_type, names):
|
||||
node_datetime_string = node_debug(node_name, skew_command)
|
||||
node_datetime = string_to_date(node_datetime_string)
|
||||
counter = 0
|
||||
while not first_date_time < node_datetime < datetime.datetime.utcnow():
|
||||
while not (
|
||||
first_date_time < node_datetime < datetime.datetime.utcnow()
|
||||
):
|
||||
time.sleep(10)
|
||||
logging.info("Date/time on node %s still not reset, waiting 10 seconds and retrying" % node_name)
|
||||
logging.info(
|
||||
"Date/time on node %s still not reset, "
|
||||
"waiting 10 seconds and retrying" % node_name
|
||||
)
|
||||
node_datetime_string = node_debug(node_name, skew_command)
|
||||
node_datetime = string_to_date(node_datetime_string)
|
||||
counter += 1
|
||||
if counter > max_retries:
|
||||
logging.error("Date and time in node %s didn't reset properly" % node_name)
|
||||
logging.error(
|
||||
"Date and time in node %s didn't reset properly" %
|
||||
node_name
|
||||
)
|
||||
not_reset.append(node_name)
|
||||
break
|
||||
if counter < max_retries:
|
||||
logging.info("Date in node " + str(node_name) + " reset properly")
|
||||
logging.info(
|
||||
"Date in node " + str(node_name) + " reset properly"
|
||||
)
|
||||
elif object_type == "pod":
|
||||
for pod_name in names:
|
||||
first_date_time = datetime.datetime.utcnow()
|
||||
counter = 0
|
||||
pod_datetime_string = pod_exec(pod_name[0], skew_command, pod_name[1], pod_name[2])
|
||||
pod_datetime_string = pod_exec(
|
||||
pod_name[0],
|
||||
skew_command,
|
||||
pod_name[1],
|
||||
pod_name[2]
|
||||
)
|
||||
pod_datetime = string_to_date(pod_datetime_string)
|
||||
while not first_date_time < pod_datetime < datetime.datetime.utcnow():
|
||||
while not (
|
||||
first_date_time < pod_datetime < datetime.datetime.utcnow()
|
||||
):
|
||||
time.sleep(10)
|
||||
logging.info("Date/time on pod %s still not reset, waiting 10 seconds and retrying" % pod_name[0])
|
||||
pod_datetime = pod_exec(pod_name[0], skew_command, pod_name[1], pod_name[2])
|
||||
logging.info(
|
||||
"Date/time on pod %s still not reset, "
|
||||
"waiting 10 seconds and retrying" % pod_name[0]
|
||||
)
|
||||
pod_datetime = pod_exec(
|
||||
pod_name[0],
|
||||
skew_command,
|
||||
pod_name[1],
|
||||
pod_name[2]
|
||||
)
|
||||
pod_datetime = string_to_date(pod_datetime)
|
||||
counter += 1
|
||||
if counter > max_retries:
|
||||
logging.error("Date and time in pod %s didn't reset properly" % pod_name[0])
|
||||
logging.error(
|
||||
"Date and time in pod %s didn't reset properly" %
|
||||
pod_name[0]
|
||||
)
|
||||
not_reset.append(pod_name[0])
|
||||
break
|
||||
if counter < max_retries:
|
||||
logging.info("Date in pod " + str(pod_name[0]) + " reset properly")
|
||||
logging.info(
|
||||
"Date in pod " + str(pod_name[0]) + " reset properly"
|
||||
)
|
||||
return not_reset
|
||||
|
||||
|
||||
@@ -205,7 +299,14 @@ def run(scenarios_list, config, wait_duration):
|
||||
not_reset = check_date_time(object_type, object_names)
|
||||
if len(not_reset) > 0:
|
||||
logging.info("Object times were not reset")
|
||||
logging.info("Waiting for the specified duration: %s" % (wait_duration))
|
||||
logging.info(
|
||||
"Waiting for the specified duration: %s" % (wait_duration)
|
||||
)
|
||||
time.sleep(wait_duration)
|
||||
end_time = int(time.time())
|
||||
cerberus.publish_kraken_status(config, not_reset, start_time, end_time)
|
||||
cerberus.publish_kraken_status(
|
||||
config,
|
||||
not_reset,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
|
||||
@@ -2,12 +2,15 @@ import yaml
|
||||
import sys
|
||||
import logging
|
||||
import time
|
||||
from kraken.node_actions.aws_node_scenarios import AWS
|
||||
import kraken.cerberus.setup as cerberus
|
||||
from ..node_actions.aws_node_scenarios import AWS
|
||||
from ..cerberus import setup as cerberus
|
||||
|
||||
|
||||
# filters the subnet of interest and applies the network acl to create zone outage
|
||||
def run(scenarios_list, config, wait_duration):
|
||||
"""
|
||||
filters the subnet of interest and applies the network acl
|
||||
to create zone outage
|
||||
"""
|
||||
failed_post_scenarios = ""
|
||||
for zone_outage_config in scenarios_list:
|
||||
if len(zone_outage_config) > 1:
|
||||
@@ -24,7 +27,11 @@ def run(scenarios_list, config, wait_duration):
|
||||
if cloud_type.lower() == "aws":
|
||||
cloud_object = AWS()
|
||||
else:
|
||||
logging.error("Cloud type " + cloud_type + " is not currently supported for zone outage scenarios")
|
||||
logging.error(
|
||||
"Cloud type %s is not currently supported for "
|
||||
"zone outage scenarios"
|
||||
% cloud_type
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
start_time = int(time.time())
|
||||
@@ -32,39 +39,62 @@ def run(scenarios_list, config, wait_duration):
|
||||
for subnet_id in subnet_ids:
|
||||
logging.info("Targeting subnet_id")
|
||||
network_association_ids = []
|
||||
associations, original_acl_id = cloud_object.describe_network_acls(vpc_id, subnet_id)
|
||||
associations, original_acl_id = \
|
||||
cloud_object.describe_network_acls(vpc_id, subnet_id)
|
||||
for entry in associations:
|
||||
if entry["SubnetId"] == subnet_id:
|
||||
network_association_ids.append(entry["NetworkAclAssociationId"])
|
||||
network_association_ids.append(
|
||||
entry["NetworkAclAssociationId"]
|
||||
)
|
||||
logging.info(
|
||||
"Network association ids associated with the subnet %s: %s"
|
||||
"Network association ids associated with "
|
||||
"the subnet %s: %s"
|
||||
% (subnet_id, network_association_ids)
|
||||
)
|
||||
acl_id = cloud_object.create_default_network_acl(vpc_id)
|
||||
new_association_id = cloud_object.replace_network_acl_association(
|
||||
network_association_ids[0], acl_id
|
||||
)
|
||||
new_association_id = \
|
||||
cloud_object.replace_network_acl_association(
|
||||
network_association_ids[0], acl_id
|
||||
)
|
||||
|
||||
# capture the orginal_acl_id, created_acl_id and new association_id to use during the recovery
|
||||
# capture the original_acl_id, created_acl_id and
|
||||
# new association_id to use during the recovery
|
||||
ids[new_association_id] = original_acl_id
|
||||
acl_ids_created.append(acl_id)
|
||||
|
||||
# wait for the specified duration
|
||||
logging.info("Waiting for the specified duration in the config: %s" % (duration))
|
||||
logging.info(
|
||||
"Waiting for the specified duration "
|
||||
"in the config: %s" % (duration)
|
||||
)
|
||||
time.sleep(duration)
|
||||
|
||||
# replace the applied acl with the previous acl in use
|
||||
for new_association_id, original_acl_id in ids.items():
|
||||
cloud_object.replace_network_acl_association(new_association_id, original_acl_id)
|
||||
logging.info("Wating for 60 seconds to make sure the changes are in place")
|
||||
cloud_object.replace_network_acl_association(
|
||||
new_association_id,
|
||||
original_acl_id
|
||||
)
|
||||
logging.info(
|
||||
"Wating for 60 seconds to make sure "
|
||||
"the changes are in place"
|
||||
)
|
||||
time.sleep(60)
|
||||
|
||||
# delete the network acl created for the run
|
||||
for acl_id in acl_ids_created:
|
||||
cloud_object.delete_network_acl(acl_id)
|
||||
|
||||
logging.info("End of scenario. Waiting for the specified duration: %s" % (wait_duration))
|
||||
logging.info(
|
||||
"End of scenario. "
|
||||
"Waiting for the specified duration: %s" % (wait_duration)
|
||||
)
|
||||
time.sleep(wait_duration)
|
||||
|
||||
end_time = int(time.time())
|
||||
cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
|
||||
cerberus.publish_kraken_status(
|
||||
config,
|
||||
failed_post_scenarios,
|
||||
start_time,
|
||||
end_time
|
||||
)
|
||||
|
||||
@@ -12,7 +12,7 @@ oauth2client>=4.1.3
|
||||
python-openstackclient
|
||||
gitpython
|
||||
paramiko
|
||||
setuptools
|
||||
setuptools==63.4.1
|
||||
openshift-client
|
||||
python-ipmi
|
||||
podman-compose
|
||||
@@ -22,4 +22,5 @@ itsdangerous==2.0.1
|
||||
werkzeug==2.0.3
|
||||
aliyun-python-sdk-core-v3
|
||||
aliyun-python-sdk-ecs
|
||||
cryptography==36.0.2 # Remove once https://github.com/paramiko/paramiko/issues/2038 gets fixed.
|
||||
arcaflow-plugin-sdk==0.3.0
|
||||
git+https://github.com/vmware/vsphere-automation-sdk-python.git
|
||||
244
run_kraken.py
@@ -2,8 +2,6 @@
|
||||
|
||||
import os
|
||||
import sys
|
||||
from typing import List
|
||||
|
||||
import yaml
|
||||
import logging
|
||||
import optparse
|
||||
@@ -24,14 +22,13 @@ import kraken.application_outage.actions as application_outage
|
||||
import kraken.pvc.pvc_scenario as pvc_scenario
|
||||
import kraken.network_chaos.actions as network_chaos
|
||||
import server as server
|
||||
from kraken.scenarios.base import Scenario
|
||||
from kraken.scenarios.pod import PodScenario
|
||||
from kraken.scenarios.runner import ScenarioRunner
|
||||
from kraken import plugins
|
||||
|
||||
|
||||
def publish_kraken_status(status):
|
||||
with open("/tmp/kraken_status", "w+") as file:
|
||||
file.write(str(status))
|
||||
KUBE_BURNER_URL = (
|
||||
"https://github.com/cloud-bulldozer/kube-burner/"
|
||||
"releases/download/v{version}/kube-burner-{version}-Linux-x86_64.tar.gz"
|
||||
)
|
||||
KUBE_BURNER_VERSION = "0.9.1"
|
||||
|
||||
|
||||
# Main function
|
||||
@@ -48,35 +45,60 @@ def main(cfg):
|
||||
distribution = config["kraken"].get("distribution", "openshift")
|
||||
kubeconfig_path = config["kraken"].get("kubeconfig_path", "")
|
||||
chaos_scenarios = config["kraken"].get("chaos_scenarios", [])
|
||||
publish_running_status = config["kraken"].get("publish_kraken_status", False)
|
||||
publish_running_status = config["kraken"].get(
|
||||
"publish_kraken_status", False
|
||||
)
|
||||
port = config["kraken"].get("port", "8081")
|
||||
run_signal = config["kraken"].get("signal_state", "RUN")
|
||||
litmus_install = config["kraken"].get("litmus_install", True)
|
||||
litmus_version = config["kraken"].get("litmus_version", "v1.9.1")
|
||||
litmus_uninstall = config["kraken"].get("litmus_uninstall", False)
|
||||
litmus_uninstall_before_run = config["kraken"].get("litmus_uninstall_before_run", True)
|
||||
litmus_uninstall_before_run = config["kraken"].get(
|
||||
"litmus_uninstall_before_run", True
|
||||
)
|
||||
wait_duration = config["tunings"].get("wait_duration", 60)
|
||||
iterations = config["tunings"].get("iterations", 1)
|
||||
daemon_mode = config["tunings"].get("daemon_mode", False)
|
||||
deploy_performance_dashboards = config["performance_monitoring"].get("deploy_dashboards", False)
|
||||
deploy_performance_dashboards = config["performance_monitoring"].get(
|
||||
"deploy_dashboards", False
|
||||
)
|
||||
dashboard_repo = config["performance_monitoring"].get(
|
||||
"repo", "https://github.com/cloud-bulldozer/performance-dashboards.git"
|
||||
) # noqa
|
||||
capture_metrics = config["performance_monitoring"].get("capture_metrics", False)
|
||||
"repo",
|
||||
"https://github.com/cloud-bulldozer/performance-dashboards.git"
|
||||
)
|
||||
capture_metrics = config["performance_monitoring"].get(
|
||||
"capture_metrics", False
|
||||
)
|
||||
kube_burner_url = config["performance_monitoring"].get(
|
||||
"kube_burner_binary_url",
|
||||
"https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz", # noqa
|
||||
KUBE_BURNER_URL.format(version=KUBE_BURNER_VERSION)
|
||||
)
|
||||
config_path = config["performance_monitoring"].get(
|
||||
"config_path", "config/kube_burner.yaml"
|
||||
)
|
||||
metrics_profile = config["performance_monitoring"].get(
|
||||
"metrics_profile_path", "config/metrics-aggregated.yaml"
|
||||
)
|
||||
prometheus_url = config["performance_monitoring"].get(
|
||||
"prometheus_url", ""
|
||||
)
|
||||
prometheus_bearer_token = config["performance_monitoring"].get(
|
||||
"prometheus_bearer_token", ""
|
||||
)
|
||||
config_path = config["performance_monitoring"].get("config_path", "config/kube_burner.yaml")
|
||||
metrics_profile = config["performance_monitoring"].get("metrics_profile_path", "config/metrics-aggregated.yaml")
|
||||
prometheus_url = config["performance_monitoring"].get("prometheus_url", "")
|
||||
prometheus_bearer_token = config["performance_monitoring"].get("prometheus_bearer_token", "")
|
||||
run_uuid = config["performance_monitoring"].get("uuid", "")
|
||||
enable_alerts = config["performance_monitoring"].get("enable_alerts", False)
|
||||
alert_profile = config["performance_monitoring"].get("alert_profile", "")
|
||||
enable_alerts = config["performance_monitoring"].get(
|
||||
"enable_alerts", False
|
||||
)
|
||||
alert_profile = config["performance_monitoring"].get(
|
||||
"alert_profile", ""
|
||||
)
|
||||
|
||||
# Initialize clients
|
||||
if not os.path.isfile(kubeconfig_path):
|
||||
logging.error("Cannot read the kubeconfig file at %s, please check" % kubeconfig_path)
|
||||
logging.error(
|
||||
"Cannot read the kubeconfig file at %s, please check" %
|
||||
kubeconfig_path
|
||||
)
|
||||
sys.exit(1)
|
||||
logging.info("Initializing client to talk to the Kubernetes cluster")
|
||||
os.environ["KUBECONFIG"] = str(kubeconfig_path)
|
||||
@@ -87,17 +109,25 @@ def main(cfg):
|
||||
|
||||
# Set up kraken url to track signal
|
||||
if not 0 <= int(port) <= 65535:
|
||||
logging.info("Using port 8081 as %s isn't a valid port number" % (port))
|
||||
logging.info(
|
||||
"Using port 8081 as %s isn't a valid port number" % (port)
|
||||
)
|
||||
port = 8081
|
||||
address = ("0.0.0.0", port)
|
||||
|
||||
# If publish_running_status is False this should keep us going in our loop below
|
||||
# If publish_running_status is False this should keep us going
|
||||
# in our loop below
|
||||
if publish_running_status:
|
||||
server_address = address[0]
|
||||
port = address[1]
|
||||
logging.info(
|
||||
"Publishing kraken status at http://%s:%s" % (
|
||||
server_address,
|
||||
port
|
||||
)
|
||||
)
|
||||
logging.info("Publishing kraken status at http://%s:%s" % (server_address, port))
|
||||
server.start_server(address)
|
||||
publish_kraken_status(run_signal)
|
||||
server.start_server(address, run_signal)
|
||||
|
||||
# Cluster info
|
||||
logging.info("Fetching cluster info")
|
||||
@@ -115,35 +145,36 @@ def main(cfg):
|
||||
|
||||
# Generate uuid for the run
|
||||
if run_uuid:
|
||||
logging.info("Using the uuid defined by the user for the run: %s" % run_uuid)
|
||||
logging.info(
|
||||
"Using the uuid defined by the user for the run: %s" % run_uuid
|
||||
)
|
||||
else:
|
||||
run_uuid = str(uuid.uuid4())
|
||||
logging.info("Generated a uuid for the run: %s" % run_uuid)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
scenarios: List[Scenario] = [
|
||||
PodScenario(logger),
|
||||
]
|
||||
health_checker = CerberusHealthChecker(config)
|
||||
runner = ScenarioRunner(scenarios, health_checker)
|
||||
# Initialize the start iteration to 0
|
||||
iteration = 0
|
||||
|
||||
# Set the number of iterations to loop to infinity if daemon mode is
|
||||
# enabled or else set it to the provided iterations count in the config
|
||||
if daemon_mode:
|
||||
logging.info("Daemon mode enabled, kraken will cause chaos forever\n")
|
||||
logging.info(
|
||||
"Daemon mode enabled, kraken will cause chaos forever\n"
|
||||
)
|
||||
logging.info("Ignoring the iterations set")
|
||||
iterations = float("inf")
|
||||
else:
|
||||
logging.info("Daemon mode not enabled, will run through %s iterations\n" % str(iterations))
|
||||
logging.info(
|
||||
"Daemon mode not enabled, will run through %s iterations\n" %
|
||||
str(iterations)
|
||||
)
|
||||
iterations = int(iterations)
|
||||
|
||||
failed_post_scenarios = []
|
||||
litmus_installed = False
|
||||
|
||||
# Capture the start time
|
||||
start_time = int(time.time())
|
||||
litmus_installed = False
|
||||
|
||||
# Loop to run the chaos starts here
|
||||
while int(iteration) < iterations and run_signal != "STOP":
|
||||
@@ -156,7 +187,8 @@ def main(cfg):
|
||||
if run_signal == "PAUSE":
|
||||
while publish_running_status and run_signal == "PAUSE":
|
||||
logging.info(
|
||||
"Pausing Kraken run, waiting for %s seconds and will re-poll signal"
|
||||
"Pausing Kraken run, waiting for %s seconds"
|
||||
" and will re-poll signal"
|
||||
% str(wait_duration)
|
||||
)
|
||||
time.sleep(wait_duration)
|
||||
@@ -169,28 +201,53 @@ def main(cfg):
|
||||
if scenarios_list:
|
||||
# Inject pod chaos scenarios specified in the config
|
||||
if scenario_type == "pod_scenarios":
|
||||
logging.info("Running pod scenarios")
|
||||
failed_post_scenarios = pod_scenarios.run(
|
||||
kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_duration
|
||||
logging.error(
|
||||
"Pod scenarios have been removed, please use "
|
||||
"plugin_scenarios with the "
|
||||
"kill-pods configuration instead."
|
||||
)
|
||||
sys.exit(1)
|
||||
elif scenario_type == "plugin_scenarios":
|
||||
failed_post_scenarios = plugins.run(
|
||||
scenarios_list,
|
||||
kubeconfig_path,
|
||||
failed_post_scenarios
|
||||
)
|
||||
elif scenario_type == "container_scenarios":
|
||||
logging.info("Running container scenarios")
|
||||
failed_post_scenarios = pod_scenarios.container_run(
|
||||
kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_duration
|
||||
)
|
||||
failed_post_scenarios = \
|
||||
pod_scenarios.container_run(
|
||||
kubeconfig_path,
|
||||
scenarios_list,
|
||||
config,
|
||||
failed_post_scenarios,
|
||||
wait_duration
|
||||
)
|
||||
|
||||
# Inject node chaos scenarios specified in the config
|
||||
elif scenario_type == "node_scenarios":
|
||||
logging.info("Running node scenarios")
|
||||
nodeaction.run(scenarios_list, config, wait_duration)
|
||||
nodeaction.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration
|
||||
)
|
||||
|
||||
# Inject time skew chaos scenarios specified in the config
|
||||
# Inject time skew chaos scenarios specified
|
||||
# in the config
|
||||
elif scenario_type == "time_scenarios":
|
||||
if distribution == "openshift":
|
||||
logging.info("Running time skew scenarios")
|
||||
time_actions.run(scenarios_list, config, wait_duration)
|
||||
time_actions.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration
|
||||
)
|
||||
else:
|
||||
logging.error("Litmus scenarios are currently supported only on openshift")
|
||||
logging.error(
|
||||
"Litmus scenarios are currently "
|
||||
"supported only on openshift"
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
# Inject litmus based chaos scenarios
|
||||
@@ -198,46 +255,79 @@ def main(cfg):
|
||||
if distribution == "openshift":
|
||||
logging.info("Running litmus scenarios")
|
||||
litmus_namespace = "litmus"
|
||||
if not litmus_installed:
|
||||
# Remove Litmus resources before running the scenarios
|
||||
common_litmus.delete_chaos(litmus_namespace)
|
||||
common_litmus.delete_chaos_experiments(litmus_namespace)
|
||||
if litmus_uninstall_before_run:
|
||||
common_litmus.uninstall_litmus(litmus_version, litmus_namespace)
|
||||
common_litmus.install_litmus(litmus_version, litmus_namespace)
|
||||
common_litmus.deploy_all_experiments(litmus_version, litmus_namespace)
|
||||
litmus_installed = True
|
||||
common_litmus.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
litmus_uninstall,
|
||||
wait_duration,
|
||||
litmus_namespace,
|
||||
if litmus_install:
|
||||
# Remove Litmus resources
|
||||
# before running the scenarios
|
||||
common_litmus.delete_chaos(
|
||||
litmus_namespace
|
||||
)
|
||||
common_litmus.delete_chaos_experiments(
|
||||
litmus_namespace
|
||||
)
|
||||
if litmus_uninstall_before_run:
|
||||
common_litmus.uninstall_litmus(
|
||||
litmus_version,
|
||||
litmus_namespace
|
||||
)
|
||||
common_litmus.install_litmus(
|
||||
litmus_version,
|
||||
litmus_namespace
|
||||
)
|
||||
common_litmus.deploy_all_experiments(
|
||||
litmus_version,
|
||||
litmus_namespace
|
||||
)
|
||||
litmus_installed = True
|
||||
common_litmus.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
litmus_uninstall,
|
||||
wait_duration,
|
||||
litmus_namespace,
|
||||
)
|
||||
else:
|
||||
logging.error("Litmus scenarios are currently only supported on openshift")
|
||||
logging.error(
|
||||
"Litmus scenarios are currently "
|
||||
"only supported on openshift"
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
# Inject cluster shutdown scenarios
|
||||
elif scenario_type == "cluster_shut_down_scenarios":
|
||||
shut_down.run(scenarios_list, config, wait_duration)
|
||||
shut_down.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration
|
||||
)
|
||||
|
||||
# Inject namespace chaos scenarios
|
||||
elif scenario_type == "namespace_scenarios":
|
||||
logging.info("Running namespace scenarios")
|
||||
namespace_actions.run(
|
||||
scenarios_list, config, wait_duration, failed_post_scenarios, kubeconfig_path
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration,
|
||||
failed_post_scenarios,
|
||||
kubeconfig_path
|
||||
)
|
||||
|
||||
# Inject zone failures
|
||||
elif scenario_type == "zone_outages":
|
||||
logging.info("Inject zone outages")
|
||||
zone_outages.run(scenarios_list, config, wait_duration)
|
||||
zone_outages.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration
|
||||
)
|
||||
|
||||
# Application outages
|
||||
elif scenario_type == "application_outages":
|
||||
logging.info("Injecting application outage")
|
||||
application_outage.run(scenarios_list, config, wait_duration)
|
||||
application_outage.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration
|
||||
)
|
||||
|
||||
# PVC scenarios
|
||||
elif scenario_type == "pvc_scenarios":
|
||||
@@ -247,7 +337,11 @@ def main(cfg):
|
||||
# Network scenarios
|
||||
elif scenario_type == "network_chaos":
|
||||
logging.info("Running Network Chaos")
|
||||
network_chaos.run(scenarios_list, config, wait_duration)
|
||||
network_chaos.run(
|
||||
scenarios_list,
|
||||
config,
|
||||
wait_duration
|
||||
)
|
||||
|
||||
iteration += 1
|
||||
logging.info("")
|
||||
@@ -293,12 +387,15 @@ def main(cfg):
|
||||
common_litmus.uninstall_litmus(litmus_version, litmus_namespace)
|
||||
|
||||
if failed_post_scenarios:
|
||||
logging.error("Post scenarios are still failing at the end of all iterations")
|
||||
logging.error(
|
||||
"Post scenarios are still failing at the end of all iterations"
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
run_dir = os.getcwd() + "/kraken.report"
|
||||
logging.info(
|
||||
"Successfully finished running Kraken. UUID for the run: %s. Report generated at %s. Exiting"
|
||||
"Successfully finished running Kraken. UUID for the run: "
|
||||
"%s. Report generated at %s. Exiting"
|
||||
% (run_uuid, run_dir)
|
||||
)
|
||||
else:
|
||||
@@ -320,7 +417,10 @@ if __name__ == "__main__":
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
||||
handlers=[logging.FileHandler("kraken.report", mode="w"), logging.StreamHandler()],
|
||||
handlers=[
|
||||
logging.FileHandler("kraken.report", mode="w"),
|
||||
logging.StreamHandler()
|
||||
],
|
||||
)
|
||||
if options.cfg is None:
|
||||
logging.error("Please check if you have passed the config")
|
||||
|
||||
@@ -1,89 +0,0 @@
|
||||
{
|
||||
"$schema": "https://json-schema.org/draft/2019-09/schema",
|
||||
"$id": "https://github.com/chaos-kubox/krkn/",
|
||||
"type": "object",
|
||||
"default": {},
|
||||
"title": "Composite scenario for Krkn",
|
||||
"required": [
|
||||
"steps"
|
||||
],
|
||||
"properties": {
|
||||
"iterations": {
|
||||
"type": "integer",
|
||||
"default": 1,
|
||||
"title": "How many iterations to execute",
|
||||
"examples": [
|
||||
3
|
||||
]
|
||||
},
|
||||
"steps": {
|
||||
"type": "array",
|
||||
"default": [],
|
||||
"title": "The steps Schema",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"default": {},
|
||||
"title": "A Schema",
|
||||
"required": [
|
||||
"pod"
|
||||
],
|
||||
"properties": {
|
||||
"pod": {
|
||||
"type": "object",
|
||||
"default": {},
|
||||
"title": "The pod Schema",
|
||||
"required": [
|
||||
"name_pattern",
|
||||
"namespace_pattern"
|
||||
],
|
||||
"properties": {
|
||||
"name_pattern": {
|
||||
"type": "string",
|
||||
"default": "",
|
||||
"title": "The name_pattern Schema",
|
||||
"examples": [
|
||||
""
|
||||
]
|
||||
},
|
||||
"namespace_pattern": {
|
||||
"type": "string",
|
||||
"default": "",
|
||||
"title": "The namespace_pattern Schema",
|
||||
"examples": [
|
||||
""
|
||||
]
|
||||
}
|
||||
},
|
||||
"examples": [{
|
||||
"name_pattern": "test-.*",
|
||||
"namespace_pattern": "default"
|
||||
}]
|
||||
}
|
||||
},
|
||||
"examples": [{
|
||||
"pod": {
|
||||
"name_pattern": "test-.*",
|
||||
"namespace_pattern": "default"
|
||||
}
|
||||
}]
|
||||
},
|
||||
"examples": [
|
||||
[{
|
||||
"pod": {
|
||||
"name_pattern": "test-.*",
|
||||
"namespace_pattern": "default"
|
||||
}
|
||||
}]
|
||||
]
|
||||
}
|
||||
},
|
||||
"examples": [{
|
||||
"iterations": 1,
|
||||
"steps": [{
|
||||
"pod": {
|
||||
"name_pattern": "test-.*",
|
||||
"namespace_pattern": "default"
|
||||
}
|
||||
}]
|
||||
}]
|
||||
}
|
||||
@@ -1,5 +0,0 @@
|
||||
iterations: 1
|
||||
steps:
|
||||
- pod:
|
||||
name_pattern:
|
||||
namespace_pattern:
|
||||
6
scenarios/kube/pod.yml
Normal file
@@ -0,0 +1,6 @@
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
name_pattern: ^nginx-.*$
|
||||
namespace_pattern: ^default$
|
||||
kill: 1
|
||||
@@ -1,32 +1,10 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 30
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: "delete scheduler pods"
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "kube-system"
|
||||
selector: "k8s-app=kube-scheduler"
|
||||
filters:
|
||||
- randomSample:
|
||||
size: 1
|
||||
actions:
|
||||
- kill:
|
||||
probability: 1
|
||||
force: true
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "kube-system"
|
||||
selector: "k8s-app=kube-scheduler"
|
||||
retries:
|
||||
retriesTimeout:
|
||||
timeout: 180
|
||||
|
||||
actions:
|
||||
- checkPodCount:
|
||||
count: 3
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^kube-system$
|
||||
label_selector: k8s-app=kube-scheduler
|
||||
- id: wait-for-pods
|
||||
config:
|
||||
namespace_pattern: ^kube-system$
|
||||
label_selector: k8s-app=kube-scheduler
|
||||
count: 3
|
||||
|
||||
@@ -1,32 +1,10 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 30
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: "delete acme-air pods"
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "acme-air"
|
||||
selector: ""
|
||||
filters:
|
||||
- randomSample:
|
||||
size: 1
|
||||
actions:
|
||||
- kill:
|
||||
probability: 1
|
||||
force: true
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "acme-air"
|
||||
selector: ""
|
||||
retries:
|
||||
retriesTimeout:
|
||||
timeout: 180
|
||||
|
||||
actions:
|
||||
- checkPodCount:
|
||||
count: 8
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^acme-air$
|
||||
name_pattern: .*
|
||||
- id: wait-for-pods
|
||||
config:
|
||||
namespace_pattern: ^acme-air$
|
||||
name_pattern: .*
|
||||
count: 8
|
||||
@@ -1,32 +1,10 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 30
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: "delete etcd pods"
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-etcd"
|
||||
selector: "k8s-app=etcd"
|
||||
filters:
|
||||
- randomSample:
|
||||
size: 1
|
||||
actions:
|
||||
- kill:
|
||||
probability: 1
|
||||
force: true
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-etcd"
|
||||
selector: "k8s-app=etcd"
|
||||
retries:
|
||||
retriesTimeout:
|
||||
timeout: 180
|
||||
|
||||
actions:
|
||||
- checkPodCount:
|
||||
count: 3
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-etcd$
|
||||
label_selector: k8s-app=etcd
|
||||
- id: wait-for-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-etcd$
|
||||
label_selector: k8s-app=etcd
|
||||
count: 3
|
||||
|
||||
17
scenarios/openshift/network_chaos_ingress.yml
Normal file
@@ -0,0 +1,17 @@
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: network_chaos
|
||||
config:
|
||||
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
|
||||
<node_name_1>:
|
||||
- <interface-1>
|
||||
label_selector: <label_selector> # When node_interface_name is not specified, nodes with a matching label_selector are selected for node chaos scenario injection
|
||||
instance_count: <number> # Number of nodes to perform action/select that match the label selector
|
||||
kubeconfig_path: <path> # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
|
||||
execution_type: <serial/parallel> # Used to specify whether you want to apply filters on interfaces one at a time or all at once. Default is 'parallel'
|
||||
network_params: # latency, loss and bandwidth are the three supported network parameters to alter for the chaos test
|
||||
latency: <time> # Value is a string. For example : 50ms
|
||||
loss: <fraction> # Loss is a fraction between 0 and 1. It has to be enclosed in quotes to treat it as a string. For example, '0.02' (not 0.02)
|
||||
bandwidth: <rate> # Value is a string. For example: 100mbit
|
||||
wait_duration: <time_duration> # Default is 300. Ensure that it is at least about twice of test_duration
|
||||
test_duration: <time_duration> # Default is 120
|
||||
kraken_config: <path> # Specify this if you want to use Cerberus config
|
||||
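For illustration, a filled-in version of the template above might look like the following sketch; the node and interface names are placeholders, and the latency/loss/bandwidth values mirror the ones exercised in tests/test_ingress_network_plugin.py:

```yaml
# yaml-language-server: $schema=../plugin.schema.json
- id: network_chaos
  config:
    node_interface_name:
      kind-worker:        # placeholder node name
        - ens5            # placeholder interface name
    execution_type: parallel
    network_params:
      latency: 50ms
      loss: '0.02'
      bandwidth: 100mbit
    test_duration: 120
    wait_duration: 300
```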
@@ -1,35 +1,10 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 30
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: "delete openshift-apiserver pods"
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-apiserver"
|
||||
selector: "app=openshift-apiserver-a"
|
||||
|
||||
filters:
|
||||
- randomSample:
|
||||
size: 1
|
||||
|
||||
# The actions will be executed in the order specified
|
||||
actions:
|
||||
- kill:
|
||||
probability: 1
|
||||
force: true
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-apiserver"
|
||||
selector: "app=openshift-apiserver-a"
|
||||
retries:
|
||||
retriesTimeout:
|
||||
timeout: 180
|
||||
|
||||
actions:
|
||||
- checkPodCount:
|
||||
count: 3
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-apiserver$
|
||||
label_selector: app=openshift-apiserver-a
|
||||
- id: wait-for-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-apiserver$
|
||||
label_selector: app=openshift-apiserver-a
|
||||
count: 3
|
||||
|
||||
@@ -1,3 +1,7 @@
|
||||
# yaml-language-server: $schema=../pod.schema.json
|
||||
namespace_pattern: openshift-kube-apiserver
|
||||
kill: 1
|
||||
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
|
||||
@@ -1,21 +1,10 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 10
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: "check 2 pods are in namespace with selector: prometheus"
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-monitoring"
|
||||
selector: "app=prometheus"
|
||||
filters:
|
||||
- property:
|
||||
name: "state"
|
||||
value: "Running"
|
||||
# The actions will be executed in the order specified
|
||||
actions:
|
||||
- checkPodCount:
|
||||
count: 2
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-monitoring$
|
||||
label_selector: app=prometheus
|
||||
- id: wait-for-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-monitoring$
|
||||
label_selector: app=prometheus
|
||||
count: 2
|
||||
@@ -1,71 +1,90 @@
|
||||
#!/usr/bin/env python3
|
||||
import subprocess
|
||||
import logging
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
from kubernetes import client, config
|
||||
from kubernetes.client.rest import ApiException
|
||||
import logging
|
||||
|
||||
|
||||
# List all namespaces
|
||||
def list_namespaces():
|
||||
namespaces = []
|
||||
"""
|
||||
List all namespaces
|
||||
"""
|
||||
spaces_list = []
|
||||
try:
|
||||
config.load_kube_config()
|
||||
cli = client.CoreV1Api()
|
||||
ret = cli.list_namespace(pretty=True)
|
||||
except ApiException as e:
|
||||
logging.error(
|
||||
"Exception when calling \
|
||||
CoreV1Api->list_namespaced_pod: %s\n"
|
||||
% e
|
||||
"Exception when calling CoreV1Api->list_namespace: %s\n",
|
||||
e
|
||||
)
|
||||
for namespace in ret.items:
|
||||
namespaces.append(namespace.metadata.name)
|
||||
return namespaces
|
||||
for current_namespace in ret.items:
|
||||
spaces_list.append(current_namespace.metadata.name)
|
||||
return spaces_list
|
||||
|
||||
|
||||
# Check if all the watch_namespaces are valid
|
||||
def check_namespaces(namespaces):
|
||||
"""
|
||||
Check if all the watch_namespaces are valid
|
||||
"""
|
||||
try:
|
||||
valid_namespaces = list_namespaces()
|
||||
regex_namespaces = set(namespaces) - set(valid_namespaces)
|
||||
final_namespaces = set(namespaces) - set(regex_namespaces)
|
||||
valid_regex = set()
|
||||
if regex_namespaces:
|
||||
for namespace in valid_namespaces:
|
||||
for current_ns in valid_namespaces:
|
||||
for regex_namespace in regex_namespaces:
|
||||
if re.search(regex_namespace, namespace):
|
||||
final_namespaces.add(namespace)
|
||||
if re.search(regex_namespace, current_ns):
|
||||
final_namespaces.add(current_ns)
|
||||
valid_regex.add(regex_namespace)
|
||||
break
|
||||
invalid_namespaces = regex_namespaces - valid_regex
|
||||
if invalid_namespaces:
|
||||
raise Exception("There exists no namespaces matching: %s" % (invalid_namespaces))
|
||||
raise Exception(
|
||||
"There exists no namespaces matching: %s" % (
|
||||
invalid_namespaces
|
||||
)
|
||||
)
|
||||
return list(final_namespaces)
|
||||
except Exception as e:
|
||||
logging.error("%s" % (e))
|
||||
logging.error(str(e))
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def run(cmd):
|
||||
try:
|
||||
output = subprocess.Popen(
|
||||
cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
|
||||
cmd,
|
||||
shell=True,
|
||||
universal_newlines=True,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.STDOUT
|
||||
)
|
||||
(out, err) = output.communicate()
|
||||
except Exception as e:
|
||||
logging.error("Failed to run %s, error: %s" % (cmd, e))
|
||||
logging.error("Failed to run %s, error: %s", cmd, e)
|
||||
return out
|
||||
|
||||
|
||||
regex_namespace = ["openshift-.*"]
|
||||
namespaces = check_namespaces(regex_namespace)
|
||||
pods_running = 0
|
||||
for namespace in namespaces:
|
||||
new_pods_running = run("oc get pods -n " + namespace + " | grep -c Running").rstrip()
|
||||
try:
|
||||
pods_running += int(new_pods_running)
|
||||
except Exception:
|
||||
continue
|
||||
print(pods_running)
|
||||
def print_running_pods():
|
||||
regex_namespace_list = ["openshift-.*"]
|
||||
checked_namespaces = check_namespaces(regex_namespace_list)
|
||||
pods_running = 0
|
||||
for namespace in checked_namespaces:
|
||||
new_pods_running = run(
|
||||
"oc get pods -n " + namespace + " | grep -c Running"
|
||||
).rstrip()
|
||||
try:
|
||||
pods_running += int(new_pods_running)
|
||||
except Exception:
|
||||
continue
|
||||
print(pods_running)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
print_running_pods()
|
||||
|
||||
@@ -1,35 +1,11 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 30
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: "delete prometheus pods"
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-monitoring"
|
||||
selector: "app=prometheus"
|
||||
|
||||
filters:
|
||||
- randomSample:
|
||||
size: 1
|
||||
|
||||
# The actions will be executed in the order specified
|
||||
actions:
|
||||
- kill:
|
||||
probability: 1
|
||||
force: true
|
||||
- podAction:
|
||||
matches:
|
||||
- labels:
|
||||
namespace: "openshift-monitoring"
|
||||
selector: "app=prometheus"
|
||||
retries:
|
||||
retriesTimeout:
|
||||
timeout: 180
|
||||
|
||||
actions:
|
||||
- checkPodCount:
|
||||
count: 2
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-monitoring$
|
||||
label_selector: app=prometheus
|
||||
- id: wait-for-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-monitoring$
|
||||
label_selector: app=prometheus
|
||||
count: 2
|
||||
timeout: 180
|
||||
@@ -1,20 +1,6 @@
|
||||
config:
|
||||
runStrategy:
|
||||
runs: 1
|
||||
maxSecondsBetweenRuns: 30
|
||||
minSecondsBetweenRuns: 1
|
||||
scenarios:
|
||||
- name: kill up to 3 pods in any openshift namespace
|
||||
steps:
|
||||
- podAction:
|
||||
matches:
|
||||
- namespace: "openshift-.*"
|
||||
filters:
|
||||
- property:
|
||||
name: "state"
|
||||
value: "Running"
|
||||
- randomSample:
|
||||
size: 3
|
||||
actions:
|
||||
- kill:
|
||||
probability: .7
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: kill-pods
|
||||
config:
|
||||
namespace_pattern: ^openshift-.*$
|
||||
name_pattern: .*
|
||||
kill: 3
|
||||
|
||||
10
scenarios/openshift/vmware_node_scenarios.yml
Normal file
@@ -0,0 +1,10 @@
|
||||
# yaml-language-server: $schema=../plugin.schema.json
|
||||
- id: <node_stop_scenario/node_start_scenario/node_reboot_scenario/node_terminate_scenario>
|
||||
config:
|
||||
name: <node_name> # Node on which scenario has to be injected; can set multiple names separated by comma
|
||||
label_selector: <label_selector> # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
|
||||
runs: 1 # Number of times to inject each scenario under actions (will perform on same node each time)
|
||||
instance_count: 1 # Number of nodes to perform action/select that match the label selector
|
||||
timeout: 300 # Duration to wait for completion of node scenario injection
|
||||
verify_session: True # Set to True if you want to verify the vSphere client session using certificates; else False
|
||||
skip_openshift_checks: False # Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario
|
||||
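As a concrete sketch of the template above, a node stop scenario against a single named node could look like this; the node name is a placeholder and the remaining values are the defaults documented in the template:

```yaml
# yaml-language-server: $schema=../plugin.schema.json
- id: node_stop_scenario
  config:
    name: example-worker-0      # placeholder node name
    runs: 1
    instance_count: 1
    timeout: 300
    verify_session: True
    skip_openshift_checks: False
```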
5
scenarios/plugin.schema.README.md
Normal file
@@ -0,0 +1,5 @@
|
||||
This file is generated by running the "plugins" module in the kraken project:
|
||||
|
||||
```
|
||||
python -m kraken.plugins >scenarios/plugin.schema.json
|
||||
```
|
||||
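Scenario files opt into editor validation against the generated schema through a yaml-language-server modeline; for example, the scenarios/kube/pod.yml file added in this change does exactly that (the relative schema path depends on where the scenario file lives):

```yaml
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    name_pattern: ^nginx-.*$
    namespace_pattern: ^default$
    kill: 1
```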
157
scenarios/plugin.schema.json
Normal file
@@ -0,0 +1,157 @@
|
||||
{
|
||||
"$id": "https://github.com/redhat-chaos/krkn/",
|
||||
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
||||
"title": "Kraken Arcaflow scenarios",
|
||||
"description": "Serial execution of Arcaflow Python plugins. See https://github.com/arcaflow for details.",
|
||||
"type": "array",
|
||||
"minContains": 1,
|
||||
"items": {
|
||||
"oneOf": [
|
||||
{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"id": {
|
||||
"type": "string",
|
||||
"const": "kill-pods"
|
||||
},
|
||||
"config": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"namespace_pattern": {
|
||||
"type": "string",
|
||||
"format": "regex",
|
||||
"title": "Namespace pattern",
|
||||
"description": "Regular expression for target pod namespaces."
|
||||
},
|
||||
"name_pattern": {
|
||||
"type": "string",
|
||||
"format": "regex",
|
||||
"title": "Name pattern",
|
||||
"description": "Regular expression for target pods. Required if label_selector is not set."
|
||||
},
|
||||
"kill": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"title": "Number of pods to kill",
|
||||
"description": "How many pods should we attempt to kill?"
|
||||
},
|
||||
"label_selector": {
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"title": "Label selector",
|
||||
"description": "Kubernetes label selector for the target pods. Required if name_pattern is not set.\nSee https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details."
|
||||
},
|
||||
"kubeconfig_path": {
|
||||
"type": "string",
|
||||
"title": "Kubeconfig path",
|
||||
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\nSee https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for details."
|
||||
},
|
||||
"timeout": {
|
||||
"type": "integer",
|
||||
"title": "Timeout",
|
||||
"description": "Timeout to wait for the target pod(s) to be removed in seconds."
|
||||
},
|
||||
"backoff": {
|
||||
"type": "integer",
|
||||
"title": "Backoff",
|
||||
"description": "How many seconds to wait between checks for the target pod status."
|
||||
}
|
||||
},
|
||||
"additionalProperties": false,
|
||||
"required": [
|
||||
"namespace_pattern"
|
||||
]
|
||||
}
|
||||
},
|
||||
"required": [
|
||||
"id",
|
||||
"config"
|
||||
]
|
||||
},
|
||||
{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"id": {
|
||||
"type": "string",
|
||||
"const": "wait-for-pods"
|
||||
},
|
||||
"config": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"namespace_pattern": {
|
||||
"type": "string",
|
||||
"format": "regex",
|
||||
"title": "namespace_pattern"
|
||||
},
|
||||
"name_pattern": {
|
||||
"type": "string",
|
||||
"format": "regex",
|
||||
"title": "name_pattern"
|
||||
},
|
||||
"label_selector": {
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"title": "label_selector"
|
||||
},
|
||||
"count": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"title": "Pod count",
|
||||
"description": "Wait for at least this many pods to exist"
|
||||
},
|
||||
"timeout": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"title": "Timeout",
|
||||
"description": "How many seconds to wait for?"
|
||||
},
|
||||
"backoff": {
|
||||
"type": "integer",
|
||||
"title": "Backoff",
|
||||
"description": "How many seconds to wait between checks for the target pod status."
|
||||
},
|
||||
"kubeconfig_path": {
|
||||
"type": "string",
|
||||
"title": "kubeconfig_path"
|
||||
}
|
||||
},
|
||||
"additionalProperties": false,
|
||||
"required": [
|
||||
"namespace_pattern"
|
||||
]
|
||||
}
|
||||
},
|
||||
"required": [
|
||||
"id",
|
||||
"config"
|
||||
]
|
||||
},
|
||||
{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"id": {
|
||||
"type": "string",
|
||||
"const": "run_python"
|
||||
},
|
||||
"config": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"filename": {
|
||||
"type": "string",
|
||||
"title": "filename"
|
||||
}
|
||||
},
|
||||
"additionalProperties": false,
|
||||
"required": [
|
||||
"filename"
|
||||
]
|
||||
}
|
||||
},
|
||||
"required": [
|
||||
"id",
|
||||
"config"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
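The optional timeout and backoff knobs defined above are not exercised by the scenario files in this change; a minimal sketch of a kill-and-wait scenario that sets them explicitly (the selector, counts and wait values are illustrative):

```yaml
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^openshift-etcd$
    label_selector: k8s-app=etcd
    kill: 1
    timeout: 180    # illustrative: seconds to wait for the killed pod(s) to be removed
    backoff: 5      # illustrative: seconds between status checks
- id: wait-for-pods
  config:
    namespace_pattern: ^openshift-etcd$
    label_selector: k8s-app=etcd
    count: 3
    timeout: 180
    backoff: 5
```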
31
server.py
@@ -4,9 +4,13 @@ import _thread
|
||||
from http.server import HTTPServer, BaseHTTPRequestHandler
|
||||
from http.client import HTTPConnection
|
||||
|
||||
server_status = ""
|
||||
|
||||
# Start a simple http server to publish the cerberus status file content
|
||||
class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
|
||||
"""
|
||||
A simple http server to publish the cerberus status file content
|
||||
"""
|
||||
|
||||
requests_served = 0
|
||||
|
||||
def do_GET(self):
|
||||
@@ -16,9 +20,8 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
|
||||
def do_status(self):
|
||||
self.send_response(200)
|
||||
self.end_headers()
|
||||
f = open("/tmp/kraken_status", "rb")
|
||||
self.wfile.write(f.read())
|
||||
SimpleHTTPRequestHandler.requests_served = SimpleHTTPRequestHandler.requests_served + 1
|
||||
self.wfile.write(bytes(server_status, encoding='utf8'))
|
||||
SimpleHTTPRequestHandler.requests_served += 1
|
||||
|
||||
def do_POST(self):
|
||||
if self.path == "/STOP":
|
||||
@@ -31,23 +34,26 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
|
||||
def set_run(self):
|
||||
self.send_response(200)
|
||||
self.end_headers()
|
||||
with open("/tmp/kraken_status", "w+") as file:
|
||||
file.write(str("RUN"))
|
||||
global server_status
|
||||
server_status = 'RUN'
|
||||
|
||||
def set_stop(self):
|
||||
self.send_response(200)
|
||||
self.end_headers()
|
||||
with open("/tmp/kraken_status", "w+") as file:
|
||||
file.write(str("STOP"))
|
||||
global server_status
|
||||
server_status = 'STOP'
|
||||
|
||||
def set_pause(self):
|
||||
self.send_response(200)
|
||||
self.end_headers()
|
||||
with open("/tmp/kraken_status", "w+") as file:
|
||||
file.write(str("PAUSE"))
|
||||
global server_status
|
||||
server_status = 'PAUSE'
|
||||
|
||||
def publish_kraken_status(status):
|
||||
global server_status
|
||||
server_status = status
|
||||
|
||||
def start_server(address):
|
||||
def start_server(address, status):
|
||||
server = address[0]
|
||||
port = address[1]
|
||||
global httpd
|
||||
@@ -55,7 +61,8 @@ def start_server(address):
|
||||
logging.info("Starting http server at http://%s:%s\n" % (server, port))
|
||||
try:
|
||||
_thread.start_new_thread(httpd.serve_forever, ())
|
||||
except Exception:
|
||||
publish_kraken_status(status)
|
||||
except Exception as e:
|
||||
logging.error(
|
||||
"Failed to start the http server \
|
||||
at http://%s:%s"
|
||||
|
||||
61
tests/test_ingress_network_plugin.py
Normal file
@@ -0,0 +1,61 @@
|
||||
import unittest
|
||||
import logging
|
||||
from arcaflow_plugin_sdk import plugin
|
||||
from kraken.plugins.network import ingress_shaping
|
||||
|
||||
|
||||
class NetworkScenariosTest(unittest.TestCase):
|
||||
|
||||
def test_serialization(self):
|
||||
plugin.test_object_serialization(
|
||||
ingress_shaping.NetworkScenarioConfig(
|
||||
node_interface_name={"foo": ['bar']},
|
||||
network_params={
|
||||
"latency": "50ms",
|
||||
"loss": "0.02",
|
||||
"bandwidth": "100mbit"
|
||||
}
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
ingress_shaping.NetworkScenarioSuccessOutput(
|
||||
filter_direction="ingress",
|
||||
test_interfaces={"foo": ['bar']},
|
||||
network_parameters={
|
||||
"latency": "50ms",
|
||||
"loss": "0.02",
|
||||
"bandwidth": "100mbit"
|
||||
},
|
||||
execution_type="parallel"),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
ingress_shaping.NetworkScenarioErrorOutput(
|
||||
error="Hello World",
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
|
||||
def test_network_chaos(self):
|
||||
output_id, output_data = ingress_shaping.network_chaos(
|
||||
ingress_shaping.NetworkScenarioConfig(
|
||||
label_selector="node-role.kubernetes.io/master",
|
||||
instance_count=1,
|
||||
network_params={
|
||||
"latency": "50ms",
|
||||
"loss": "0.02",
|
||||
"bandwidth": "100mbit"
|
||||
}
|
||||
)
|
||||
)
|
||||
if output_id == "error":
|
||||
logging.error(output_data.error)
|
||||
self.fail(
|
||||
"The network chaos scenario did not complete successfully "
|
||||
"because an error/exception occurred"
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
175
tests/test_pod_plugin.py
Normal file
@@ -0,0 +1,175 @@
|
||||
import random
|
||||
import re
|
||||
import string
|
||||
import threading
|
||||
import unittest
|
||||
|
||||
from arcaflow_plugin_sdk import plugin
|
||||
from kubernetes.client import V1Pod, V1ObjectMeta, V1PodSpec, V1Container, ApiException
|
||||
|
||||
from kraken.plugins import pod_plugin
|
||||
from kraken.plugins.pod_plugin import setup_kubernetes, KillPodConfig, PodKillSuccessOutput
|
||||
from kubernetes import client
|
||||
|
||||
|
||||
class KillPodTest(unittest.TestCase):
|
||||
def test_serialization(self):
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.KillPodConfig(
|
||||
namespace_pattern=re.compile(".*"),
|
||||
name_pattern=re.compile(".*")
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.PodKillSuccessOutput(
|
||||
pods={}
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.PodErrorOutput(
|
||||
error="Hello world!"
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
|
||||
def test_not_enough_pods(self):
|
||||
name = ''.join(random.choices(string.ascii_lowercase, k=8))
|
||||
output_id, output_data = pod_plugin.kill_pods(KillPodConfig(
|
||||
namespace_pattern=re.compile("^default$"),
|
||||
name_pattern=re.compile("^unit-test-" + re.escape(name) + "$"),
|
||||
))
|
||||
if output_id != "error":
|
||||
self.fail("Not enough pods did not result in an error.")
|
||||
print(output_data.error)
|
||||
|
||||
def test_kill_pod(self):
|
||||
with setup_kubernetes(None) as cli:
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
pod = core_v1.create_namespaced_pod("default", V1Pod(
|
||||
metadata=V1ObjectMeta(
|
||||
generate_name="test-",
|
||||
),
|
||||
spec=V1PodSpec(
|
||||
containers=[
|
||||
V1Container(
|
||||
name="test",
|
||||
image="alpine",
|
||||
tty=True,
|
||||
)
|
||||
]
|
||||
),
|
||||
))
|
||||
|
||||
def remove_test_pod():
|
||||
try:
|
||||
core_v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
|
||||
except ApiException as e:
|
||||
if e.status != 404:
|
||||
raise
|
||||
|
||||
self.addCleanup(remove_test_pod)
|
||||
|
||||
output_id, output_data = pod_plugin.kill_pods(KillPodConfig(
|
||||
namespace_pattern=re.compile("^default$"),
|
||||
name_pattern=re.compile("^" + re.escape(pod.metadata.name) + "$"),
|
||||
))
|
||||
|
||||
if output_id == "error":
|
||||
self.fail(output_data.error)
|
||||
self.assertIsInstance(output_data, PodKillSuccessOutput)
|
||||
out: PodKillSuccessOutput = output_data
|
||||
self.assertEqual(1, len(out.pods))
|
||||
pod_list = list(out.pods.values())
|
||||
self.assertEqual(pod.metadata.name, pod_list[0].name)
|
||||
|
||||
try:
|
||||
core_v1.read_namespaced_pod(pod_list[0].name, pod_list[0].namespace)
|
||||
self.fail("Killed pod is still present.")
|
||||
except ApiException as e:
|
||||
if e.status != 404:
|
||||
self.fail("Incorrect API exception encountered: {}".format(e))
|
||||
|
||||
|
||||
class WaitForPodTest(unittest.TestCase):
|
||||
def test_serialization(self):
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.WaitForPodsConfig(
|
||||
namespace_pattern=re.compile(".*"),
|
||||
name_pattern=re.compile(".*")
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.WaitForPodsConfig(
|
||||
namespace_pattern=re.compile(".*"),
|
||||
label_selector="app=nginx"
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.PodWaitSuccessOutput(
|
||||
pods=[]
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
pod_plugin.PodErrorOutput(
|
||||
error="Hello world!"
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
|
||||
def test_timeout(self):
|
||||
name = "watch-test-" + ''.join(random.choices(string.ascii_lowercase, k=8))
|
||||
output_id, output_data = pod_plugin.wait_for_pods(pod_plugin.WaitForPodsConfig(
|
||||
namespace_pattern=re.compile("^default$"),
|
||||
name_pattern=re.compile("^" + re.escape(name) + "$"),
|
||||
timeout=1
|
||||
))
|
||||
self.assertEqual("error", output_id)
|
||||
|
||||
def test_watch(self):
|
||||
with setup_kubernetes(None) as cli:
|
||||
core_v1 = client.CoreV1Api(cli)
|
||||
name = "watch-test-" + ''.join(random.choices(string.ascii_lowercase, k=8))
|
||||
|
||||
def create_test_pod():
|
||||
core_v1.create_namespaced_pod("default", V1Pod(
|
||||
metadata=V1ObjectMeta(
|
||||
name=name,
|
||||
),
|
||||
spec=V1PodSpec(
|
||||
containers=[
|
||||
V1Container(
|
||||
name="test",
|
||||
image="alpine",
|
||||
tty=True,
|
||||
)
|
||||
]
|
||||
),
|
||||
))
|
||||
|
||||
def remove_test_pod():
|
||||
try:
|
||||
core_v1.delete_namespaced_pod(name, "default")
|
||||
except ApiException as e:
|
||||
if e.status != 404:
|
||||
raise
|
||||
|
||||
self.addCleanup(remove_test_pod)
|
||||
|
||||
t = threading.Timer(10, create_test_pod)
|
||||
t.start()
|
||||
|
||||
output_id, output_data = pod_plugin.wait_for_pods(pod_plugin.WaitForPodsConfig(
|
||||
namespace_pattern=re.compile("^default$"),
|
||||
name_pattern=re.compile("^" + re.escape(name) + "$"),
|
||||
timeout=60
|
||||
))
|
||||
self.assertEqual("success", output_id)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
28
tests/test_run_python_plugin.py
Normal file
@@ -0,0 +1,28 @@
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
from kraken.plugins import run_python_file
|
||||
from kraken.plugins.run_python_plugin import RunPythonFileInput
|
||||
|
||||
|
||||
class RunPythonPluginTest(unittest.TestCase):
|
||||
def test_success_execution(self):
|
||||
tmp_file = tempfile.NamedTemporaryFile()
|
||||
tmp_file.write(bytes("print('Hello world!')", 'utf-8'))
|
||||
tmp_file.flush()
|
||||
output_id, output_data = run_python_file(RunPythonFileInput(tmp_file.name))
|
||||
self.assertEqual("success", output_id)
|
||||
self.assertEqual("Hello world!\n", output_data.stdout)
|
||||
|
||||
def test_error_execution(self):
|
||||
tmp_file = tempfile.NamedTemporaryFile()
|
||||
tmp_file.write(bytes("import sys\nprint('Hello world!')\nsys.exit(42)\n", 'utf-8'))
|
||||
tmp_file.flush()
|
||||
output_id, output_data = run_python_file(RunPythonFileInput(tmp_file.name))
|
||||
self.assertEqual("error", output_id)
|
||||
self.assertEqual(42, output_data.exit_code)
|
||||
self.assertEqual("Hello world!\n", output_data.stdout)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
129
tests/test_vmware_plugin.py
Normal file
@@ -0,0 +1,129 @@
|
||||
import unittest
|
||||
import os
|
||||
import logging
|
||||
from arcaflow_plugin_sdk import plugin
|
||||
from kraken.plugins.vmware.kubernetes_functions import Actions
|
||||
from kraken.plugins.vmware import vmware_plugin
|
||||
|
||||
|
||||
class NodeScenariosTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
vsphere_env_vars = [
|
||||
"VSPHERE_IP",
|
||||
"VSPHERE_USERNAME",
|
||||
"VSPHERE_PASSWORD"
|
||||
]
|
||||
self.credentials_present = all(
|
||||
env_var in os.environ for env_var in vsphere_env_vars
|
||||
)
|
||||
|
||||
def test_serialization(self):
|
||||
plugin.test_object_serialization(
|
||||
vmware_plugin.NodeScenarioConfig(
|
||||
name="test",
|
||||
skip_openshift_checks=True
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
vmware_plugin.NodeScenarioSuccessOutput(
|
||||
nodes={}, action=Actions.START
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
plugin.test_object_serialization(
|
||||
vmware_plugin.NodeScenarioErrorOutput(
|
||||
error="Hello World", action=Actions.START
|
||||
),
|
||||
self.fail,
|
||||
)
|
||||
|
||||
def test_node_start(self):
|
||||
if not self.credentials_present:
|
||||
self.skipTest(
|
||||
"Check if the environmental variables 'VSPHERE_IP', "
|
||||
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
|
||||
)
|
||||
vsphere = vmware_plugin.vSphere(verify=False)
|
||||
vm_id, vm_name = vsphere.create_default_vm()
|
||||
if vm_id is None:
|
||||
self.fail("Could not create test VM")
|
||||
|
||||
output_id, output_data = vmware_plugin.node_start(
|
||||
vmware_plugin.NodeScenarioConfig(
|
||||
name=vm_name, skip_openshift_checks=True, verify_session=False
|
||||
)
|
||||
)
|
||||
if output_id == "error":
|
||||
logging.error(output_data.error)
|
||||
self.fail("The VMware VM did not start because an error occurred")
|
||||
vsphere.release_instances(vm_name)
|
||||
|
||||
def test_node_stop(self):
|
||||
if not self.credentials_present:
|
||||
self.skipTest(
|
||||
"Check if the environmental variables 'VSPHERE_IP', "
|
||||
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
|
||||
)
|
||||
vsphere = vmware_plugin.vSphere(verify=False)
|
||||
vm_id, vm_name = vsphere.create_default_vm()
|
||||
if vm_id is None:
|
||||
self.fail("Could not create test VM")
|
||||
vsphere.start_instances(vm_name)
|
||||
|
||||
output_id, output_data = vmware_plugin.node_stop(
|
||||
vmware_plugin.NodeScenarioConfig(
|
||||
name=vm_name, skip_openshift_checks=True, verify_session=False
|
||||
)
|
||||
)
|
||||
if output_id == "error":
|
||||
logging.error(output_data.error)
|
||||
self.fail("The VMware VM did not stop because an error occurred")
|
||||
vsphere.release_instances(vm_name)
|
||||
|
||||
def test_node_reboot(self):
|
||||
if not self.credentials_present:
|
||||
self.skipTest(
|
||||
"Check if the environmental variables 'VSPHERE_IP', "
|
||||
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
|
||||
)
|
||||
vsphere = vmware_plugin.vSphere(verify=False)
|
||||
vm_id, vm_name = vsphere.create_default_vm()
|
||||
if vm_id is None:
|
||||
self.fail("Could not create test VM")
|
||||
vsphere.start_instances(vm_name)
|
||||
|
||||
output_id, output_data = vmware_plugin.node_reboot(
|
||||
vmware_plugin.NodeScenarioConfig(
|
||||
name=vm_name, skip_openshift_checks=True, verify_session=False
|
||||
)
|
||||
)
|
||||
if output_id == "error":
|
||||
logging.error(output_data.error)
|
||||
self.fail("The VMware VM did not reboot because an error occurred")
|
||||
vsphere.release_instances(vm_name)
|
||||
|
||||
def test_node_terminate(self):
|
||||
if not self.credentials_present:
|
||||
self.skipTest(
|
||||
"Check if the environmental variables 'VSPHERE_IP', "
|
||||
"'VSPHERE_USERNAME', 'VSPHERE_PASSWORD' are set"
|
||||
)
|
||||
vsphere = vmware_plugin.vSphere(verify=False)
|
||||
vm_id, vm_name = vsphere.create_default_vm()
|
||||
if vm_id is None:
|
||||
self.fail("Could not create test VM")
|
||||
vsphere.start_instances(vm_name)
|
||||
|
||||
output_id, output_data = vmware_plugin.node_terminate(
|
||||
vmware_plugin.NodeScenarioConfig(
|
||||
name=vm_name, skip_openshift_checks=True, verify_session=False
|
||||
)
|
||||
)
|
||||
if output_id == "error":
|
||||
logging.error(output_data.error)
|
||||
self.fail("The VMware VM did not reboot because an error occurred")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||