Compare commits

...

37 Commits

Author SHA1 Message Date
Tullio Sebastiani
b9c08a45db extracted the namespace as scenario input (#419)
fixed sub-workflow and input

Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2023-05-15 18:24:23 +02:00
Naga Ravi Chaitanya Elluri
d9f4607aa6 Add blogs and update roadmap 2023-05-15 11:50:16 -04:00
yogananth-subramanian
8806781a4f Pod network outage Chaos scenario
The pod network outage chaos scenario blocks traffic at the pod level irrespective of the network policy in use.
With current network policies, it is not possible to explicitly block ports that are enabled
by an allow rule. This chaos scenario addresses the issue by using OVS flow rules
to block the ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.

The example config below blocks access to the OpenShift console.
````
- id: pod_network_outage
  config:
    namespace: openshift-console
    direction:
        - ingress
    ingress_ports:
        - 8443
    label_selector: 'component=ui'
````
2023-05-15 10:43:58 -04:00
Tullio Sebastiani
83b811bee4 Arcaflow stress-ng hogs with parallelism support (#418)
* kubeconfig management for arcaflow + hog scenario refactoring

  * kubeconfig authentication parsing refactored to support the arcaflow Kubernetes deployer
  * reimplemented all the hog scenarios to allow multiple parallel containers of the same scenario
  (e.g. to stress two or more nodes simultaneously in the same run)
  * updated documentation
* removed sysbench scenarios


* recovered cpu hogs


* updated requirements.txt


* updated config.yaml

* added gitleaks file for test fixtures

* imported sys and logging

* removed config_arcaflow.yaml

* updated readme

* refactored arcaflow documentation entrypoint
2023-05-15 09:45:16 -04:00
Paige Rubendall
16ea18c718 Ibm plugin node scenario (#417)
* Node scenarios for ibmcloud

* adding openshift check info
2023-05-09 12:07:38 -04:00
Naga Ravi Chaitanya Elluri
1ab94754e3 Add missing parameters supported by container scenarios (#415)
Also renames retry_wait to expected_recovery_time to make it clear that
Kraken will exit 1 if the container doesn't recover within the expected
time.
Fixes https://github.com/redhat-chaos/krkn/issues/414
2023-05-05 13:02:07 -04:00
Tullio Sebastiani
278b2bafd7 Kraken is pointing to a buggy kill-pod plugin implementation (#416) 2023-05-04 18:19:54 +02:00
Naga Ravi Chaitanya Elluri
bc863fa01f Add support to check for critical alerts
This commit enables users to opt in to checking for critical alerts firing
in the cluster post chaos, at the end of each scenario. A chaos scenario is
considered failed if the cluster is unhealthy, in which case the user can
start debugging to fix and harden the respective areas.

Fixes https://github.com/redhat-chaos/krkn/issues/410
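For reference, this is the opt-in toggle the commit introduces, as it also appears in the config and docs diffs further down this page (set to True here to show the check enabled):
```
performance_monitoring:
  check_critical_alerts: True   # when enabled, checks Prometheus for critical alerts firing post chaos
```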
2023-05-03 16:14:13 -04:00
Naga Ravi Chaitanya Elluri
900ca74d80 Reorganize the content from https://github.com/startx-lab (#346)
Moves the content about installing Kraken using Helm to the
chaos-in-practice section of the guide to showcase how startx-lab
is deploying and leveraging Kraken.
2023-04-24 13:51:49 -04:00
Tullio Sebastiani
82b8df4e85 kill-pod plugin dependency pointing to specific commit
switched to redhat-chaos repo
2023-04-20 08:26:51 -04:00
Tullio Sebastiani
691be66b0a kubeconfig_path in new_client_from_config
added clients in the same context of the config
2023-04-19 14:12:46 -04:00
Tullio Sebastiani
019b036f9f renamed trigger word from /test to funtest (#401)
added quotes


renamed trigger to funtest

Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2023-04-10 09:30:53 -04:00
Paige Rubendall
13fa711c9b adding privileged namespace (#399)
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2023-04-06 16:18:57 -04:00
Naga Ravi Chaitanya Elluri
17f61625e4 Exit on critical alert failures
This commit captures and exits on a non-zero return code, i.e. when
critical alerts are firing.

Fixes https://github.com/redhat-chaos/krkn/issues/396
2023-03-27 12:43:57 -04:00
Tullio Sebastiani
3627b5ba88 cpu hog scenario + basic arcaflow documentation (#391)
typo


typo


updated documentation


fixed workflow map issue
2023-03-15 16:52:20 +01:00
Tullio Sebastiani
fee4f7d2bf arcaflow integration (#384)
arcaflow library version

Co-authored-by: Tullio Sebastiani <tsebasti@redhat.com>
2023-03-08 12:01:03 +01:00
Tullio Sebastiani
0534e03c48 removed useless step that was failing (#389)
removed only old namespace file cat

Co-authored-by: Tullio Sebastiani <tsebasti@redhat.com>
2023-02-23 16:28:09 +01:00
Tullio Sebastiani
bb9a19ab71 removed blocking event check 2023-02-22 09:41:52 -05:00
Tullio Sebastiani
c5b9554de5 check user's authorization before running functional tests
check users authorization before running functional tests


removed useless checkout


step rename


typo in trigger
2023-02-21 12:38:34 -05:00
dependabot[bot]
e5f97434d3 Bump werkzeug from 2.0.3 to 2.2.3 (#385)
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.0.3 to 2.2.3.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/2.0.3...2.2.3)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2023-02-20 14:34:31 -05:00
Tullio Sebastiani
8b18fa8a35 Github Action + functional tests (no *hog tests) (#382)
* Github Action + functional tests (no *hog tests)

* changed the trigger keyword to /test

* removed deprecated kill_pod scenario + added namespace to app_outage (new kill_pod)

* #365: renamed ingress_namespace scenario to network_diagnostics

* requested team filter added

---------

Co-authored-by: Tullio Sebastiani <tullio.sebastiani@x3solutions.it>
2023-02-16 09:42:33 +01:00
Paige Rubendall
93686ca736 new quay image reference 2023-01-31 17:21:45 -05:00
Naga Ravi Chaitanya Elluri
64f4c234e9 Add prom token creation step
This enables compatibility with all OpenShift versions.
Reference PR by Paige in Cerberus: https://github.com/redhat-chaos/cerberus/pull/190.
2023-01-31 12:36:09 -05:00
Naga Ravi Chaitanya Elluri
915cc5db94 Bump release version to v1.2.0 2023-01-19 12:03:46 -05:00
José Castillo Lema
493a8a245f Docker provider for node actions (#369)
* Docker provider for node actions

* Adjusted dependencies and imports

* Update config_kind.yaml

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>
2023-01-10 14:36:18 -05:00
José Castillo Lema
d76ab31155 OCM/ACM integration (#370)
* OCM support for ManagedClusters

* Updated docs and general adjustments

* Improved docs

* Improved docs2

* Removed io packet import

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Removed time from imports

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Removed duplicate logging import

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Removed sys import

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Update run.py

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>
2023-01-10 08:58:17 -05:00
dependabot[bot]
bed40b0c6a Bump setuptools from 63.4.1 to 65.5.1
Bumps [setuptools](https://github.com/pypa/setuptools) from 63.4.1 to 65.5.1.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/CHANGES.rst)
- [Commits](https://github.com/pypa/setuptools/compare/v63.4.1...v65.5.1)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-01-04 18:42:52 +05:30
Paige Rubendall
3c5c3c5665 Giving more details on configuration (#371)
* giving more details on configuration

* adding few changes
2022-12-08 11:18:42 +05:30
Tullio Sebastiani
cf7bc28a2d updated k8s/Openshift installation documentation (#359)
* Added some bits and pieces to the krkn k8s installation to make it easier

* updated k8s/Oc installation documentation

* gitignore

* doc reorg

* fixed numbering + removed italic

Co-authored-by: Tullio Sebastiani <tullio.sebastiani@x3solutions.it>
2022-11-30 23:02:17 +05:30
Paige Rubendall
4035f2724b Adding wait duration for pods (#368)
* adding wait duration for pods

* adding kube apiserver with plugin schema
2022-11-18 07:43:26 +05:30
Naga Ravi Chaitanya Elluri
6b17dbdbb3 Allow users to set the listening address
This commit provides an option for the user to set the listening address
for the signal. This also fixes a security vulnerability.

Fixes https://github.com/redhat-chaos/krkn/issues/307
2022-11-08 15:59:57 -05:00
Naga Ravi Chaitanya Elluri
1c207538b6 Use run dir instead of tmp
This commit also logs a message to handle the exception during the
node checks.

Fixes https://github.com/redhat-chaos/krkn/issues/356, https://github.com/redhat-chaos/krkn/issues/357
2022-11-08 15:46:08 -05:00
Naga Ravi Chaitanya Elluri
6ccc16a0ab Use autoescape=True to mitigate XSS vulnerabilities
Fixes https://github.com/redhat-chaos/krkn/issues/354
2022-11-08 14:34:06 -05:00
Naga Ravi Chaitanya Elluri
b9d5a7af4d Use safe loader for Yaml
This fixes the security vulnerabilities for example - it raises an
exception when opening a yaml file with code.

Fixes https://github.com/redhat-chaos/krkn/issues/352
2022-11-08 13:35:06 -05:00
Sandro Bonazzola
1c4a51cbfa refactor: use arcaflow plugin
Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-10-18 16:43:33 +02:00
Christophe LARUE
68c02135d3 Add helm and tekton examples 2022-10-18 09:41:24 -04:00
Naga Ravi Chaitanya Elluri
61700c0dc5 Bump release version to v1.1.1 2022-10-14 12:47:17 -04:00
97 changed files with 6683 additions and 744 deletions

View File

@@ -52,7 +52,7 @@ jobs:
- name: Check CI results
run: grep Fail CI/results.markdown && false || true
- name: Build the Docker images
run: docker build --no-cache -t quay.io/chaos-kubox/krkn containers/
run: docker build --no-cache -t quay.io/redhat-chaos/krkn containers/
- name: Login in quay
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
run: docker login quay.io -u ${QUAY_USER} -p ${QUAY_TOKEN}
@@ -61,7 +61,7 @@ jobs:
QUAY_TOKEN: ${{ secrets.QUAY_TOKEN_1 }}
- name: Push the Docker images
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
run: docker push quay.io/chaos-kubox/krkn
run: docker push quay.io/redhat-chaos/krkn
- name: Rebuild krkn-hub
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
uses: redhat-chaos/actions/krkn-hub@main

111
.github/workflows/functional_tests.yaml vendored Normal file
View File

@@ -0,0 +1,111 @@
on: issue_comment
jobs:
check_user:
# This job only runs for pull request comments
name: Check User Authorization
env:
USERS: ${{vars.USERS}}
if: contains(github.event.comment.body, '/funtest') && contains(github.event.comment.html_url, '/pull/')
runs-on: ubuntu-latest
steps:
- name: Check User
run: |
for name in `echo $USERS`
do
name="${name//$'\r'/}"
name="${name//$'\n'/}"
if [ $name == "${{github.event.sender.login}}" ]
then
echo "user ${{github.event.sender.login}} authorized, action started..."
exit 0
fi
done
echo "user ${{github.event.sender.login}} is not allowed to run functional tests Action"
exit 1
pr_commented:
# This job only runs for pull request comments containing /functional
name: Functional Tests
if: contains(github.event.comment.body, '/funtest') && contains(github.event.comment.html_url, '/pull/')
runs-on: ubuntu-latest
needs:
- check_user
steps:
- name: Check out Kraken
uses: actions/checkout@v3
- name: Checkout Pull Request
run: hub pr checkout ${{ github.event.issue.number }}
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Install OC CLI
uses: redhat-actions/oc-installer@v1
with:
oc_version: latest
- name: Install python 3.9
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Setup kraken dependencies
run: pip install -r requirements.txt
- name: Create Workdir & export the path
run: |
mkdir workdir
echo "WORKDIR_PATH=`pwd`/workdir" >> $GITHUB_ENV
- name: Teardown CRC (Post Action)
uses: webiny/action-post-run@3.0.0
id: post-run-command
with:
# currently using image coming from tsebastiani quay.io repo
# waiting that a fix is merged in the upstream one
# post action run cannot (apparently) be properly indented
run: docker run -v "${{ env.WORKDIR_PATH }}:/workdir" -e WORKING_MODE=T -e AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} -e AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} -e AWS_DEFAULT_REGION=us-west-2 -e TEARDOWN_RUN_ID=crc quay.io/tsebastiani/crc-cloud
- name: Run CRC
# currently using image coming from tsebastiani quay.io repo
# waiting that a fix is merged in the upstream one
run: |
docker run -v "${{ env.WORKDIR_PATH }}:/workdir" \
-e WORKING_MODE=C \
-e PULL_SECRET="${{ secrets.PULL_SECRET }}" \
-e AWS_ACCESS_KEY_ID="${{ secrets.AWS_ACCESS_KEY_ID }}" \
-e AWS_SECRET_ACCESS_KEY="${{ secrets.AWS_SECRET_ACCESS_KEY }}" \
-e AWS_DEFAULT_REGION=us-west-2 \
-e CREATE_RUN_ID=crc \
-e PASS_KUBEADMIN="${{ secrets.KUBEADMIN_PWD }}" \
-e PASS_REDHAT="${{ secrets.REDHAT_PWD }}" \
-e PASS_DEVELOPER="${{ secrets.DEVELOPER_PWD }}" \
quay.io/tsebastiani/crc-cloud
- name: OpenShift login and example deployment, GitHub Action env init
env:
NAMESPACE: test-namespace
DEPLOYMENT_NAME: test-nginx
KUBEADMIN_PWD: '${{ secrets.KUBEADMIN_PWD }}'
run: ./CI/CRC/init_github_action.sh
- name: Setup test suite
run: |
yq -i '.kraken.port="8081"' CI/config/common_test_config.yaml
yq -i '.kraken.signal_address="0.0.0.0"' CI/config/common_test_config.yaml
echo "test_app_outages_gh" > ./CI/tests/my_tests
echo "test_container" >> ./CI/tests/my_tests
echo "test_namespace" >> ./CI/tests/my_tests
echo "test_net_chaos" >> ./CI/tests/my_tests
echo "test_time" >> ./CI/tests/my_tests
- name: Print affected config files
run: |
echo -e "## CI/config/common_test_config.yaml\n\n"
cat CI/config/common_test_config.yaml
- name: Running test suite
run: |
./CI/run.sh
- name: Print test output
run: cat CI/out/*
- name: Create coverage report
run: |
echo "# Test results" > $GITHUB_STEP_SUMMARY
cat CI/results.markdown >> $GITHUB_STEP_SUMMARY
echo "# Test coverage" >> $GITHUB_STEP_SUMMARY
python -m coverage report --format=markdown >> $GITHUB_STEP_SUMMARY

6
.gitignore vendored
View File

@@ -23,6 +23,8 @@ kube_burner*
.pydevproject
.settings
.idea
.vscode
config/debug.yaml
tags
# Package files
@@ -61,3 +63,7 @@ CI/out/*
CI/ci_results
CI/scenarios/*node.yaml
CI/results.markdown
#env
chaos/*

6
.gitleaks.toml Normal file
View File

@@ -0,0 +1,6 @@
[allowlist]
description = "Global Allowlist"
paths = [
'''kraken/arcaflow_plugin/fixtures/*'''
]

44
CI/CRC/deployment.yaml Normal file
View File

@@ -0,0 +1,44 @@
apiVersion: v1
kind: Namespace
metadata:
name: $NAMESPACE
---
apiVersion: v1
kind: Service
metadata:
name: $DEPLOYMENT_NAME-service
namespace: $NAMESPACE
spec:
selector:
app: $DEPLOYMENT_NAME
ports:
- name: http
port: 80
targetPort: 8080
type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: $NAMESPACE
name: $DEPLOYMENT_NAME-deployment
spec:
replicas: 3
selector:
matchLabels:
app: $DEPLOYMENT_NAME
template:
metadata:
labels:
app: $DEPLOYMENT_NAME
spec:
containers:
- name: $DEPLOYMENT_NAME
image: nginxinc/nginx-unprivileged:stable-alpine
ports:
- name: http
containerPort: 8080

72
CI/CRC/init_github_action.sh Executable file
View File

@@ -0,0 +1,72 @@
#!/bin/bash
SCRIPT_PATH=./CI/CRC
DEPLOYMENT_PATH=$SCRIPT_PATH/deployment.yaml
CLUSTER_INFO=cluster_infos.json
[[ -z $WORKDIR_PATH ]] && echo "[ERROR] please set \$WORKDIR_PATH environment variable" && exit 1
CLUSTER_INFO_PATH=$WORKDIR_PATH/crc/$CLUSTER_INFO
[[ ! -f $DEPLOYMENT_PATH ]] && echo "[ERROR] please run $0 from GitHub action root directory" && exit 1
[[ -z $KUBEADMIN_PWD ]] && echo "[ERROR] kubeadmin password not set, please check the repository secrets" && exit 1
[[ -z $DEPLOYMENT_NAME ]] && echo "[ERROR] please set \$DEPLOYMENT_NAME environment variable" && exit 1
[[ -z $NAMESPACE ]] && echo "[ERROR] please set \$NAMESPACE environment variable" && exit 1
[[ ! -f $CLUSTER_INFO_PATH ]] && echo "[ERROR] cluster_info.json not found in $CLUSTER_INFO_PATH" && exit 1
OPENSSL=`which openssl 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: openssl missing, please install it and try again" && exit 1
OC=`which oc 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: oc missing, please install it and try again" && exit 1
SED=`which sed 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: sed missing, please install it and try again" && exit 1
JQ=`which jq 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: jq missing, please install it and try again" && exit 1
ENVSUBST=`which envsubst 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: envsubst missing, please install it and try again" && exit 1
API_ADDRESS="$($JQ -r '.api.address' $CLUSTER_INFO_PATH)"
API_PORT="$($JQ -r '.api.port' $CLUSTER_INFO_PATH)"
BASE_HOST=`$JQ -r '.api.address' $CLUSTER_INFO_PATH | sed -r 's#https:\/\/api\.(.+\.nip\.io)#\1#'`
FQN=$DEPLOYMENT_NAME.apps.$BASE_HOST
echo "[INF] logging on $API_ADDRESS:$API_PORT"
COUNTER=1
until `$OC login --insecure-skip-tls-verify -u kubeadmin -p $KUBEADMIN_PWD $API_ADDRESS:$API_PORT > /dev/null 2>&1`
do
echo "[INF] login attempt $COUNTER"
[[ $COUNTER == 20 ]] && echo "[ERR] maximum login attempts exceeded, failing" && exit 1
((COUNTER++))
sleep 10
done
echo "[INF] deploying example deployment: $DEPLOYMENT_NAME in namespace: $NAMESPACE"
$ENVSUBST < $DEPLOYMENT_PATH | $OC apply -f - > /dev/null 2>&1
echo "[INF] creating SSL self-signed certificates for route https://$FQN"
$OPENSSL genrsa -out servercakey.pem > /dev/null 2>&1
$OPENSSL req -new -x509 -key servercakey.pem -out serverca.crt -subj "/CN=$FQN/O=Red Hat Inc./C=US" > /dev/null 2>&1
$OPENSSL genrsa -out server.key > /dev/null 2>&1
$OPENSSL req -new -key server.key -out server_reqout.txt -subj "/CN=$FQN/O=Red Hat Inc./C=US" > /dev/null 2>&1
$OPENSSL x509 -req -in server_reqout.txt -days 3650 -sha256 -CAcreateserial -CA serverca.crt -CAkey servercakey.pem -out server.crt > /dev/null 2>&1
echo "[INF] creating deployment: $DEPLOYMENT_NAME public route: https://$FQN"
$OC create route --namespace $NAMESPACE edge --service=$DEPLOYMENT_NAME-service --cert=server.crt --key=server.key --ca-cert=serverca.crt --hostname="$FQN" > /dev/null 2>&1
echo "[INF] setting github action environment variables"
NODE_NAME="`$OC get nodes -o json | $JQ -r '.items[0].metadata.name'`"
COVERAGE_FILE="`pwd`/coverage.md"
echo "DEPLOYMENT_NAME=$DEPLOYMENT_NAME" >> $GITHUB_ENV
echo "DEPLOYMENT_FQN=$FQN" >> $GITHUB_ENV
echo "API_ADDRESS=$API_ADDRESS" >> $GITHUB_ENV
echo "API_PORT=$API_PORT" >> $GITHUB_ENV
echo "NODE_NAME=$NODE_NAME" >> $GITHUB_ENV
echo "NAMESPACE=$NAMESPACE" >> $GITHUB_ENV
echo "COVERAGE_FILE=$COVERAGE_FILE" >> $GITHUB_ENV
echo "[INF] deployment fully qualified name will be available in \${{ env.DEPLOYMENT_NAME }} with value $DEPLOYMENT_NAME"
echo "[INF] deployment name will be available in \${{ env.DEPLOYMENT_FQN }} with value $FQN"
echo "[INF] OCP API address will be available in \${{ env.API_ADDRESS }} with value $API_ADDRESS"
echo "[INF] OCP API port will be available in \${{ env.API_PORT }} with value $API_PORT"
echo "[INF] OCP node name will be available in \${{ env.NODE_NAME }} with value $NODE_NAME"
echo "[INF] coverage file will ve available in \${{ env.COVERAGE_FILE }} with value $COVERAGE_FILE"

View File

@@ -1,5 +1,22 @@
#!/bin/bash
set -x
MAX_RETRIES=60
OC=`which oc 2>/dev/null`
[[ $? != 0 ]] && echo "[ERROR]: oc missing, please install it and try again" && exit 1
wait_cluster_become_ready() {
COUNT=1
until `$OC get namespace > /dev/null 2>&1`
do
echo "[INF] waiting OpenShift to become ready, after $COUNT check"
sleep 3
[[ $COUNT == $MAX_RETRIES ]] && echo "[ERR] max retries exceeded, failing" && exit 1
((COUNT++))
done
}
ci_tests_loc="CI/tests/my_tests"
@@ -22,5 +39,7 @@ echo '-----------------------|--------|---------' >> $results
# Run each test
for test_name in `cat CI/tests/my_tests`
do
wait_cluster_become_ready
./CI/run_test.sh $test_name $results
wait_cluster_become_ready
done

View File

@@ -1,31 +0,0 @@
---
kind: Pod
apiVersion: v1
metadata:
name: hello-pod
creationTimestamp:
labels:
name: hello-openshift
spec:
containers:
- name: hello-openshift
image: openshift/hello-openshift
ports:
- containerPort: 5050
protocol: TCP
resources: {}
volumeMounts:
- name: tmp
mountPath: "/tmp"
terminationMessagePath: "/dev/termination-log"
imagePullPolicy: IfNotPresent
securityContext:
capabilities: {}
privileged: false
volumes:
- name: tmp
emptyDir: {}
restartPolicy: Always
dnsPolicy: ClusterFirst
serviceAccount: ''
status: {}

View File

@@ -1,6 +0,0 @@
# yaml-language-server: $schema=../../scenarios/plugin.schema.json
- id: kill-pods
config:
label_selector: name=hello-openshift
namespace_pattern: ^default$
kill: 1

View File

@@ -1,7 +0,0 @@
scenarios:
- action: delete
namespace: "^.*ingress.*$"
label_selector:
runs: 1
sleep: 15
wait_time: 30

View File

@@ -0,0 +1,7 @@
scenarios:
- action: delete
namespace: "^$openshift-network-diagnostics$"
label_selector:
runs: 1
sleep: 15
wait_time: 30

View File

@@ -1,7 +1,20 @@
apiVersion: v1
kind: Namespace
metadata:
labels:
kubernetes.io/metadata.name: kraken
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/enforce-version: v1.24
pod-security.kubernetes.io/warn: privileged
security.openshift.io/scc.podSecurityLabelSync: "false"
name: kraken
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: kraken-test-pv
namespace: kraken
labels:
type: local
spec:
@@ -17,6 +30,7 @@ apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: kraken-test-pvc
namespace: kraken
spec:
storageClassName: manual
accessModes:
@@ -29,6 +43,7 @@ apiVersion: v1
kind: Pod
metadata:
name: kraken-test-pod
namespace: kraken
spec:
volumes:
- name: kraken-test-pv

21
CI/tests/test_app_outages_gh.sh Executable file
View File

@@ -0,0 +1,21 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_app_outage {
[ -z $DEPLOYMENT_NAME ] && echo "[ERR] DEPLOYMENT_NAME variable not set, failing." && exit 1
yq -i '.application_outage.pod_selector={"app":"'$DEPLOYMENT_NAME'"}' CI/scenarios/app_outage.yaml
yq -i '.application_outage.namespace="'$NAMESPACE'"' CI/scenarios/app_outage.yaml
export scenario_type="application_outages"
export scenario_file="CI/scenarios/app_outage.yaml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/app_outage.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/app_outage.yaml
echo "App outage scenario test: Success"
}
functional_test_app_outage

20
CI/tests/test_cpu_hog_gh.sh Executable file
View File

@@ -0,0 +1,20 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_cpu {
[ -z $NODE_NAME ] && echo "[ERR] NODE_NAME variable not set, failing." && exit 1
yq -i ' .spec.experiments = [{"name": "node-cpu-hog", "spec":{"components":{"env":[{"name":"TOTAL_CHAOS_DURATION","value":"10"},{"name":"NODE_CPU_CORE","value":"1"},{"name":"NODES_AFFECTED_PERC","value":"30"},{"name":"TARGET_NODES","value":"'$NODE_NAME'"}]}}}]' CI/scenarios/node_cpu_hog_engine_node.yaml
cp CI/config/common_test_config.yaml CI/config/litmus_config.yaml
yq '.kraken.chaos_scenarios = [{"litmus_scenarios":[["scenarios/openshift/templates/litmus-rbac.yaml","CI/scenarios/node_cpu_hog_engine_node.yaml"]]}]' -i CI/config/litmus_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
}
functional_test_litmus_cpu

19
CI/tests/test_io_hog_gh.sh Executable file
View File

@@ -0,0 +1,19 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_io {
[ -z $NODE_NAME ] && echo "[ERR] NODE_NAME variable not set, failing." && exit 1
yq -i ' .spec.experiments = [{"name": "node-io-stress", "spec":{"components":{"env":[{"name":"TOTAL_CHAOS_DURATION","value":"10"},{"name":"FILESYSTEM_UTILIZATION_PERCENTAGE","value":"100"},{"name":"CPU","value":"1"},{"name":"NUMBER_OF_WORKERS","value":"3"},{"name":"TARGET_NODES","value":"'$NODE_NAME'"}]}}}]' CI/scenarios/node_io_engine_node.yaml
cp CI/config/common_test_config.yaml CI/config/litmus_config.yaml
yq '.kraken.chaos_scenarios = [{"litmus_scenarios":[["scenarios/openshift/templates/litmus-rbac.yaml","CI/scenarios/node_io_engine_node.yaml"]]}]' -i CI/config/litmus_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
}
functional_test_litmus_io

19
CI/tests/test_mem_hog_gh.sh Executable file
View File

@@ -0,0 +1,19 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function functional_test_litmus_mem {
[ -z $NODE_NAME ] && echo "[ERR] NODE_NAME variable not set, failing." && exit 1
yq -i ' .spec.experiments = [{"name": "node-io-stress", "spec":{"components":{"env":[{"name":"TOTAL_CHAOS_DURATION","value":"10"},{"name":"CPU","value":"1"},{"name":"TARGET_NODES","value":"'$NODE_NAME'"}]}}}]' CI/scenarios/node_mem_engine_node.yaml
cp CI/config/common_test_config.yaml CI/config/litmus_config.yaml
yq '.kraken.chaos_scenarios = [{"litmus_scenarios":[["scenarios/openshift/templates/litmus-rbac.yaml","CI/scenarios/node_mem_engine_node.yaml"]]}]' -i CI/config/litmus_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/litmus_config.yaml
echo "Litmus scenario test: Success"
}
functional_test_litmus_mem

View File

@@ -7,10 +7,10 @@ trap finish EXIT
function funtional_test_namespace_deletion {
export scenario_type="namespace_scenarios"
export scenario_file="- CI/scenarios/ingress_namespace.yaml"
export scenario_file="- CI/scenarios/network_diagnostics_namespace.yaml"
export post_config=""
yq '.scenarios.[0].namespace="^openshift-network-diagnostics$"' -i CI/scenarios/network_diagnostics_namespace.yaml
envsubst < CI/config/common_test_config.yaml > CI/config/namespace_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/namespace_config.yaml
echo $?
echo "Namespace scenario test: Success"

View File

@@ -1,19 +0,0 @@
set -xeEo pipefail
source CI/tests/common.sh
trap error ERR
trap finish EXIT
function funtional_test_pod_deletion {
export scenario_type="pod_scenarios"
export scenario_file="- CI/scenarios/hello_pod_killing.yml"
export post_config=""
envsubst < CI/config/common_test_config.yaml > CI/config/pod_config.yaml
python3 -m coverage run -a run_kraken.py -c CI/config/pod_config.yaml
echo $?
echo "Pod scenario test: Success"
}
funtional_test_pod_deletion

View File

@@ -1,5 +1,5 @@
# Krkn aka Kraken
[![Docker Repository on Quay](https://quay.io/repository/chaos-kubox/krkn/status "Docker Repository on Quay")](https://quay.io/repository/chaos-kubox/krkn?tab=tags&tag=latest)
[![Docker Repository on Quay](https://quay.io/repository/redhat-chaos/krkn/status "Docker Repository on Quay")](https://quay.io/repository/redhat-chaos/krkn?tab=tags&tag=latest)
![Krkn logo](media/logo.png)
@@ -56,19 +56,20 @@ Instructions on how to setup the config and the options supported can be found a
### Kubernetes/OpenShift chaos scenarios supported
Scenario type | Kubernetes | OpenShift
--------------------------- | ------------- | -------------------- |
Scenario type | Kubernetes | OpenShift
--------------------------- | ------------- |--------------------|
[Pod Scenarios](docs/pod_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Container Scenarios](docs/container_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Node Scenarios](docs/node_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Time Scenarios](docs/time_scenarios.md) | :x: | :heavy_check_mark: |
[Litmus Scenarios](docs/litmus_scenarios.md) | :x: | :heavy_check_mark: |
[Hog Scenarios](docs/arcaflow_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Cluster Shut Down Scenarios](docs/cluster_shut_down_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Namespace Scenarios](docs/namespace_scenarios.md) | :heavy_check_mark: | :heavy_check_mark: |
[Zone Outage Scenarios](docs/zone_outage.md) | :heavy_check_mark: | :heavy_check_mark: |
[Application_outages](docs/application_outages.md) | :heavy_check_mark: | :heavy_check_mark: |
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: | :heavy_check_mark: |
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: | :heavy_check_mark: |
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: | :question: |
### Kraken scenario pass/fail criteria and report
@@ -96,19 +97,22 @@ Kraken supports capturing metrics for the duration of the scenarios defined in t
### Alerts
In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with the Prometheus expressions to validate and alerts, exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics. Information on enabling and leveraging this feature can be found [here](docs/alerts.md).
### OCM / ACM integration
Kraken supports injecting faults into [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management) managed clusters through [ManagedCluster Scenarios](docs/managedcluster_scenarios.md).
### Blogs and other useful resources
- Blog post on introduction to Kraken: https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes
- Discussion and demo on how Kraken can be leveraged to ensure OpenShift is reliable, performant and scalable: https://www.youtube.com/watch?v=s1PvupI5sD0&ab_channel=OpenShift
- Blog post emphasizing the importance of making Chaos part of Performance and Scale runs to mimic the production environments: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests
- Blog post on findings from Chaos test runs: https://cloud.redhat.com/blog/openshift/kubernetes-chaos-stories
### Roadmap
Following is a list of enhancements that we are planning to work on adding support in Kraken. Of course any help/contributions are greatly appreciated.
- [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/redhat-chaos/krkn/issues/124)
- Continue to improve [Chaos Testing Guide](https://cloud-bulldozer.github.io/kraken/) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- Support for running Kraken on Kubernetes distribution - see https://github.com/redhat-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- Sweet logo for Kraken - see https://github.com/redhat-chaos/krkn/issues/195
- Continue to improve [Chaos Testing Guide](https://redhat-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/redhat-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
### Contributions

View File

@@ -2,21 +2,28 @@ kraken:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
port: 8081 # Signal port
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
- scenarios/arcaflow/io-hog/input.yaml
- scenarios/arcaflow/memory-hog/input.yaml
- container_scenarios: # List of chaos pod scenarios to load
- - scenarios/openshift/container_etcd.yml
- plugin_scenarios:
- scenarios/openshift/etcd.yml
- scenarios/openshift/regex_openshift_pod_kill.yml
- scenarios/openshift/vmware_node_scenarios.yml
- scenarios/openshift/ibmcloud_node_scenarios.yml
- scenarios/openshift/network_chaos_ingress.yml
- scenarios/openshift/pod_network_outage.yml
- node_scenarios: # List of chaos node scenarios to load
- scenarios/openshift/node_scenarios_example.yml
- plugin_scenarios:
@@ -64,7 +71,7 @@ performance_monitoring:
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path to alert profile with the prometheus queries
check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios

40
config/config_kind.yaml Normal file
View File

@@ -0,0 +1,40 @@
kraken:
distribution: kubernetes # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
litmus_install: True # Installs specified version, set to False if it's already setup
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts
chaos_scenarios: # List of policies/chaos scenarios to load
- plugin_scenarios:
- scenarios/kind/scheduler.yml
- node_scenarios:
- scenarios/kind/node_scenarios_example.yml
cerberus:
cerberus_enabled: False # Enable it when cerberus is previously installed
cerberus_url: # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
check_applicaton_routes: False # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run
performance_monitoring:
deploy_dashboards: False # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"
kube_burner_binary_url: "https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.9.1/kube-burner-0.9.1-Linux-x86_64.tar.gz"
capture_metrics: False
config_path: config/kube_burner.yaml # Define the Elasticsearch url and index name in this config
metrics_profile_path: config/metrics-aggregated.yaml
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path to alert profile with the prometheus queries
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios
daemon_mode: False # Iterations are set to infinity which means that the kraken will cause chaos forever

View File

@@ -32,7 +32,7 @@ performance_monitoring:
uuid: # uuid for the run is generated by default if not set
enable_alerts: False # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
alert_profile: config/alerts # Path to alert profile with the prometheus queries
check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos after soak time for the cluster to settle down
tunings:
wait_duration: 60 # Duration to wait between each chaos scenario
iterations: 1 # Number of times to execute the scenarios

View File

@@ -2,9 +2,10 @@ kraken:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
exit_on_failure: False # Exit when a post action scenario fails
port: 8081
publish_kraken_status: True # Can be accessed at http://0.0.0.0:8081
signal_state: RUN # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
signal_address: 0.0.0.0 # Signal listening address
port: 8081 # Signal port
litmus_version: v1.13.6 # Litmus version to install
litmus_uninstall: False # If you want to uninstall litmus if failure
litmus_uninstall_before_run: True # If you want to uninstall litmus before a new run starts

View File

@@ -21,7 +21,7 @@ COPY --from=azure-cli /usr/local/bin/az /usr/bin/az
RUN yum install epel-release -y && \
yum install -y git python39 python3-pip jq gettext && \
python3.9 -m pip install -U pip && \
git clone https://github.com/redhat-chaos/krkn.git --branch main /root/kraken && \
git clone https://github.com/redhat-chaos/krkn.git --branch v1.2.0 /root/kraken && \
mkdir -p /root/.kube && cd /root/kraken && \
pip3.9 install -r requirements.txt

View File

@@ -1,30 +1,53 @@
### Kraken image
Container image gets automatically built by quay.io at [Kraken image](https://quay.io/chaos-kubox/krkn).
Container image gets automatically built by quay.io at [Kraken image](https://quay.io/redhat-chaos/krkn).
### Run containerized version
Refer [instructions](https://github.com/redhat-chaos/krkn/blob/main/docs/installation.md#run-containerized-version) for information on how to run the containerized version of kraken.
### Run Custom Kraken Image
Refer to [instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/build_own_image-README.md) for information on how to run a custom containerized version of kraken using podman.
### Kraken as a KubeApp
#### GENERAL NOTES:
- It is generally not recommended to run Kraken inside the cluster, as the pod running Kraken might itself get disrupted; the suggested use case for running kraken from inside k8s/OpenShift is to target **another** cluster (e.g. to bypass network restrictions or to leverage the cluster's computational resources)
- your kubeconfig might contain several cluster contexts and credentials, so be sure, before creating the ConfigMap, to keep **only** the credentials related to the destination cluster. Please refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) for more details
- to add privileges to the service account you must be logged in to the cluster with a highly privileged account (ideally kubeadmin)
To run containerized Kraken as a Kubernetes/OpenShift Deployment, follow these steps:
1. Configure the [config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) file according to your requirements.
**NOTE**: both scenarios ConfigMaps are needed regardless of whether you're running kraken in Kubernetes or OpenShift
2. Create a namespace under which you want to run the kraken pod using `kubectl create ns <namespace>`.
3. Switch to `<namespace>` namespace:
- In Kubernetes, use `kubectl config set-context --current --namespace=<namespace>`
- In OpenShift, use `oc project <namespace>`
4. Create a ConfigMap named kube-config using `kubectl create configmap kube-config --from-file=<path_to_kubeconfig>`
5. Create a ConfigMap named kraken-config using `kubectl create configmap kraken-config --from-file=<path_to_kraken_config>`
6. Create a ConfigMap named scenarios-config using `kubectl create configmap scenarios-config --from-file=<path_to_scenarios_folder>`
7. Create a ConfigMap named scenarios-openshift-config using `kubectl create configmap scenarios-openshift-config --from-file=<path_to_scenarios_openshift_folder>`
8. Create a ConfigMap named scenarios-kube-config using `kubectl create configmap scenarios-kube-config --from-file=<path_to_scenarios_kube_folder>`
- In Kubernetes, use `kubectl config set-context --current --namespace=<namespace>`
- In OpenShift, use `oc project <namespace>`
4. Create a ConfigMap named kube-config using `kubectl create configmap kube-config --from-file=<path_to_kubeconfig>` *(eg. ~/.kube/config)*
5. Create a ConfigMap named kraken-config using `kubectl create configmap kraken-config --from-file=<path_to_kraken>/config`
6. Create a ConfigMap named scenarios-config using `kubectl create configmap scenarios-config --from-file=<path_to_kraken>/scenarios`
7. Create a ConfigMap named scenarios-openshift-config using `kubectl create configmap scenarios-openshift-config --from-file=<path_to_kraken>/scenarios/openshift`
8. Create a ConfigMap named scenarios-kube-config using `kubectl create configmap scenarios-kube-config --from-file=<path_to_kraken>/scenarios/kube`
9. Create a service account to run the kraken pod `kubectl create serviceaccount useroot`.
10. In Openshift, add privileges to service account and execute `oc adm policy add-scc-to-user privileged -z useroot`.
11. Create a Job using `kubectl apply -f kraken.yml` and monitor the status using `oc get jobs` and `oc get pods`.
NOTE: It is not recommended to run Kraken internal to the cluster as the pod which is running Kraken might get disrupted.
10. In Openshift, add privileges to service account and execute `oc adm policy add-scc-to-user privileged -z useroot`.
11. Create a Job using `kubectl apply -f <path_to_kraken>/containers/kraken.yml` and monitor the status using `oc get jobs` and `oc get pods`.

View File

@@ -16,7 +16,7 @@ spec:
- name: kraken
securityContext:
privileged: true
image: quay.io/chaos-kubox/krkn
image: quay.io/redhat-chaos/krkn
command: ["/bin/sh", "-c"]
args: ["python3.9 run_kraken.py -c config/config.yaml"]
volumeMounts:

View File

@@ -1,6 +1,17 @@
## Alerts
Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports alerting based on the queries defined by the user and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:
Pass/fail based on metrics captured from the cluster is important in addition to checking the health status and recovery. Kraken supports:
### Checking for critical alerts
If enabled, the check runs at the end of each scenario, and Kraken exits if critical alerts are firing to allow the user to debug. You can enable it in the config:
```
performance_monitoring:
check_critical_alerts: False # When enabled will check prometheus for critical alerts firing post chaos
```
### Alerting based on the queries defined by the user
Takes PromQL queries as input and modifies the return code of the run to determine pass/fail. It's especially useful in case of automated runs in CI where user won't be able to monitor the system. It uses [Kube-burner](https://kube-burner.readthedocs.io/en/latest/) under the hood. This feature can be enabled in the [config](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml) by setting the following:
```
performance_monitoring:
@@ -11,7 +22,7 @@ performance_monitoring:
alert_profile: config/alerts # Path to alert profile with the prometheus queries.
```
### Alert profile
#### Alert profile
A couple of [alert profiles](https://github.com/redhat-chaos/krkn/tree/main/config) [alerts](https://github.com/redhat-chaos/krkn/blob/main/config/alerts) are shipped by default and can be tweaked to add more queries to alert on. The following are a few alerts examples:
```

View File

@@ -0,0 +1,68 @@
## Arcaflow Scenarios
Arcaflow is a workflow engine, currently in development, which provides the ability to execute workflow steps in sequence, in parallel, repeatedly, etc. The main difference from competitors such as Netflix Conductor is the ability to run ad-hoc workflows without requiring an infrastructure setup.
The engine uses containers to execute plugins and runs them either locally in Docker/Podman or remotely on a Kubernetes cluster. The workflow system is strongly typed and allows for generating JSON schema and OpenAPI documents for all data formats involved.
### Available Scenarios
#### Hog scenarios:
- [CPU Hog](arcaflow_scenarios/cpu_hog.md)
- [Memory Hog](arcaflow_scenarios/memory_hog.md)
- [I/O Hog](arcaflow_scenarios/io_hog.md)
### Prerequisites
Arcaflow supports three deployment technologies:
- Docker
- Podman
- Kubernetes
#### Docker
In order to run Arcaflow Scenarios with the Docker deployer, be sure that:
- Docker is correctly installed in your Operating System (to find instructions on how to install docker please refer to [Docker Documentation](https://www.docker.com/))
- The Docker daemon is running
#### Podman
The podman deployer is built around the podman CLI and doesn't necessarily need to be run alongside the podman daemon.
To run Arcaflow Scenarios in your operating system, be sure that:
- podman is correctly installed in your Operating System (to find instructions on how to install podman refer to [Podman Documentation](https://podman.io/))
- the podman CLI is in your shell PATH
#### Kubernetes
The kubernetes deployer integrates the Kubernetes API client directly and needs only a valid kubeconfig file and a reachable Kubernetes/OpenShift cluster.
### Usage
To enable arcaflow scenarios, edit the kraken config file: go to the `kraken -> chaos_scenarios` section of the yaml structure,
add a new element to the list named `arcaflow_scenarios`, and then add the desired scenario
pointing to its `input.yaml` file.
```
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
```
#### input.yaml
The implemented scenarios can be found in the *scenarios/arcaflow/<scenario_name>* folder.
The entrypoint of each scenario is the *input.yaml* file,
which contains all the options to set up the scenario according to the desired target.
### config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level; a minimal sketch follows the lists below.
The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
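A minimal config.yaml sketch reflecting the options above (key layout and values are illustrative assumptions; check the config.yaml shipped with each scenario for the exact schema):
```
# illustrative arcaflow config sketch
deployer:
  type: podman   # one of: docker, podman, kubernetes
log:
  level: info    # one of: debug, info, warning, error
```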
### workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target.
Each step is represented by a container, executed by the deployer, along with its options.
Note that we provide the scenarios as templates, but they can be adapted to define more complex workflows.
For more details on the arcaflow workflow architecture and syntax, refer to the [Arcaflow Documentation](https://arcalot.io/arcaflow/).

View File

@@ -0,0 +1,19 @@
# CPU Hog
This scenario is based on the arcaflow [arcaflow-plugin-stressng](https://github.com/arcalot/arcaflow-plugin-stressng) plugin.
The purpose of this scenario is to create cpu pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
To enable this plugin add the pointer to the scenario input file `scenarios/arcaflow/cpu-hog/input.yaml` as described in the
Usage section.
This scenario takes a list of objects named `input_list` with the following properties:
- **kubeconfig :** *string* the kubeconfig needed by the deployer to deploy the stress-ng plugin in the target cluster
- **namespace :** *string* the namespace where the scenario container will be deployed
**Note:** this parameter will be automatically filled by kraken if the `kubeconfig_path` property is correctly set
- **node_selector :** *key-value map* the node label that will be used as `nodeSelector` by the pod to target a specific cluster node
- **duration :** *string* stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- **cpu_count :** *int* the number of CPU cores to be used (0 means all)
- **cpu_method :** *string* a fine-grained control of which cpu stressors to use (ackermann, cfloat etc. see [manpage](https://manpages.org/sysbench) for all the cpu_method options)
- **cpu_load_percentage :** *int* the CPU load by percentage
To perform several load tests simultaneously in the same run (e.g. stress two or more nodes at once), add another item
to the `input_list` with the same properties (and possibly different values, e.g. different node_selectors
to schedule the pods on different nodes). To reduce (or increase) the parallelism, change the `parallelism` value in the `workload.yaml` file.
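A hypothetical input.yaml sketch assembled from the properties above (node label and values are placeholders; kubeconfig is omitted because kraken fills it in automatically when kubeconfig_path is set):
```
input_list:
  - namespace: default
    node_selector:
      kubernetes.io/hostname: worker-0   # placeholder node label
    duration: 60s
    cpu_count: 1
    cpu_method: all                      # placeholder; see the manpage for the available stressor methods
    cpu_load_percentage: 80
```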

View File

@@ -0,0 +1,21 @@
# I/O Hog
This scenario is based on the arcaflow [arcaflow-plugin-stressng](https://github.com/arcalot/arcaflow-plugin-stressng) plugin.
The purpose of this scenario is to create disk pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
The scenario allows attaching a node path to the pod as a `hostPath` volume.
To enable this plugin add the pointer to the scenario input file `scenarios/arcaflow/io-hog/input.yaml` as described in the
Usage section.
This scenario takes a list of objects named `input_list` with the following properties:
- **kubeconfig :** *string* the kubeconfig needed by the deployer to deploy the stress-ng plugin in the target cluster
- **namespace :** *string* the namespace where the scenario container will be deployed
**Note:** this parameter will be automatically filled by kraken if the `kubeconfig_path` property is correctly set
- **node_selector :** *key-value map* the node label that will be used as `nodeSelector` by the pod to target a specific cluster node
- **duration :** *string* stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- **target_pod_folder :** *string* the path in the pod where the volume is mounted
- **target_pod_volume :** *object* the `hostPath` volume definition in the [Kubernetes/OpenShift](https://docs.openshift.com/container-platform/3.11/install_config/persistent_storage/using_hostpath.html) format, that will be attached to the pod as a volume
- **io_write_bytes :** *string* writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g
- **io_block_size :** *string* size of each write in bytes. Size can be from 1 byte to 4m.
To perform several load tests simultaneously in the same run (e.g. stress two or more nodes at once), add another item
to the `input_list` with the same properties (and possibly different values, e.g. different node_selectors
to schedule the pods on different nodes). To reduce (or increase) the parallelism, change the `parallelism` value in the `workload.yaml` file.
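As with the CPU hog, a hypothetical input.yaml sketch for this scenario (paths, volume definition and sizes are placeholders):
```
input_list:
  - namespace: default
    node_selector:
      kubernetes.io/hostname: worker-0   # placeholder node label
    duration: 60s
    target_pod_folder: /hog-data         # path inside the pod where the volume is mounted
    target_pod_volume:                   # hostPath volume attached to the pod
      name: node-volume
      hostPath:
        path: /tmp
    io_write_bytes: 10m
    io_block_size: 1m
```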

View File

@@ -0,0 +1,18 @@
# Memory Hog
This scenario is based on the arcaflow [arcaflow-plugin-stressng](https://github.com/arcalot/arcaflow-plugin-stressng) plugin.
The purpose of this scenario is to create Virtual Memory pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
To enable this plugin add the pointer to the scenario input file `scenarios/arcaflow/memory-hog/input.yaml` as described in the
Usage section.
This scenario takes a list of objects named `input_list` with the following properties:
- **kubeconfig :** *string* the kubeconfig needed by the deployer to deploy the stress-ng plugin in the target cluster
- **namespace :** *string* the namespace where the scenario container will be deployed
**Note:** this parameter will be automatically filled by kraken if the `kubeconfig_path` property is correctly set
- **node_selector :** *key-value map* the node label that will be used as `nodeSelector` by the pod to target a specific cluster node
- **duration :** *string* stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- **vm_bytes :** *string* N bytes per vm process or percentage of memory used (using the % symbol). The size can be expressed in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
- **vm_workers :** *int* Number of VM stressors to be run (0 means 1 stressor per CPU)
To perform several load tests simultaneously in the same run (e.g. stress two or more nodes at once), add another item
to the `input_list` with the same properties (and possibly different values, e.g. different node_selectors
to schedule the pods on different nodes). To reduce (or increase) the parallelism, change the `parallelism` value in the `workload.yaml` file.
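And a hypothetical input.yaml sketch for the memory hog (values are placeholders):
```
input_list:
  - namespace: default
    node_selector:
      kubernetes.io/hostname: worker-0   # placeholder node label
    duration: 60s
    vm_bytes: 10%                        # percentage of memory per vm stressor
    vm_workers: 2
```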

View File

@@ -1,11 +1,12 @@
Supported Cloud Providers:
* [AWS](#aws)
* [GCP](#gcp)
* [Openstack](#openstack)
* [Azure](#azure)
* [Alibaba](#alibaba)
* [VMware](#vmware)
- [AWS](#aws)
- [GCP](#gcp)
- [Openstack](#openstack)
- [Azure](#azure)
- [Alibaba](#alibaba)
- [VMware](#vmware)
- [IBMCloud](#ibmcloud)
## AWS
@@ -65,4 +66,24 @@ Set the following environment variables
3. ```export VSPHERE_PASSWORD=<vSphere_client_password>```
These are the credentials that you would normally use to access the vSphere client.
These are the credentials that you would normally use to access the vSphere client.
## IBMCloud
If no api key is set up with proper VPC resource permissions, use the following to create:
* Access group
* Service id with the following access
* With policy **VPC Infrastructure Services**
* Resources = All
* Roles:
* Editor
* Administrator
* Operator
* Viewer
* API Key
Set the following environment variables
1. ```export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1```
2. ```export IBMC_APIKEY=<ibmcloud_api_key>```

View File

@@ -2,3 +2,73 @@
Set the scenarios to inject and the tunings like duration to wait between each scenario in the config file located at [config/config.yaml](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml).
**NOTE**: [config](https://github.com/redhat-chaos/krkn/blob/main/config/config_performance.yaml) can be used if leveraging the [automated way](https://github.com/redhat-chaos/krkn#setting-up-infrastructure-dependencies) to install the infrastructure pieces.
Config components:
* [Kraken](#kraken)
* [Cerberus](#cerberus)
* [Performance Monitoring](#performance-monitoring)
* [Tunings](#tunings)
# Kraken
This section defines the scenarios and the data specific to the chaos run
## Distribution
Either **openshift** or **kubernetes** depending on the type of cluster you want to run chaos on.
The prometheus url/route and bearer token are automatically obtained in the case of OpenShift; please set them when the distribution is Kubernetes.
## Exit on failure
**exit_on_failure**: Exit when a post action check or cerberus run fails
## Publish kraken status
**publish_kraken_status**: Can be accessed at http://0.0.0.0:8081 (or whichever signal_address and port you set in the signal address section)
**signal_state**: State you want kraken to start at; when set to PAUSE, kraken waits for the RUN signal before running a chaos iteration. Refer to [signal.md](signal.md) for more details
## Signal Address
**signal_address**: Address to listen/post the signal state to
**port**: port to listen/post the signal state to
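Put together, the signal-related settings look like this in config.yaml (values match the default config shown in the diff above):
```
kraken:
  publish_kraken_status: True   # expose kraken status on the signal address/port
  signal_state: RUN
  signal_address: 0.0.0.0
  port: 8081
```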
## Litmus Variables
Litmus installation specifics if you are running one of the hog scenarios. See [litmus doc](litmus_scenarios.md) for more information on these types of scenarios
**litmus_install**: Installs specified version of litmus, set to False if it's already setup
**litmus_version**: Litmus version to install
**litmus_uninstall**: If you want to uninstall litmus on failure
**litmus_uninstall_before_run**: If you want to uninstall litmus before a new run starts, True or False values
## Chaos Scenarios
**chaos_scenarios**: List of different types of chaos scenarios you want to run with paths to their specific yaml file configurations
If a scenario has a post action check script, it will be run before and after each scenario to validate that the component under test starts and ends in the same state
Currently the scenarios are run one after another (in sequence) and kraken will exit if one of the scenarios fails, without moving on to the next one
Chaos scenario types:
- container_scenarios
- plugin_scenarios
- node_scenarios
- time_scenarios
- litmus_scenarios
- cluster_shut_down_scenarios
- namespace_scenarios
- zone_outages
- application_outages
- pvc_scenarios
- network_chaos
# Cerberus
Parameters to set for enabling cerberus checks at the end of each executed scenario. The given URL will be pinged after the scenario and post action check have been completed for each scenario and iteration.
**cerberus_enabled**: Enable it when cerberus is already installed
**cerberus_url**: When cerberus_enabled is set to True, provide the url where cerberus publishes the go/no-go signal
**check_applicaton_routes**: When enabled, will look for application unavailability using the routes specified in the cerberus config and fail the run
# Performance Monitoring
There are 2 main sections defined in this part of the config, [metrics](metrics.md) and [alerts](alerts.md); read more about each of these configurations in their respective docs
# Tunings
**wait_duration**: Duration to wait between each chaos scenario
**iterations**: Number of times to execute the scenarios
**daemon_mode**: True or False; if True, iterations are set to infinity, which means that kraken will cause chaos forever and the number of iterations is ignored
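As an illustration, a minimal sketch of how these sections might fit together in `config.yaml` is shown below; the section keys mirror the headings above, the `prometheus_*` keys are named after the parameters referenced in this section, and the scenario paths and values are placeholders. The config files linked above remain the authoritative reference.
```yaml
kraken:
    distribution: openshift                       # or kubernetes
    prometheus_url:                               # set when the distribution is kubernetes; discovered automatically on OpenShift
    prometheus_bearer_token:
    exit_on_failure: False                        # exit when a post action check or cerberus check fails
    publish_kraken_status: True                   # expose the kraken status on signal_address:port
    signal_state: RUN                             # set to PAUSE to wait for the RUN signal
    signal_address: 0.0.0.0
    port: 8081
    chaos_scenarios:                              # scenario types with paths to their yaml configurations
        - node_scenarios:
            - scenarios/openshift/ibmcloud_node_scenarios.yml
        - plugin_scenarios:
            - scenarios/<your_scenario>.yml       # placeholder path
cerberus:
    cerberus_enabled: False                       # set to True when cerberus is already running
    cerberus_url:                                 # URL where cerberus publishes the go/no-go signal
    check_applicaton_routes: False
performance_monitoring:
    # metrics and alerts profiles, see metrics.md and alerts.md
tunings:
    wait_duration: 60                             # seconds to wait between scenarios
    iterations: 1
    daemon_mode: False
```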

View File

@@ -14,7 +14,9 @@ scenarios:
container_name: "<specific container name>" # This is optional, can take out and will kill all containers in all pods found under namespace and label
pod_names: # This is optional, can take out and will select all pods with given namespace and label
- <pod_name>
retry_wait: <number of seconds to wait for container to be running again> (defaults to 120 seconds)
count: <number of containers to disrupt, default=1>
action: <Action to run. For example kill 1 (hang up) or kill 9. Default is set to kill 1>
expected_recovery_time: <number of seconds to wait for container to be running again> (defaults to 120 seconds)
```
#### Post Action
@@ -34,5 +36,5 @@ See [scenarios/post_action_etcd_container.py](https://github.com/redhat-chaos/kr
containers that were killed as well as the namespaces and pods to verify all containers that were affected recover properly.
```
retry_wait: <seconds to wait for container to recover>
expected_recovery_time: <seconds to wait for container to recover>
```

View File

@@ -10,7 +10,9 @@
* [Cluster recovery checks, metrics evaluation and pass/fail criteria](#cluster-recovery-checks-metrics-evaluation-and-passfail-criteria)
* [Scenarios](#scenarios)
* [Test Environment Recommendations - how and where to run chaos tests](#test-environment-recommendations---how-and-where-to-run-chaos-tests)
* [Chaos testing in Practice within the OpenShift Organization](#chaos-testing-in-practice-within-the-OpenShift-Organization)
* [Chaos testing in Practice](#chaos-testing-in-practice)
* [OpenShift organization](#openshift-organization)
* [startx-lab](#startx-lab)
### Introduction
@@ -207,8 +209,9 @@ Let us take a look at few recommendations on how and where to run the chaos test
- You might have existing test cases, be it related to Performance, Scalability or QE. Run the chaos in the background during the test runs to observe the impact. Signaling feature in Kraken can help with coordinating the chaos runs i.e., start, stop, pause the scenarios based on the state of the other test jobs.
#### Chaos testing in Practice within the OpenShift Organization
#### Chaos testing in Practice
##### OpenShift organization
Within the OpenShift organization we use kraken to perform chaos testing throughout a release before the code is available to customers.
1. We execute kraken during our regression test suite.
@@ -226,3 +229,83 @@ Within the OpenShift organization we use kraken to perform chaos testing through
iii. This test can be seen here: https://github.com/openshift/svt/tree/master/reliability-v2
3. We are starting to add in test cases that perform chaos testing during an upgrade (not many iterations of this have been completed).
##### startx-lab
**NOTE**: Requests for enhancements and any issues need to be filed at the mentioned links given that they are not natively supported in Kraken.
The following content covers the implementation details around how Startx is leveraging Kraken:
* Using kraken as part of a tekton pipeline
You can find on [artifacthub.io](https://artifacthub.io/packages/search?kind=7&ts_query_web=kraken) the
[kraken-scenario](https://artifacthub.io/packages/tekton-task/startx-tekton-catalog/kraken-scenario) `tekton-task`
which can be used to start a kraken chaos scenario as part of a chaos pipeline.
To use this task, you must have:
- Openshift Pipelines enabled (or the tekton CRDs loaded for Kubernetes clusters)
- 1 Secret named `kraken-aws-creds` for scenarios using AWS
- 1 ConfigMap named `kraken-kubeconfig` with credentials to the targeted cluster
- 1 ConfigMap named `kraken-config-example` with the kraken configuration file (config.yaml)
- 1 ConfigMap named `kraken-common-example` with all kraken related files
- The `pipeline` SA authorized to run with the privileged SCC
You can create these resources using the following sequence:
```bash
oc project default
oc adm policy add-scc-to-user privileged -z pipeline
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/common.yaml
```
Then you must change the content of the `kraken-aws-creds` secret and of the `kraken-kubeconfig` and `kraken-config-example` configMaps
to reflect your cluster configuration. Refer to the [kraken configuration](https://github.com/redhat-chaos/krkn/blob/main/config/config.yaml)
and [configuration examples](https://github.com/startxfr/tekton-catalog/blob/stable/task/kraken-scenario/0.1/samples/)
for details on how to configure these resources.
* Start as a single taskrun
```bash
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/taskrun.yaml
```
* Start as a pipelinerun
```bash
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/pipelinerun.yaml
```
* Deploying kraken using a helm-chart
You can find on [artifacthub.io](https://artifacthub.io/packages/search?kind=0&ts_query_web=kraken) the
[chaos-kraken](https://artifacthub.io/packages/helm/startx/chaos-kraken) `helm-chart`
which can be used to deploy kraken chaos scenarios.
The default configuration creates the following resources:
- 1 project named **chaos-kraken**
- 1 scc with privileged context for the kraken deployment
- 1 configmap with 21 generic kraken scenarios, various scripts and configuration
- 1 configmap with kubeconfig of the targeted cluster
- 1 job named kraken-test-xxx
- 1 service to the kraken pods
- 1 route to the kraken service
```bash
# Install the startx helm repository
helm repo add startx https://startxfr.github.io/helm-repository/packages/
# Install the kraken project
helm install --set project.enabled=true chaos-kraken-project startx/chaos-kraken
# Deploy the kraken instance
helm install \
--set kraken.enabled=true \
--set kraken.aws.credentials.region="eu-west-3" \
--set kraken.aws.credentials.key_id="AKIAXXXXXXXXXXXXXXXX" \
--set kraken.aws.credentials.secret="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
--set kraken.kubeconfig.token.server="https://api.mycluster:6443" \
--set kraken.kubeconfig.token.token="sha256~XXXXXXXXXX_PUT_YOUR_TOKEN_HERE_XXXXXXXXXXXX" \
-n chaos-kraken \
chaos-kraken-instance startx/chaos-kraken
```
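For reference, the `--set` flags used above correspond to a values layout roughly like the sketch below (inferred from the command itself rather than the chart's full values schema; the chart documentation on artifacthub remains authoritative):
```yaml
# values.yaml sketch, usable as: helm install -n chaos-kraken -f values.yaml chaos-kraken-instance startx/chaos-kraken
kraken:
  enabled: true
  aws:
    credentials:
      region: eu-west-3
      key_id: AKIAXXXXXXXXXXXXXXXX
      secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  kubeconfig:
    token:
      server: https://api.mycluster:6443
      token: sha256~XXXXXXXXXX_PUT_YOUR_TOKEN_HERE_XXXXXXXXXXXX
```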

View File

@@ -37,17 +37,17 @@ $ python3.9 run_kraken.py --config <config_file_location>
### Run containerized version
Assuming that the latest docker (17.05 or greater with multi-build support) is installed on the host, run:
```
$ docker pull quay.io/chaos-kubox/krkn:latest
$ docker run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -d quay.io/chaos-kubox/krkn:latest
$ docker run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -v <path_to_scenarios_directory>:/root/kraken/scenarios:Z -d quay.io/chaos-kubox/krkn:latest #custom or tweaked scenario configs
$ docker pull quay.io/redhat-chaos/krkn:latest
$ docker run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -d quay.io/redhat-chaos/krkn:latest
$ docker run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -v <path_to_scenarios_directory>:/root/kraken/scenarios:Z -d quay.io/redhat-chaos/krkn:latest #custom or tweaked scenario configs
$ docker logs -f kraken
```
Similarly, podman can be used to achieve the same:
```
$ podman pull quay.io/chaos-kubox/krkn
$ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -d quay.io/chaos-kubox/krkn:latest
$ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -v <path_to_scenarios_directory>:/root/kraken/scenarios:Z -d quay.io/chaos-kubox/krkn:latest #custom or tweaked scenario configs
$ podman pull quay.io/redhat-chaos/krkn
$ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -d quay.io/redhat-chaos/krkn:latest
$ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -v <path_to_scenarios_directory>:/root/kraken/scenarios:Z -d quay.io/redhat-chaos/krkn:latest #custom or tweaked scenario configs
$ podman logs -f kraken
```
@@ -56,3 +56,8 @@ If you want to build your own kraken image see [here](https://github.com/redhat-
### Run Kraken as a Kubernetes deployment
Refer [Instructions](https://github.com/redhat-chaos/krkn/blob/main/containers/README.md) on how to deploy and run Kraken as a Kubernetes/OpenShift deployment.
Refer to the [chaos-kraken chart manpage](https://artifacthub.io/packages/helm/startx/chaos-kraken)
and especially the [kraken configuration values](https://artifacthub.io/packages/helm/startx/chaos-kraken#chaos-kraken-values-dictionary)
for details on how to configure this chart.

View File

@@ -0,0 +1,36 @@
### ManagedCluster Scenarios
[ManagedCluster](https://open-cluster-management.io/concepts/managedcluster/) scenarios provide a way to integrate kraken with [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management).
ManagedCluster scenarios leverage [ManifestWorks](https://open-cluster-management.io/concepts/manifestwork/) to inject faults into the ManagedClusters.
The following ManagedCluster chaos scenarios are supported:
1. **managedcluster_start_scenario**: Scenario to start the ManagedCluster instance.
2. **managedcluster_stop_scenario**: Scenario to stop the ManagedCluster instance.
3. **managedcluster_stop_start_scenario**: Scenario to stop and then start the ManagedCluster instance.
4. **start_klusterlet_scenario**: Scenario to start the klusterlet of the ManagedCluster instance.
5. **stop_klusterlet_scenario**: Scenario to stop the klusterlet of the ManagedCluster instance.
6. **stop_start_klusterlet_scenario**: Scenario to stop and start the klusterlet of the ManagedCluster instance.
ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under `managedcluster_scenarios` option in the Kraken config. Refer to [managedcluster_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/kube/managedcluster_scenarios_example.yml) config file.
```
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60
```

View File

@@ -38,6 +38,14 @@ See the example node scenario or the example below.
**NOTE**: Baremetal machines are fragile. Some node actions can occasionally corrupt the filesystem if it does not shut down properly, and sometimes the kubelet does not start properly.
#### Docker
The Docker provider can be used to run node scenarios against kind clusters.
[kind](https://kind.sigs.k8s.io/) is a tool for running local Kubernetes clusters using Docker container "nodes".
kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
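As a hedged sketch, a node scenario config targeting a kind cluster could look like the following, assuming the generic node scenario format with `cloud_type: docker` and kind's default container naming (check the example node scenario referenced above for the exact fields supported by your version):
```yaml
node_scenarios:
  - actions:                                  # node chaos scenarios to be injected
    - node_stop_start_scenario
    node_name: kind-worker                    # kind node container name (placeholder, depends on your cluster)
    label_selector: node-role.kubernetes.io/worker
    instance_count: 1
    runs: 1
    timeout: 120
    cloud_type: docker                        # selects the Docker provider described here
```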
#### GCP
How to set up GCP cli to run node scenarios is defined [here](cloud_setup.md#gcp).
@@ -68,6 +76,42 @@ How to set up Alibaba cli to run node scenarios is defined [here](cloud_setup.md
#### VMware
How to set up VMware vSphere to run node scenarios is defined [here](cloud_setup.md#vmware)
This cloud type uses a different configuration style, see actions below and [example config file](../scenarios/openshift/vmware_node_scenarios.yml)
*vmware-node-terminate, vmware-node-reboot, vmware-node-stop, vmware-node-start*
#### IBMCloud
How to set up IBMCloud to run node scenarios is defined [here](cloud_setup.md#ibmcloud)
This cloud type uses a different configuration style, see actions below and [example config file](../scenarios/openshift/ibmcloud_node_scenarios.yml)
*ibmcloud-node-terminate, ibmcloud-node-reboot, ibmcloud-node-stop, ibmcloud-node-start*
#### IBMCloud and VMware example
```
- id: ibmcloud-node-stop
config:
name: "<node_name>"
label_selector: "node-role.kubernetes.io/worker" # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
runs: 1 # Number of times to inject each scenario under actions (will perform on same node each time)
instance_count: 1 # Number of nodes to perform action/select that match the label selector
timeout: 30 # Duration to wait for completion of node scenario injection
skip_openshift_checks: False # Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario
- id: ibmcloud-node-start
config:
name: "<node_name>" #Same name as before
label_selector: "node-role.kubernetes.io/worker" # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
runs: 1 # Number of times to inject each scenario under actions (will perform on same node each time)
instance_count: 1 # Number of nodes to perform action/select that match the label selector
timeout: 30 # Duration to wait for completion of node scenario injection
skip_openshift_checks: False # Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario
```
#### General

View File

@@ -0,0 +1,15 @@
### Pod outage
Scenario to block the traffic (Ingress/Egress) of a pod matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts, etc.
With the current network policies, it is not possible to explicitly block ports which are enabled by allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
##### Sample scenario config (using a plugin)
```
- id: pod_network_outage
config:
namespace: openshift-console # Required - Namespace of the pod to which filter need to be applied
direction: # Optional - List of directions to apply filters
- ingress # Blocks ingress traffic, Default both egress and ingress
ingress_ports: # Optional - List of ports to block traffic on
- 8443 # Blocks 8443, Default [], i.e. all ports.
label_selector: 'component=ui' # Blocks access to openshift console
```

View File

@@ -0,0 +1,2 @@
from .arcaflow_plugin import *
from .context_auth import ContextAuth

View File

@@ -0,0 +1,168 @@
import arcaflow
import os
import yaml
import logging
import sys
from pathlib import Path
from typing import List
from .context_auth import ContextAuth
def run(scenarios_list: List[str], kubeconfig_path: str):
for scenario in scenarios_list:
engine_args = build_args(scenario)
run_workflow(engine_args, kubeconfig_path)
def run_workflow(engine_args: arcaflow.EngineArgs, kubeconfig_path: str):
set_arca_kubeconfig(engine_args, kubeconfig_path)
exit_status = arcaflow.run(engine_args)
if exit_status != 0:
logging.error(
f"failed to run arcaflow scenario {engine_args.input}"
)
sys.exit(exit_status)
def build_args(input_file: str) -> arcaflow.EngineArgs:
"""sets the kubeconfig parsed by setArcaKubeConfig as an input to the arcaflow workflow"""
context = Path(input_file).parent
workflow = "{}/workflow.yaml".format(context)
config = "{}/config.yaml".format(context)
if not os.path.exists(context):
raise Exception(
"context folder for arcaflow workflow not found: {}".format(
context)
)
if not os.path.exists(input_file):
raise Exception(
"input file for arcaflow workflow not found: {}".format(input_file))
if not os.path.exists(workflow):
raise Exception(
"workflow file for arcaflow workflow not found: {}".format(
workflow)
)
if not os.path.exists(config):
raise Exception(
"configuration file for arcaflow workflow not found: {}".format(
config)
)
engine_args = arcaflow.EngineArgs()
engine_args.context = context
engine_args.config = config
engine_args.input = input_file
return engine_args
def set_arca_kubeconfig(engine_args: arcaflow.EngineArgs, kubeconfig_path: str):
context_auth = ContextAuth()
if not os.path.exists(kubeconfig_path):
raise Exception("kubeconfig not found in {}".format(kubeconfig_path))
with open(kubeconfig_path, "r") as stream:
try:
kubeconfig = yaml.safe_load(stream)
context_auth.fetch_auth_data(kubeconfig)
except Exception as e:
logging.error("impossible to read kubeconfig file in: {}".format(
kubeconfig_path))
raise e
kubeconfig_str = set_kubeconfig_auth(kubeconfig, context_auth)
with open(engine_args.input, "r") as stream:
input_file = yaml.safe_load(stream)
if "input_list" in input_file and isinstance(input_file["input_list"],list):
for index, _ in enumerate(input_file["input_list"]):
if isinstance(input_file["input_list"][index], dict):
input_file["input_list"][index]["kubeconfig"] = kubeconfig_str
else:
input_file["kubeconfig"] = kubeconfig_str
stream.close()
with open(engine_args.input, "w") as stream:
yaml.safe_dump(input_file, stream)
with open(engine_args.config, "r") as stream:
config_file = yaml.safe_load(stream)
if config_file["deployer"]["type"] == "kubernetes":
kube_connection = set_kubernetes_deployer_auth(config_file["deployer"]["connection"], context_auth)
config_file["deployer"]["connection"]=kube_connection
with open(engine_args.config, "w") as stream:
yaml.safe_dump(config_file, stream,explicit_start=True, width=4096)
def set_kubernetes_deployer_auth(deployer: any, context_auth: ContextAuth) -> any:
if context_auth.clusterHost is not None :
deployer["host"] = context_auth.clusterHost
if context_auth.clientCertificateData is not None :
deployer["cert"] = context_auth.clientCertificateData
if context_auth.clientKeyData is not None:
deployer["key"] = context_auth.clientKeyData
if context_auth.clusterCertificateData is not None:
deployer["cacert"] = context_auth.clusterCertificateData
if context_auth.username is not None:
deployer["username"] = context_auth.username
if context_auth.password is not None:
deployer["password"] = context_auth.password
if context_auth.bearerToken is not None:
deployer["bearerToken"] = context_auth.bearerToken
return deployer
def set_kubeconfig_auth(kubeconfig: any, context_auth: ContextAuth) -> str:
"""
Builds an arcaflow-compatible kubeconfig representation and returns it as a string.
In order to run arcaflow plugins in kubernetes/openshift the kubeconfig must contain client certificate/key
and server certificate base64 encoded within the kubeconfig file itself in *-data fields. That is not always the
case; in fact, the kubeconfig may contain filesystem paths to those files. This function builds an arcaflow-compatible
kubeconfig file and returns it as a string that can be safely included in input.yaml
"""
if "current-context" not in kubeconfig.keys():
raise Exception(
"invalid kubeconfig file, impossible to determine current-context"
)
user_id = None
cluster_id = None
user_name = None
cluster_name = None
current_context = kubeconfig["current-context"]
for context in kubeconfig["contexts"]:
if context["name"] == current_context:
user_name = context["context"]["user"]
cluster_name = context["context"]["cluster"]
if user_name is None:
raise Exception(
"user not set for context {} in kubeconfig file".format(current_context)
)
if cluster_name is None:
raise Exception(
"cluster not set for context {} in kubeconfig file".format(current_context)
)
for index, user in enumerate(kubeconfig["users"]):
if user["name"] == user_name:
user_id = index
for index, cluster in enumerate(kubeconfig["clusters"]):
if cluster["name"] == cluster_name:
cluster_id = index
if cluster_id is None:
raise Exception(
"no cluster {} found in kubeconfig users".format(cluster_name)
)
if "client-certificate" in kubeconfig["users"][user_id]["user"]:
kubeconfig["users"][user_id]["user"]["client-certificate-data"] = context_auth.clientCertificateDataBase64
del kubeconfig["users"][user_id]["user"]["client-certificate"]
if "client-key" in kubeconfig["users"][user_id]["user"]:
kubeconfig["users"][user_id]["user"]["client-key-data"] = context_auth.clientKeyDataBase64
del kubeconfig["users"][user_id]["user"]["client-key"]
if "certificate-authority" in kubeconfig["clusters"][cluster_id]["cluster"]:
kubeconfig["clusters"][cluster_id]["cluster"]["certificate-authority-data"] = context_auth.clusterCertificateDataBase64
del kubeconfig["clusters"][cluster_id]["cluster"]["certificate-authority"]
kubeconfig_str = yaml.dump(kubeconfig)
return kubeconfig_str

View File

@@ -0,0 +1,142 @@
import yaml
import os
import base64
class ContextAuth:
clusterCertificate: str = None
clusterCertificateData: str = None
clusterHost: str = None
clientCertificate: str = None
clientCertificateData: str = None
clientKey: str = None
clientKeyData: str = None
clusterName: str = None
username: str = None
password: str = None
bearerToken: str = None
# TODO: integrate in krkn-lib-kubernetes in the next iteration
@property
def clusterCertificateDataBase64(self):
if self.clusterCertificateData is not None:
return base64.b64encode(bytes(self.clusterCertificateData,'utf8')).decode("ascii")
return
@property
def clientCertificateDataBase64(self):
if self.clientCertificateData is not None:
return base64.b64encode(bytes(self.clientCertificateData,'utf8')).decode("ascii")
return
@property
def clientKeyDataBase64(self):
if self.clientKeyData is not None:
return base64.b64encode(bytes(self.clientKeyData,"utf-8")).decode("ascii")
return
def fetch_auth_data(self, kubeconfig: any):
context_username = None
current_context = kubeconfig["current-context"]
if current_context is None:
raise Exception("no current-context found in kubeconfig")
for context in kubeconfig["contexts"]:
if context["name"] == current_context:
context_username = context["context"]["user"]
self.clusterName = context["context"]["cluster"]
if context_username is None:
raise Exception("user not found for context {0}".format(current_context))
if self.clusterName is None:
raise Exception("cluster not found for context {0}".format(current_context))
cluster_id = None
user_id = None
for index, user in enumerate(kubeconfig["users"]):
if user["name"] == context_username:
user_id = index
if user_id is None :
raise Exception("user {0} not found in kubeconfig users".format(context_username))
for index, cluster in enumerate(kubeconfig["clusters"]):
if cluster["name"] == self.clusterName:
cluster_id = index
if cluster_id is None:
raise Exception(
"no cluster {} found in kubeconfig users".format(self.clusterName)
)
user = kubeconfig["users"][user_id]["user"]
cluster = kubeconfig["clusters"][cluster_id]["cluster"]
# sets cluster api URL
self.clusterHost = cluster["server"]
# client certificates
if "client-key" in user:
try:
self.clientKey = user["client-key"]
self.clientKeyData = self.read_file(user["client-key"])
except Exception as e:
raise e
if "client-key-data" in user:
try:
self.clientKeyData = base64.b64decode(user["client-key-data"]).decode('utf-8')
except Exception as e:
raise Exception("impossible to decode client-key-data")
if "client-certificate" in user:
try:
self.clientCertificate = user["client-certificate"]
self.clientCertificateData = self.read_file(user["client-certificate"])
except Exception as e:
raise e
if "client-certificate-data" in user:
try:
self.clientCertificateData = base64.b64decode(user["client-certificate-data"]).decode('utf-8')
except Exception as e:
raise Exception("impossible to decode client-certificate-data")
# cluster certificate authority
if "certificate-authority" in cluster:
try:
self.clusterCertificate = cluster["certificate-authority"]
self.clusterCertificateData = self.read_file(cluster["certificate-authority"])
except Exception as e:
raise e
if "certificate-authority-data" in cluster:
try:
self.clusterCertificateData = base64.b64decode(cluster["certificate-authority-data"]).decode('utf-8')
except Exception as e:
raise Exception("impossible to decode certificate-authority-data")
if "username" in user:
self.username = user["username"]
if "password" in user:
self.password = user["password"]
if "token" in user:
self.bearerToken = user["token"]
def read_file(self, filename:str) -> str:
if not os.path.exists(filename):
raise Exception("file not found {0} ".format(filename))
with open(filename, "rb") as file_stream:
return file_stream.read().decode('utf-8')

View File

@@ -0,0 +1,19 @@
-----BEGIN CERTIFICATE-----
MIIDBjCCAe6gAwIBAgIBATANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwptaW5p
a3ViZUNBMB4XDTIzMDMxMzE1NDAxM1oXDTMzMDMxMTE1NDAxM1owFTETMBEGA1UE
AxMKbWluaWt1YmVDQTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMnz
U/gIbJBRGOgNYVKX2fV03ANOwnM4VjquR28QMAdxURqgOFZ6IxYNysHEyxxE9I+I
DAm9hi4vQPbOX7FlxUezuzw+ExEfa6RRJ+n+AGJOV1lezCVph6OaJxB1+L1UqaDZ
eM3B4cUf/iCc5Y4bs927+CBG3MJL/jmCVPCO+MiSn/l73PXSFNJAYMvRj42zkXqD
CVG9CwY2vWgZnnzl01l7jNGtie871AmV2uqKakJrQ2ILhD+8fZk4jE5JBDTCZnqQ
pXIc+vERNKLUS8cvjO6Ux8dMv/Z7+xonpXOU59LlpUdHWP9jgCvMTwiOriwqGjJ+
pQJWpX9Dm+oxJiVOJzsCAwEAAaNhMF8wDgYDVR0PAQH/BAQDAgKkMB0GA1UdJQQW
MBQGCCsGAQUFBwMCBggrBgEFBQcDATAPBgNVHRMBAf8EBTADAQH/MB0GA1UdDgQW
BBQU9pDMtbayJdNM6bp0IG8dcs15qTANBgkqhkiG9w0BAQsFAAOCAQEAtl9TVKPA
hTnPODqv0AGTqreS9kLg4WUUjZRaPUkPWmtCoTh2Yf55nRWdHOHeZnCWDSg24x42
lpt+13IdqKew1RKTpKCTkicMFi090A01bYu/w39Cm6nOAA5h8zkgSkV5czvQotuV
SoN2vB+nbuY28ah5PkdqjMHEZbNwa59cgEke8wB1R1DWFQ/pqflrH2v9ACAuY+5Q
i673tA6CXrb1YfaCQnVBzcfvjGS1MqShPKpOLMF+/GccPczNimaBxMnKvYLvf3pN
qEUrJC00mAcein8HmxR2Xz8wredbMUUyrQxW29pZJwfGE5GU0olnlsA0lZLbTwio
xoolo5y+fsK/dA==
-----END CERTIFICATE-----

View File

@@ -0,0 +1,19 @@
-----BEGIN CERTIFICATE-----
MIIDITCCAgmgAwIBAgIBAjANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwptaW5p
a3ViZUNBMB4XDTIzMDUwMTA4NTc0N1oXDTI2MDUwMTA4NTc0N1owMTEXMBUGA1UE
ChMOc3lzdGVtOm1hc3RlcnMxFjAUBgNVBAMTDW1pbmlrdWJlLXVzZXIwggEiMA0G
CSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC0b7uy9nQYrh7uC5NODve7dFNLAgo5
pWRS6Kx13ULA55gOpieZiI5/1jwUBjOz0Hhl5QAdHC1HDNu5wf4MmwIEheuq3kMA
mfuvNxW2BnWSDuXyUMlBfqlwg5o6W8ndEWaK33D7wd2WQsSsAnhQPJSjnzWKvWKq
+Kbcygc4hdss/ZWN+SXLTahNpHBw0sw8AcJqddNeXs2WI5GdZmbXL4QZI36EaNUm
m4xKmKRKYIP9wYkmXOV/D2h1meM44y4lul5v2qvo6I+umJ84q4W1/W1vVmAzyVfL
v1TQCUx8cpKMHzw3ma6CTBCtU3Oq9HKHBnf8GyHZicmV7ESzf/phJu4ZAgMBAAGj
YDBeMA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUH
AwIwDAYDVR0TAQH/BAIwADAfBgNVHSMEGDAWgBQU9pDMtbayJdNM6bp0IG8dcs15
qTANBgkqhkiG9w0BAQsFAAOCAQEABNzEQQMYUcLsBASHladEjr46avKn7gREfaDl
Y5PBvgCPP42q/sW/9iCNY3UpT9TJZWM6s01+0p6I96jYbRQER1NX7O4OgQYHmFw2
PF6UOG2vMo54w11OvL7sbr4d+nkE6ItdM9fLDIJ3fEOYJZkSoxhOL/U3jSjIl7Wu
KCIlpM/M/gcZ4w2IvcLrWtvswbFNUd+dwQfBGcQTmSQDOLE7MqSvzYAkeNv73GLB
ieba7gs/PmoTFsf9nW60iXymDDF4MtODn15kqT/y1uD6coujmiEiIomBfxqAkUCU
0ciP/KF5oOEMmMedm7/peQxaRTMdRSk4yu7vbj/BxnTcj039Qg==
-----END CERTIFICATE-----

View File

@@ -0,0 +1,27 @@
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEAtG+7svZ0GK4e7guTTg73u3RTSwIKOaVkUuisdd1CwOeYDqYn
mYiOf9Y8FAYzs9B4ZeUAHRwtRwzbucH+DJsCBIXrqt5DAJn7rzcVtgZ1kg7l8lDJ
QX6pcIOaOlvJ3RFmit9w+8HdlkLErAJ4UDyUo581ir1iqvim3MoHOIXbLP2Vjfkl
y02oTaRwcNLMPAHCanXTXl7NliORnWZm1y+EGSN+hGjVJpuMSpikSmCD/cGJJlzl
fw9odZnjOOMuJbpeb9qr6OiPrpifOKuFtf1tb1ZgM8lXy79U0AlMfHKSjB88N5mu
gkwQrVNzqvRyhwZ3/Bsh2YnJlexEs3/6YSbuGQIDAQABAoIBAQCdJxPb8zt6o2zc
98f8nJy378D7+3LccmjGrVBH98ZELXIKkDy9RGqYfQcmiaBOZKv4U1OeBwSIdXKK
f6O9ZuSC/AEeeSbyRysmmFuYhlewNrmgKyyelqsNDBIv8fIHUTh2i9Xj8B4G2XBi
QGR5vcnYGLqRdBGTx63Nb0iKuksDCwPAuPA/e0ySz9HdWL1j4bqpVSYsOIXsqTDr
CVnxUeSIL0fFQnRm3IASXQD7zdq9eEFX7vESeleZoz8qNcKb4Na/C3N6crScjgH7
qyNZ2zNLfy1LT84k8uc1TMX2KcEVEmfdDv5cCnUH2ic12CwXMZ0vgId5LJTaHx4x
ytIQIe5hAoGBANB+TsRXP4KzcjZlUUfiAp/pWUM4kVktbsfZa1R2NEuIGJUxPk3P
7WS0WX5W75QKRg+UWTubg5kfd0f9fklLgofmliBnY/HrpgdyugJmUZBgzIxmy0k+
aCe0biD1gULfyyrKtfe8k5wRFstzhfGszlOf2ebR87sSVNBuF2lEwPTvAoGBAN2M
0/XrsodGU4B9Mj86Go2gb2k2WU2izI0cO+tm2S5U5DvKmVEnmjXfPRaOFj2UUQjo
cljnDAinbN+O0+Inc35qsEeYdAIepNAPglzcpfTHagja9mhx2idLYTXGhbZLL+Ei
TRzMyP27NF+GVVfYU/cA86ns6NboG6spohmnqh13AoGAKPc4aNGv0/GIVnHP56zb
0SnbdR7PSFNp+fCZay4Slmi2U9IqKMXbIjdhgjZ4uoDORU9jvReQYuzQ1h9TyfkB
O8yt4M4P0D/6DmqXa9NI4XJznn6wIMMXWf3UybsTW913IQBVgsjVxAuDjBQ11Eec
/sdg3D6SgkZWzeFjzjZJJ5cCgYBSYVg7fE3hERxhjawOaJuRCBQFSklAngVzfwkk
yhR9ruFC/l2uGIy19XFwnprUgP700gIa3qbR3PeV1TUiRcsjOaacqKqSUzSzjODL
iNxIvZHHAyxWv+b/b38REOWNWD3QeAG2cMtX1bFux7OaO31VPkxcZhRaPOp05cE5
yudtlwKBgDBbR7RLYn03OPm3NDBLLjTybhD8Iu8Oj7UeNCiEWAdZpqIKYnwSxMzQ
kdo4aTENA/seEwq+XDV7TwbUIFFJg5gDXIhkcK2c9kiO2bObCAmKpBlQCcrp0a5X
NSBk1N/ZG/Qhqns7z8k01KN4LNcdpRoNiYYPgY+p3xbY8+nWhv+q
-----END RSA PRIVATE KEY-----

View File

@@ -0,0 +1,100 @@
import os
import unittest
import yaml
from context_auth import ContextAuth
class TestCurrentContext(unittest.TestCase):
def get_kubeconfig_with_data(self) -> str:
"""
This function returns a test kubeconfig file as a string.
:return: a test kubeconfig file in string format (for unit testing purposes)
""" # NOQA
return """apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM5ekNDQWQrZ0F3SUJBZ0lVV01PTVBNMVUrRi9uNXN6TSthYzlMcGZISHB3d0RRWUpLb1pJaHZjTkFRRUwKQlFBd0hqRWNNQm9HQTFVRUF3d1RhM1ZpZFc1MGRTNXNiMk5oYkdSdmJXRnBiakFlRncweU1URXlNRFl4T0RBdwpNRFJhRncwek1URXlNRFF4T0RBd01EUmFNQjR4SERBYUJnTlZCQU1NRTJ0MVluVnVkSFV1Ykc5allXeGtiMjFoCmFXNHdnZ0VpTUEwR0NTcUdTSWIzRFFFQkFRVUFBNElCRHdBd2dnRUtBb0lCQVFDNExhcG00SDB0T1NuYTNXVisKdzI4a0tOWWRwaHhYOUtvNjUwVGlOK2c5ZFNQU3VZK0V6T1JVOWVONlgyWUZkMEJmVFNodno4Y25rclAvNysxegpETEoxQ3MwRi9haEV3ZDQxQXN5UGFjbnRiVE80dGRLWm9POUdyODR3YVdBN1hSZmtEc2ZxRGN1YW5UTmVmT1hpCkdGbmdDVzU5Q285M056alB1eEFrakJxdVF6eE5GQkgwRlJPbXJtVFJ4cnVLZXo0aFFuUW1OWEFUNnp0M21udzMKWUtWTzU4b2xlcUxUcjVHNlRtVFQyYTZpVGdtdWY2N0cvaVZlalJGbkw3YkNHWmgzSjlCSTNMcVpqRzE4dWxvbgpaVDdQcGQrQTlnaTJOTm9UZlI2TVB5SndxU1BCL0xZQU5ZNGRoZDVJYlVydDZzbmViTlRZSHV2T0tZTDdNTWRMCmVMSzFBZ01CQUFHakxUQXJNQWtHQTFVZEV3UUNNQUF3SGdZRFZSMFJCQmN3RllJVGEzVmlkVzUwZFM1c2IyTmgKYkdSdmJXRnBiakFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBQTVqUHVpZVlnMExySE1PSkxYY0N4d3EvVzBDNApZeFpncVd3VHF5VHNCZjVKdDlhYTk0SkZTc2dHQWdzUTN3NnA2SlBtL0MyR05MY3U4ZWxjV0E4UXViQWxueXRRCnF1cEh5WnYrZ08wMG83TXdrejZrTUxqQVZ0QllkRzJnZ21FRjViTEk5czBKSEhjUGpHUkl1VHV0Z0tHV1dPWHgKSEg4T0RzaG9wZHRXMktrR2c2aThKaEpYaWVIbzkzTHptM00xRUNGcXAvMEdtNkN1RFphVVA2SGpJMWRrYllLdgpsSHNVZ1U1SmZjSWhNYmJLdUllTzRkc1YvT3FHcm9iNW5vcmRjaExBQmRDTnc1cmU5T1NXZGZ1VVhSK0ViZVhrCjVFM0tFYzA1RGNjcGV2a1NTdlJ4SVQrQzNMOTltWGcxL3B5NEw3VUhvNFFLTXlqWXJXTWlLRlVKV1E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
server: https://127.0.0.1:6443
name: default
contexts:
- context:
cluster: default
namespace: default
user: testuser
name: default
current-context: default
kind: Config
preferences: {}
users:
- name: testuser
user:
client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM5ekNDQWQrZ0F3SUJBZ0lVV01PTVBNMVUrRi9uNXN6TSthYzlMcGZISHB3d0RRWUpLb1pJaHZjTkFRRUwKQlFBd0hqRWNNQm9HQTFVRUF3d1RhM1ZpZFc1MGRTNXNiMk5oYkdSdmJXRnBiakFlRncweU1URXlNRFl4T0RBdwpNRFJhRncwek1URXlNRFF4T0RBd01EUmFNQjR4SERBYUJnTlZCQU1NRTJ0MVluVnVkSFV1Ykc5allXeGtiMjFoCmFXNHdnZ0VpTUEwR0NTcUdTSWIzRFFFQkFRVUFBNElCRHdBd2dnRUtBb0lCQVFDNExhcG00SDB0T1NuYTNXVisKdzI4a0tOWWRwaHhYOUtvNjUwVGlOK2c5ZFNQU3VZK0V6T1JVOWVONlgyWUZkMEJmVFNodno4Y25rclAvNysxegpETEoxQ3MwRi9haEV3ZDQxQXN5UGFjbnRiVE80dGRLWm9POUdyODR3YVdBN1hSZmtEc2ZxRGN1YW5UTmVmT1hpCkdGbmdDVzU5Q285M056alB1eEFrakJxdVF6eE5GQkgwRlJPbXJtVFJ4cnVLZXo0aFFuUW1OWEFUNnp0M21udzMKWUtWTzU4b2xlcUxUcjVHNlRtVFQyYTZpVGdtdWY2N0cvaVZlalJGbkw3YkNHWmgzSjlCSTNMcVpqRzE4dWxvbgpaVDdQcGQrQTlnaTJOTm9UZlI2TVB5SndxU1BCL0xZQU5ZNGRoZDVJYlVydDZzbmViTlRZSHV2T0tZTDdNTWRMCmVMSzFBZ01CQUFHakxUQXJNQWtHQTFVZEV3UUNNQUF3SGdZRFZSMFJCQmN3RllJVGEzVmlkVzUwZFM1c2IyTmgKYkdSdmJXRnBiakFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBQTVqUHVpZVlnMExySE1PSkxYY0N4d3EvVzBDNApZeFpncVd3VHF5VHNCZjVKdDlhYTk0SkZTc2dHQWdzUTN3NnA2SlBtL0MyR05MY3U4ZWxjV0E4UXViQWxueXRRCnF1cEh5WnYrZ08wMG83TXdrejZrTUxqQVZ0QllkRzJnZ21FRjViTEk5czBKSEhjUGpHUkl1VHV0Z0tHV1dPWHgKSEg4T0RzaG9wZHRXMktrR2c2aThKaEpYaWVIbzkzTHptM00xRUNGcXAvMEdtNkN1RFphVVA2SGpJMWRrYllLdgpsSHNVZ1U1SmZjSWhNYmJLdUllTzRkc1YvT3FHcm9iNW5vcmRjaExBQmRDTnc1cmU5T1NXZGZ1VVhSK0ViZVhrCjVFM0tFYzA1RGNjcGV2a1NTdlJ4SVQrQzNMOTltWGcxL3B5NEw3VUhvNFFLTXlqWXJXTWlLRlVKV1E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
client-key-data: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2QUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktZd2dnU2lBZ0VBQW9JQkFRQzRMYXBtNEgwdE9TbmEKM1dWK3cyOGtLTllkcGh4WDlLbzY1MFRpTitnOWRTUFN1WStFek9SVTllTjZYMllGZDBCZlRTaHZ6OGNua3JQLwo3KzF6RExKMUNzMEYvYWhFd2Q0MUFzeVBhY250YlRPNHRkS1pvTzlHcjg0d2FXQTdYUmZrRHNmcURjdWFuVE5lCmZPWGlHRm5nQ1c1OUNvOTNOempQdXhBa2pCcXVRenhORkJIMEZST21ybVRSeHJ1S2V6NGhRblFtTlhBVDZ6dDMKbW53M1lLVk81OG9sZXFMVHI1RzZUbVRUMmE2aVRnbXVmNjdHL2lWZWpSRm5MN2JDR1poM0o5QkkzTHFaakcxOAp1bG9uWlQ3UHBkK0E5Z2kyTk5vVGZSNk1QeUp3cVNQQi9MWUFOWTRkaGQ1SWJVcnQ2c25lYk5UWUh1dk9LWUw3Ck1NZExlTEsxQWdNQkFBRUNnZ0VBQ28rank4NW5ueVk5L2l6ZjJ3cjkzb2J3OERaTVBjYnIxQURhOUZYY1hWblEKT2c4bDZhbU9Ga2tiU0RNY09JZ0VDdkx6dEtXbmQ5OXpydU5sTEVtNEdmb0trNk5kK01OZEtKRUdoZHE5RjM1Qgpqdi91R1owZTIyRE5ZLzFHNVdDTE5DcWMwQkVHY2RFOTF0YzJuMlppRVBTNWZ6WVJ6L1k4cmJ5K1NqbzJkWE9RCmRHYWRlUFplbi9UbmlHTFlqZWhrbXZNQjJvU0FDbVMycTd2OUNrcmdmR1RZbWJzeGVjSU1QK0JONG9KS3BOZ28KOUpnRWJ5SUxkR1pZS2pQb2lLaHNjMVhmSy8zZStXSmxuYjJBaEE5Y1JMUzhMcDdtcEYySWp4SjNSNE93QTg3WQpNeGZvZWFGdnNuVUFHWUdFWFo4Z3BkWmhQMEoxNWRGdERjajIrcngrQVFLQmdRRDFoSE9nVGdFbERrVEc5bm5TCjE1eXYxRzUxYnJMQU1UaWpzNklEMU1qelhzck0xY2ZvazVaaUlxNVJsQ3dReTlYNDdtV1RhY0lZRGR4TGJEcXEKY0IydjR5Wm1YK1VleGJ3cDU1OWY0V05HdzF5YzQrQjdaNFF5aTRFelN4WmFjbldjMnBzcHJMUFVoOUFXRXVNcApOaW1vcXNiVGNnNGs5QWRxeUIrbWhIWmJRUUtCZ1FEQUNzU09qNXZMU1VtaVpxYWcrOVMySUxZOVNOdDZzS1VyCkprcjdCZEVpN3N2YmU5cldRR2RBb0xkQXNzcU94aENydmtPNkpSSHB1YjlRRjlYdlF4Riszc2ZpZm4yYkQ0ZloKMlVsclA1emF3RlNrNDNLbjdMZzRscURpaVUxVGlqTkJBL3dUcFlmbTB4dW5WeFRWNDZpNVViQW1XRk12TWV0bQozWUZYQmJkK2RRS0JnRGl6Q1B6cFpzeEcrazAwbUxlL2dYajl4ekNwaXZCbHJaM29teTdsVWk4YUloMmg5VlBaCjJhMzZNbVcyb1dLVG9HdW5xcCtibWU1eUxRRGlFcjVQdkJ0bGl2V3ppYmRNbFFMY2Nlcnpveml4WDA4QU5WUnEKZUpZdnIzdklDSGFFM25LRjdiVjNJK1NlSk1ra1BYL0QrV1R4WTQ5clZLYm1FRnh4c1JXRW04ekJBb0dBWEZ3UgpZanJoQTZqUW1DRmtYQ0loa0NJMVkwNEorSHpDUXZsY3NGT0EzSnNhUWduVUdwekl5OFUvdlFiLzhpQ0IzZ2RZCmpVck16YXErdnVkbnhYVnRFYVpWWGJIVitPQkVSdHFBdStyUkprZS9yYm1SNS84cUxsVUxOVWd4ZjA4RkRXeTgKTERxOUhKOUZPbnJnRTJvMU9FTjRRMGpSWU81U041dXFXODd0REEwQ2dZQXpXbk1KSFgrbmlyMjhRRXFyVnJKRAo4ZUEwOHIwWTJRMDhMRlcvMjNIVWQ4WU12VnhTUTdwcUwzaE41RXVJQ2dCbEpGVFI3TndBREo3eDY2M002akFMCm1DNlI4dWxSZStwa08xN2Y0UUs3MnVRanJGZEhESnlXQmdDL0RKSkV6d1dwY0Q4VVNPK3A5bVVIbllLTUJTOEsKTVB1ejYrZ3h0VEtsRU5pZUVacXhxZz09Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K
username: testuser
password: testpassword
token: sha256~fFyEqjf1xxFMO0tbEyGRvWeNOd7QByuEgS4hyEq_A9o
""" # NOQA
def get_kubeconfig_with_paths(self) -> str:
"""
This function returns a test kubeconfig file as a string.
:return: a test kubeconfig file in string format (for unit testing purposes)
""" # NOQA
return """apiVersion: v1
clusters:
- cluster:
certificate-authority: fixtures/ca.crt
server: https://127.0.0.1:6443
name: default
contexts:
- context:
cluster: default
namespace: default
user: testuser
name: default
current-context: default
kind: Config
preferences: {}
users:
- name: testuser
user:
client-certificate: fixtures/client.crt
client-key: fixtures/client.key
username: testuser
password: testpassword
token: sha256~fFyEqjf1xxFMO0tbEyGRvWeNOd7QByuEgS4hyEq_A9o
""" # NOQA
def test_current_context(self):
cwd = os.getcwd()
current_context_data = ContextAuth()
current_context_data.fetch_auth_data(yaml.safe_load(self.get_kubeconfig_with_data()))
self.assertIsNotNone(current_context_data.clusterCertificateData)
self.assertIsNotNone(current_context_data.clientCertificateData)
self.assertIsNotNone(current_context_data.clientKeyData)
self.assertIsNotNone(current_context_data.username)
self.assertIsNotNone(current_context_data.password)
self.assertIsNotNone(current_context_data.bearerToken)
self.assertIsNotNone(current_context_data.clusterHost)
current_context_no_data = ContextAuth()
current_context_no_data.fetch_auth_data(yaml.safe_load(self.get_kubeconfig_with_paths()))
self.assertIsNotNone(current_context_no_data.clusterCertificate)
self.assertIsNotNone(current_context_no_data.clusterCertificateData)
self.assertIsNotNone(current_context_no_data.clientCertificate)
self.assertIsNotNone(current_context_no_data.clientCertificateData)
self.assertIsNotNone(current_context_no_data.clientKey)
self.assertIsNotNone(current_context_no_data.clientKeyData)
self.assertIsNotNone(current_context_no_data.username)
self.assertIsNotNone(current_context_no_data.password)
self.assertIsNotNone(current_context_no_data.bearerToken)
self.assertIsNotNone(current_context_no_data.clusterHost)

View File

@@ -96,7 +96,10 @@ def alerts(distribution, prometheus_url, prometheus_bearer_token, start_time, en
)
try:
logging.info("Running kube-burner to capture the metrics: %s" % command)
subprocess.run(command, shell=True, universal_newlines=True)
output = subprocess.run(command, shell=True, universal_newlines=True)
if output.returncode != 0:
logging.error("command exited with a non-zero rc, please check the logs for errors or critical alerts")
sys.exit(output.returncode)
except Exception as e:
logging.error("Failed to run kube-burner, error: %s" % (e))
sys.exit(1)

View File

@@ -25,13 +25,13 @@ def initialize_clients(kubeconfig_path):
global custom_object_client
try:
config.load_kube_config(kubeconfig_path)
cli = client.CoreV1Api()
batch_cli = client.BatchV1Api()
watch_resource = watch.Watch()
api_client = client.ApiClient()
custom_object_client = client.CustomObjectsApi()
k8s_client = config.new_client_from_config()
k8s_client = config.new_client_from_config(config_file=kubeconfig_path)
cli = client.CoreV1Api(k8s_client)
batch_cli = client.BatchV1Api(k8s_client)
custom_object_client = client.CustomObjectsApi(k8s_client)
dyn_client = DynamicClient(k8s_client)
watch_resource = watch.Watch()
except ApiException as e:
logging.error("Failed to initialize kubernetes client: %s\n" % e)
sys.exit(1)
@@ -175,6 +175,26 @@ def list_killable_nodes(label_selector=None):
return nodes
# List managedclusters attached to the hub that can be killed
def list_killable_managedclusters(label_selector=None):
managedclusters = []
try:
ret = custom_object_client.list_cluster_custom_object(
group="cluster.open-cluster-management.io",
version="v1",
plural="managedclusters",
label_selector=label_selector
)
except ApiException as e:
logging.error("Exception when calling CustomObjectsApi->list_cluster_custom_object: %s\n" % e)
raise e
for managedcluster in ret['items']:
conditions = managedcluster['status']['conditions']
available = list(filter(lambda condition: condition['reason'] == 'ManagedClusterAvailable', conditions))
if available and available[0]['status'] == 'True':
managedclusters.append(managedcluster['metadata']['name'])
return managedclusters
# List pods in the given namespace
def list_pods(namespace, label_selector=None):
pods = []
@@ -362,6 +382,33 @@ def create_job(body, namespace="default"):
raise
def create_manifestwork(body, namespace):
try:
api_response = custom_object_client.create_namespaced_custom_object(
group="work.open-cluster-management.io",
version="v1",
plural="manifestworks",
body=body,
namespace=namespace
)
return api_response
except ApiException as e:
print("Exception when calling CustomObjectsApi->create_namespaced_custom_object: %s\n" % e)
def delete_manifestwork(namespace):
try:
api_response = custom_object_client.delete_namespaced_custom_object(
group="work.open-cluster-management.io",
version="v1",
plural="manifestworks",
name="managedcluster-scenarios-template",
namespace=namespace
)
return api_response
except ApiException as e:
print("Exception when calling CustomObjectsApi->delete_namespaced_custom_object: %s\n" % e)
def get_job_status(name, namespace="default"):
try:
return batch_cli.read_namespaced_job_status(
@@ -814,6 +861,30 @@ def watch_node_status(node, status, timeout, resource_version):
watch_resource.stop()
# Watch for a specific managedcluster status
# TODO: Implement this with a watcher instead of polling
def watch_managedcluster_status(managedcluster, status, timeout):
elapsed_time = 0
while True:
conditions = custom_object_client.get_cluster_custom_object_status(
"cluster.open-cluster-management.io", "v1", "managedclusters", managedcluster
)['status']['conditions']
available = list(filter(lambda condition: condition['reason'] == 'ManagedClusterAvailable', conditions))
if status == "True":
if available and available[0]['status'] == "True":
logging.info("Status of managedcluster " + managedcluster + ": Available")
return True
else:
if not available:
logging.info("Status of managedcluster " + managedcluster + ": Unavailable")
return True
time.sleep(2)
elapsed_time += 2
if elapsed_time >= timeout:
logging.info("Timeout waiting for managedcluster " + managedcluster + " to become: " + status)
return False
# Get the resource version for the specified node
def get_node_resource_version(node):
return cli.read_node(name=node).metadata.resource_version

View File

@@ -0,0 +1,34 @@
import random
import logging
import kraken.kubernetes.client as kubecli
# Pick a random managedcluster with specified label selector
def get_managedcluster(managedcluster_name, label_selector, instance_kill_count):
if managedcluster_name in kubecli.list_killable_managedclusters():
return [managedcluster_name]
elif managedcluster_name:
logging.info("managedcluster with provided managedcluster_name does not exist or the managedcluster might " "be in unavailable state.")
managedclusters = kubecli.list_killable_managedclusters(label_selector)
if not managedclusters:
raise Exception("Available managedclusters with the provided label selector do not exist")
logging.info("Available managedclusters with the label selector %s: %s" % (label_selector, managedclusters))
number_of_managedclusters = len(managedclusters)
if instance_kill_count == number_of_managedclusters:
return managedclusters
managedclusters_to_return = []
for i in range(instance_kill_count):
managedcluster_to_add = managedclusters[random.randint(0, len(managedclusters) - 1)]
managedclusters_to_return.append(managedcluster_to_add)
managedclusters.remove(managedcluster_to_add)
return managedclusters_to_return
# Wait until the managedcluster status becomes Available
def wait_for_available_status(managedcluster, timeout):
kubecli.watch_managedcluster_status(managedcluster, "True", timeout)
# Wait until the managedcluster status becomes Not Available
def wait_for_unavailable_status(managedcluster, timeout):
kubecli.watch_managedcluster_status(managedcluster, "Unknown", timeout)

View File

@@ -0,0 +1,140 @@
from jinja2 import Environment, FileSystemLoader
import os
import time
import logging
import sys
import yaml
import html
import kraken.kubernetes.client as kubecli
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions
class GENERAL:
def __init__(self):
pass
class managedcluster_scenarios():
def __init__(self):
self.general = GENERAL()
# managedcluster scenario to start the managedcluster
def managedcluster_start_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_start_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3 &
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 1 -n open-cluster-management-agent""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_start_scenario has been successfully injected!")
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_available_status(managedcluster, timeout)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop the managedcluster
def managedcluster_stop_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_stop_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)),encoding='utf-8')
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0 &&
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 0 -n open-cluster-management-agent""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_stop_scenario has been successfully injected!")
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_unavailable_status(managedcluster, timeout)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop and then start the managedcluster
def managedcluster_stop_start_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("Starting managedcluster_stop_start_scenario injection")
self.managedcluster_stop_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.managedcluster_start_scenario(instance_kill_count, managedcluster, timeout)
logging.info("managedcluster_stop_start_scenario has been successfully injected!")
# managedcluster scenario to terminate the managedcluster
def managedcluster_termination_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster termination is not implemented, " "no action is going to be taken")
# managedcluster scenario to reboot the managedcluster
def managedcluster_reboot_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster reboot is not implemented," " no action is going to be taken")
# managedcluster scenario to start the klusterlet
def start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting start_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("start_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop the klusterlet
def stop_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting stop_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("stop_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop and start the klusterlet
def stop_start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("Starting stop_start_klusterlet_scenario injection")
self.stop_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.start_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
logging.info("stop_start_klusterlet_scenario has been successfully injected!")
# managedcluster scenario to crash the managedcluster
def managedcluster_crash_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster crash scenario is not implemented, " "no action is going to be taken")

View File

@@ -0,0 +1,68 @@
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
namespace: {{managedcluster_name}}
name: managedcluster-scenarios-template
spec:
workload:
manifests:
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: scale-deploy
namespace: open-cluster-management
rules:
- apiGroups: ["apps"]
resources: ["deployments/scale"]
verbs: ["patch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get"]
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management-agent
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: v1
kind: ServiceAccount
metadata:
name: internal-kubectl
namespace: open-cluster-management
- apiVersion: batch/v1
kind: Job
metadata:
name: managedcluster-scenarios-template
namespace: open-cluster-management
spec:
template:
spec:
serviceAccountName: internal-kubectl
containers:
- name: kubectl
image: quay.io/sighup/kubectl-kustomize:1.21.6_3.9.1
command: ["/bin/sh", "-c"]
args:
- {{args}}
restartPolicy: Never
backoffLimit: 0

View File

@@ -0,0 +1,66 @@
import yaml
import logging
import time
from kraken.managedcluster_scenarios.managedcluster_scenarios import managedcluster_scenarios
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions
import kraken.cerberus.setup as cerberus
# Get the managedcluster scenarios object of specified cloud type
def get_managedcluster_scenario_object(managedcluster_scenario):
return managedcluster_scenarios()
# Run defined scenarios
def run(scenarios_list, config, wait_duration):
for managedcluster_scenario_config in scenarios_list:
with open(managedcluster_scenario_config, "r") as f:
managedcluster_scenario_config = yaml.full_load(f)
for managedcluster_scenario in managedcluster_scenario_config["managedcluster_scenarios"]:
managedcluster_scenario_object = get_managedcluster_scenario_object(managedcluster_scenario)
if managedcluster_scenario["actions"]:
for action in managedcluster_scenario["actions"]:
start_time = int(time.time())
inject_managedcluster_scenario(action, managedcluster_scenario, managedcluster_scenario_object)
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.get_status(config, start_time, end_time)
logging.info("")
# Inject the specified managedcluster scenario
def inject_managedcluster_scenario(action, managedcluster_scenario, managedcluster_scenario_object):
# Get the managedcluster scenario configurations
run_kill_count = managedcluster_scenario.get("runs", 1)
instance_kill_count = managedcluster_scenario.get("instance_count", 1)
managedcluster_name = managedcluster_scenario.get("managedcluster_name", "")
label_selector = managedcluster_scenario.get("label_selector", "")
timeout = managedcluster_scenario.get("timeout", 120)
# Get the managedcluster to apply the scenario
if managedcluster_name:
managedcluster_name_list = managedcluster_name.split(",")
else:
managedcluster_name_list = [managedcluster_name]
for single_managedcluster_name in managedcluster_name_list:
managedclusters = common_managedcluster_functions.get_managedcluster(single_managedcluster_name, label_selector, instance_kill_count)
for single_managedcluster in managedclusters:
if action == "managedcluster_start_scenario":
managedcluster_scenario_object.managedcluster_start_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_stop_scenario":
managedcluster_scenario_object.managedcluster_stop_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_stop_start_scenario":
managedcluster_scenario_object.managedcluster_stop_start_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_termination_scenario":
managedcluster_scenario_object.managedcluster_termination_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_reboot_scenario":
managedcluster_scenario_object.managedcluster_reboot_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "stop_start_klusterlet_scenario":
managedcluster_scenario_object.stop_start_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "start_klusterlet_scenario":
managedcluster_scenario_object.start_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "stop_klusterlet_scenario":
managedcluster_scenario_object.stop_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_crash_scenario":
managedcluster_scenario_object.managedcluster_crash_scenario(run_kill_count, single_managedcluster, timeout)
else:
logging.info("There is no managedcluster action that matches %s, skipping scenario" % action)

View File

@@ -34,7 +34,7 @@ def run(scenarios_list, config, wait_duration):
for single_node_name in node_name_list:
nodelst.extend(common_node_functions.get_node(single_node_name, test_node_label, test_instance_count))
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader)
env = Environment(loader=file_loader, autoescape=True)
pod_template = env.get_template("pod.j2")
test_interface = verify_interface(test_interface, nodelst, pod_template)
joblst = []

View File

@@ -17,7 +17,7 @@ class Azure:
credentials = DefaultAzureCredential()
logging.info("credential " + str(credentials))
az_account = runcommand.invoke("az account list -o yaml")
az_account_yaml = yaml.load(az_account, Loader=yaml.FullLoader)
az_account_yaml = yaml.safe_load(az_account)
subscription_id = az_account_yaml[0]["id"]
self.compute_client = ComputeManagementClient(credentials, subscription_id)

View File

@@ -73,8 +73,9 @@ def check_service_status(node, service, ssh_private_key, timeout):
)
if connection is None:
break
except Exception:
pass
except Exception as e:
logging.error("Failed to ssh to instance: %s within the timeout duration of %s: %s" % (node, timeout, e))
for service_name in service:
logging.info("Checking status of Service: %s" % (service_name))
stdin, stdout, stderr = ssh.exec_command(

View File

@@ -0,0 +1,109 @@
import kraken.node_actions.common_node_functions as nodeaction
from kraken.node_actions.abstract_node_scenarios import abstract_node_scenarios
import logging
import sys
import docker
class Docker:
def __init__(self):
self.client = docker.from_env()
def get_container_id(self, node_name):
container = self.client.containers.get(node_name)
return container.id
# Start the node instance
def start_instances(self, node_name):
container = self.client.containers.get(node_name)
container.start()
# Stop the node instance
def stop_instances(self, node_name):
container = self.client.containers.get(node_name)
container.stop()
# Reboot the node instance
def reboot_instances(self, node_name):
container = self.client.containers.get(node_name)
container.restart()
# Terminate the node instance
def terminate_instances(self, node_name):
container = self.client.containers.get(node_name)
container.stop()
container.remove()
class docker_node_scenarios(abstract_node_scenarios):
def __init__(self):
self.docker = Docker()
# Node scenario to start the node
def node_start_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_start_scenario injection")
container_id = self.docker.get_container_id(node)
logging.info("Starting the node %s with container ID: %s " % (node, container_id))
self.docker.start_instances(node)
nodeaction.wait_for_ready_status(node, timeout)
logging.info("Node with container ID: %s is in running state" % (container_id))
logging.info("node_start_scenario has been successfully injected!")
except Exception as e:
logging.error(
"Failed to start node instance. Encountered following " "exception: %s. Test Failed" % (e)
)
logging.error("node_start_scenario injection failed!")
sys.exit(1)
# Node scenario to stop the node
def node_stop_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_stop_scenario injection")
container_id = self.docker.get_container_id(node)
logging.info("Stopping the node %s with container ID: %s " % (node, container_id))
self.docker.stop_instances(node)
logging.info("Node with container ID: %s is in stopped state" % (container_id))
nodeaction.wait_for_unknown_status(node, timeout)
except Exception as e:
logging.error("Failed to stop node instance. Encountered following exception: %s. " "Test Failed" % (e))
logging.error("node_stop_scenario injection failed!")
sys.exit(1)
# Node scenario to terminate the node
def node_termination_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_termination_scenario injection")
container_id = self.docker.get_container_id(node)
logging.info("Terminating the node %s with container ID: %s " % (node, container_id))
self.docker.terminate_instances(node)
logging.info("Node with container ID: %s has been terminated" % (container_id))
logging.info("node_termination_scenario has been successfuly injected!")
except Exception as e:
logging.error(
"Failed to terminate node instance. Encountered following exception:" " %s. Test Failed" % (e)
)
logging.error("node_termination_scenario injection failed!")
sys.exit(1)
# Node scenario to reboot the node
def node_reboot_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_reboot_scenario injection")
container_id = self.docker.get_container_id(node)
logging.info("Rebooting the node %s with container ID: %s " % (node, container_id))
self.docker.reboot_instances(node)
nodeaction.wait_for_unknown_status(node, timeout)
nodeaction.wait_for_ready_status(node, timeout)
logging.info("Node with container ID: %s has been rebooted" % (container_id))
logging.info("node_reboot_scenario has been successfuly injected!")
except Exception as e:
logging.error(
"Failed to reboot node instance. Encountered following exception:" " %s. Test Failed" % (e)
)
logging.error("node_reboot_scenario injection failed!")
sys.exit(1)
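The Docker wrapper above maps each node action onto the Docker SDK (start/stop/restart/remove). A quick, hypothetical smoke test against a locally running cluster-node container might look like this:

````python
# Minimal sketch: drive the Docker wrapper directly. The container name
# "kind-worker" is a placeholder for a locally running cluster-node container.
from kraken.node_actions.docker_node_scenarios import Docker

docker_helper = Docker()
container_id = docker_helper.get_container_id("kind-worker")
print("Restarting container %s" % container_id)
docker_helper.reboot_instances("kind-worker")  # docker restart under the hood
````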

View File

@@ -9,6 +9,7 @@ from kraken.node_actions.gcp_node_scenarios import gcp_node_scenarios
from kraken.node_actions.openstack_node_scenarios import openstack_node_scenarios
from kraken.node_actions.alibaba_node_scenarios import alibaba_node_scenarios
from kraken.node_actions.bm_node_scenarios import bm_node_scenarios
from kraken.node_actions.docker_node_scenarios import docker_node_scenarios
import kraken.node_actions.common_node_functions as common_node_functions
import kraken.cerberus.setup as cerberus
@@ -36,6 +37,8 @@ def get_node_scenario_object(node_scenario):
return bm_node_scenarios(
node_scenario.get("bmc_info"), node_scenario.get("bmc_user", None), node_scenario.get("bmc_password", None)
)
elif node_scenario["cloud_type"] == "docker":
return docker_node_scenarios()
else:
logging.error(
"Cloud type " + node_scenario["cloud_type"] + " is not currently supported; "

View File

@@ -7,19 +7,19 @@ import sys
# Installs a mutable grafana on the Kubernetes/OpenShift cluster and loads the performance dashboards
def setup(repo, distribution):
if distribution == "kubernetes":
command = "cd /tmp/performance-dashboards/dittybopper && ./k8s-deploy.sh"
command = "cd performance-dashboards/dittybopper && ./k8s-deploy.sh"
elif distribution == "openshift":
command = "cd /tmp/performance-dashboards/dittybopper && ./deploy.sh"
command = "cd performance-dashboards/dittybopper && ./deploy.sh"
else:
logging.error("Provided distribution: %s is not supported" % (distribution))
sys.exit(1)
delete_repo = "rm -rf /tmp/performance-dashboards || exit 0"
delete_repo = "rm -rf performance-dashboards || exit 0"
logging.info("Cloning, installing mutable grafana on the cluster and loading the dashboards")
try:
# delete repo to clone the latest copy if exists
subprocess.run(delete_repo, shell=True, universal_newlines=True, timeout=45)
# clone the repo
git.Repo.clone_from(repo, "/tmp/performance-dashboards")
git.Repo.clone_from(repo, "performance-dashboards")
# deploy performance dashboards
subprocess.run(command, shell=True, universal_newlines=True)
except Exception as e:

View File

@@ -3,12 +3,15 @@ import json
import logging
from os.path import abspath
from typing import List, Dict
import time
from arcaflow_plugin_sdk import schema, serialization, jsonschema
import kraken.plugins.vmware.vmware_plugin as vmware_plugin
from kraken.plugins.pod_plugin import kill_pods, wait_for_pods
from arcaflow_plugin_kill_pod import kill_pods, wait_for_pods
import kraken.plugins.node_scenarios.vmware_plugin as vmware_plugin
import kraken.plugins.node_scenarios.ibmcloud_plugin as ibmcloud_plugin
from kraken.plugins.run_python_plugin import run_python_file
from kraken.plugins.network.ingress_shaping import network_chaos
from kraken.plugins.pod_network_outage.pod_network_outage_plugin import pod_outage
@dataclasses.dataclass
@@ -39,7 +42,7 @@ class Plugins:
)
self.steps_by_id[step.schema.id] = step
def run(self, file: str, kubeconfig_path: str):
def run(self, file: str, kubeconfig_path: str, kraken_config: str):
"""
Run executes a series of steps
"""
@@ -86,6 +89,8 @@ class Plugins:
unserialized_input = step.schema.input.unserialize(entry["config"])
if "kubeconfig_path" in step.schema.input.properties:
unserialized_input.kubeconfig_path = kubeconfig_path
if "kraken_config" in step.schema.input.properties:
unserialized_input.kraken_config = kraken_config
output_id, output_data = step.schema(unserialized_input)
logging.info(step.render_output(output_id, output_data) + "\n")
if output_id in step.error_output_ids:
@@ -180,7 +185,31 @@ PLUGINS = Plugins(
]
),
PluginStep(
network_chaos,
ibmcloud_plugin.node_start,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_stop,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_reboot,
[
"error"
]
),
PluginStep(
ibmcloud_plugin.node_terminate,
[
"error"
]
),
PluginStep(
pod_outage,
[
"error"
]
@@ -189,12 +218,16 @@ PLUGINS = Plugins(
)
def run(scenarios: List[str], kubeconfig_path: str, failed_post_scenarios: List[str]) -> List[str]:
def run(scenarios: List[str], kubeconfig_path: str, kraken_config: str, failed_post_scenarios: List[str], wait_duration: int) -> List[str]:
for scenario in scenarios:
logging.info('scenario '+ str(scenario))
try:
PLUGINS.run(scenario, kubeconfig_path)
PLUGINS.run(scenario, kubeconfig_path, kraken_config)
except Exception as e:
failed_post_scenarios.append(scenario)
logging.error("Error while running {}: {}".format(scenario, e))
return failed_post_scenarios
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
return failed_post_scenarios
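A scenario file handed to run()/Plugins.run() is a YAML list of steps whose config blocks are unserialized against the registered step schemas; kubeconfig_path and kraken_config are injected whenever a step schema declares those properties. A sketch, assuming the id/config layout and using the ibmcloud-node-reboot step registered in this changeset (the selector value is a placeholder):

````yaml
# Sketch of a plugin scenario file; kubeconfig_path and kraken_config are
# filled in by Plugins.run() at runtime when the step schema declares them.
- id: ibmcloud-node-reboot
  config:
    label_selector: node-role.kubernetes.io/worker   # placeholder selector
    instance_count: 1
    runs: 1
    timeout: 180
````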

View File

@@ -0,0 +1,563 @@
#!/usr/bin/env python
import sys
import time
import typing
from os import environ
from dataclasses import dataclass, field
import random
from traceback import format_exc
import logging
from kraken.plugins.node_scenarios import kubernetes_functions as kube_helper
from arcaflow_plugin_sdk import validation, plugin
from kubernetes import client, watch
from ibm_vpc import VpcV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_cloud_sdk_core import ApiException
import requests
import sys
class IbmCloud:
def __init__(self):
"""
Initialize the IBM Cloud client by using the env variables:
'IBMC_APIKEY' 'IBMC_URL'
"""
apiKey = environ.get("IBMC_APIKEY")
service_url = environ.get("IBMC_URL")
if not apiKey:
raise Exception(
"Environmental variable 'IBMC_APIKEY' is not set"
)
if not service_url:
raise Exception(
"Environmental variable 'IBMC_URL' is not set"
)
try:
authenticator = IAMAuthenticator(apiKey)
self.service = VpcV1(authenticator=authenticator)
self.service.set_service_url(service_url)
except Exception as e:
logging.error("error authenticating" + str(e))
sys.exit(1)
def delete_instance(self, instance_id):
"""
Deletes the Instance whose name is given by 'instance_id'
"""
try:
self.service.delete_instance(instance_id)
logging.info("Deleted Instance -- '{}'".format(instance_id))
except Exception as e:
logging.info(
"Instance '{}' could not be deleted. ".format(
instance_id
)
)
return False
def reboot_instances(self, instance_id):
"""
Reboots the Instance whose name is given by 'instance_id'. Returns True if successful, or
returns False if the Instance is not powered on
"""
try:
self.service.create_instance_action(
instance_id,
type='reboot',
)
logging.info("Reset Instance -- '{}'".format(instance_id))
return True
except Exception as e:
logging.info(
"Instance '{}' could not be rebooted".format(
instance_id
)
)
return False
def stop_instances(self, instance_id):
"""
Stops the Instance whose name is given by 'instance_id'. Returns True if successful, or
returns False if the Instance is already stopped
"""
try:
self.service.create_instance_action(
instance_id,
type='stop',
)
logging.info("Stopped Instance -- '{}'".format(instance_id))
return True
except Exception as e:
logging.info(
"Instance '{}' could not be stopped".format(instance_id)
)
logging.info("error" + str(e))
return False
def start_instances(self, instance_id):
"""
Starts the Instance whose name is given by 'instance_id'. Returns True if successful, or
returns False if the Instance is already running
"""
try:
self.service.create_instance_action(
instance_id,
type='start',
)
logging.info("Started Instance -- '{}'".format(instance_id))
return True
except Exception as e:
logging.info("Instance '{}' could not start running".format(instance_id))
return False
def list_instances(self):
"""
Returns a list of Instances present in the datacenter
"""
instance_names = []
try:
instances_result = self.service.list_instances().get_result()
instances_list = instances_result['instances']
for vpc in instances_list:
instance_names.append({"vpc_name": vpc['name'], "vpc_id": vpc['id']})
starting_count = instances_result['total_count']
while instances_result['total_count'] == instances_result['limit']:
instances_result = self.service.list_instances(start=starting_count).get_result()
instances_list = instances_result['instances']
starting_count += instances_result['total_count']
for vpc in instances_list:
instance_names.append({"vpc_name": vpc.name, "vpc_id": vpc.id})
except Exception as e:
logging.error("Error listing out instances: " + str(e))
sys.exit(1)
return instance_names
def find_id_in_list(self, name, vpc_list):
for vpc in vpc_list:
if vpc['vpc_name'] == name:
return vpc['vpc_id']
def get_instance_status(self, instance_id):
"""
Returns the status of the Instance whose name is given by 'instance_id'
"""
try:
instance = self.service.get_instance(instance_id).get_result()
state = instance['status']
return state
except Exception as e:
logging.error(
"Failed to get node instance status %s. Encountered following "
"exception: %s." % (instance_id, e)
)
return None
def wait_until_deleted(self, instance_id, timeout):
"""
Waits until the instance is deleted or until the timeout. Returns True if
the instance is successfully deleted, else returns False
"""
time_counter = 0
vpc = self.get_instance_status(instance_id)
while vpc is not None:
vpc = self.get_instance_status(instance_id)
logging.info(
"Instance %s is still being deleted, sleeping for 5 seconds" % instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info(
"Instance %s is still not deleted in allotted time" % instance_id
)
return False
return True
def wait_until_running(self, instance_id, timeout):
"""
Waits until the Instance switches to running state or until the timeout.
Returns True if the Instance switches to running, else returns False
"""
time_counter = 0
status = self.get_instance_status(instance_id)
while status != "running":
status = self.get_instance_status(instance_id)
logging.info(
"Instance %s is still not running, sleeping for 5 seconds" % instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info("Instance %s is still not ready in allotted time" % instance_id)
return False
return True
def wait_until_stopped(self, instance_id, timeout):
"""
Waits until the Instance switches to stopped state or until the timeout.
Returns True if the Instance switches to stopped, else returns False
"""
time_counter = 0
status = self.get_instance_status(instance_id)
while status != "stopped":
status = self.get_instance_status(instance_id)
logging.info(
"Instance %s is still not stopped, sleeping for 5 seconds" % instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info("Instance %s is still not stopped in allotted time" % instance_id)
return False
return True
def wait_until_rebooted(self, instance_id, timeout):
"""
Waits until the Instance switches to restarting state and then running state or until the timeout.
Returns True if the Instance switches back to running, else returns False
"""
time_counter = 0
status = self.get_instance_status(instance_id)
while status == "starting":
status = self.get_instance_status(instance_id)
logging.info(
"Instance %s is still restarting, sleeping for 5 seconds" % instance_id
)
time.sleep(5)
time_counter += 5
if time_counter >= timeout:
logging.info("Instance %s is still restarting after allotted time" % instance_id)
return False
self.wait_until_running(instance_id, timeout)
return True
@dataclass
class Node:
name: str
@dataclass
class NodeScenarioSuccessOutput:
nodes: typing.Dict[int, Node] = field(
metadata={
"name": "Nodes started/stopped/terminated/rebooted",
"description": """Map between timestamps and the pods started/stopped/terminated/rebooted.
The timestamp is provided in nanoseconds""",
}
)
action: kube_helper.Actions = field(
metadata={
"name": "The action performed on the node",
"description": """The action performed or attempted to be performed on the node. Possible values
are : Start, Stop, Terminate, Reboot""",
}
)
@dataclass
class NodeScenarioErrorOutput:
error: str
action: kube_helper.Actions = field(
metadata={
"name": "The action performed on the node",
"description": """The action attempted to be performed on the node. Possible values are : Start
Stop, Terminate, Reboot""",
}
)
@dataclass
class NodeScenarioConfig:
name: typing.Annotated[
typing.Optional[str],
validation.required_if_not("label_selector"),
validation.required_if("skip_openshift_checks"),
] = field(
default=None,
metadata={
"name": "Name",
"description": "Name(s) for target nodes. Required if label_selector is not set.",
},
)
runs: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=1,
metadata={
"name": "Number of runs per node",
"description": "Number of times to inject each scenario under actions (will perform on same node each time)",
},
)
label_selector: typing.Annotated[
typing.Optional[str],
validation.min(1),
validation.required_if_not("name")
] = field(
default=None,
metadata={
"name": "Label selector",
"description": "Kubernetes label selector for the target nodes. Required if name is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details.",
},
)
timeout: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=180,
metadata={
"name": "Timeout",
"description": "Timeout to wait for the target pod(s) to be removed in seconds.",
},
)
instance_count: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=1,
metadata={
"name": "Instance Count",
"description": "Number of nodes to perform action/select that match the label selector.",
},
)
skip_openshift_checks: typing.Optional[bool] = field(
default=False,
metadata={
"name": "Skip Openshift Checks",
"description": "Skip checking the status of the openshift nodes.",
},
)
kubeconfig_path: typing.Optional[str] = field(
default=None,
metadata={
"name": "Kubeconfig path",
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for "
"details.",
},
)
@plugin.step(
id="ibmcloud-node-start",
name="Start the node",
description="Start the node(s) by starting the Ibmcloud Instance on which the node is configured",
outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_start(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
ibmcloud = IbmCloud()
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.START, core_v1)
node_name_id_list = ibmcloud.list_instances()
nodes_started = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info("Starting node_start_scenario injection")
logging.info("Starting the node %s " % (name))
instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
if instance_id:
vm_started = ibmcloud.start_instances(instance_id)
if vm_started:
ibmcloud.wait_until_running(instance_id, cfg.timeout)
if not cfg.skip_openshift_checks:
kube_helper.wait_for_ready_status(
name, cfg.timeout, watch_resource, core_v1
)
nodes_started[int(time.time_ns())] = Node(name=name)
logging.info("Node with instance ID: %s is in running state" % name)
logging.info("node_start_scenario has been successfully injected!")
else:
logging.error("Failed to find node that matched instances on ibm cloud in region")
return "error", NodeScenarioErrorOutput(
"No matching vpc with node name " + name, kube_helper.Actions.START
)
except Exception as e:
logging.error("Failed to start node instance. Test Failed")
logging.error("node_start_scenario injection failed!")
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.START
)
return "success", NodeScenarioSuccessOutput(
nodes_started, kube_helper.Actions.START
)
@plugin.step(
id="ibmcloud-node-stop",
name="Stop the node",
description="Stop the node(s) by starting the Ibmcloud Instance on which the node is configured",
outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_stop(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
ibmcloud = IbmCloud()
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
logging.info('set up done')
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.STOP, core_v1)
logging.info("set node list" + str(node_list))
node_name_id_list = ibmcloud.list_instances()
logging.info('node names' + str(node_name_id_list))
nodes_stopped = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info("Starting node_stop_scenario injection")
logging.info("Stopping the node %s " % (name))
instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
if instance_id:
vm_stopped = ibmcloud.stop_instances(instance_id)
if vm_stopped:
ibmcloud.wait_until_stopped(instance_id, cfg.timeout)
if not cfg.skip_openshift_checks:
kube_helper.wait_for_unknown_status(
name, cfg.timeout, watch_resource, core_v1
)
nodes_stopped[int(time.time_ns())] = Node(name=name)
logging.info("Node with instance ID: %s is in stopped state" % name)
logging.info("node_stop_scenario has been successfully injected!")
else:
logging.error("Failed to find node that matched instances on ibm cloud in region")
return "error", NodeScenarioErrorOutput(
"No matching vpc with node name " + name, kube_helper.Actions.STOP
)
except Exception as e:
logging.error("Failed to stop node instance. Test Failed")
logging.error("node_stop_scenario injection failed!")
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.STOP
)
return "success", NodeScenarioSuccessOutput(
nodes_stopped, kube_helper.Actions.STOP
)
@plugin.step(
id="ibmcloud-node-reboot",
name="Reboot Ibmcloud Instance",
description="Reboot the node(s) by starting the Ibmcloud Instance on which the node is configured",
outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_reboot(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
ibmcloud = IbmCloud()
core_v1 = client.CoreV1Api(cli)
watch_resource = watch.Watch()
node_list = kube_helper.get_node_list(cfg, kube_helper.Actions.REBOOT, core_v1)
node_name_id_list = ibmcloud.list_instances()
nodes_rebooted = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info("Starting node_reboot_scenario injection")
logging.info("Rebooting the node %s " % (name))
instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
if instance_id:
ibmcloud.reboot_instances(instance_id)
ibmcloud.wait_until_rebooted(instance_id, cfg.timeout)
if not cfg.skip_openshift_checks:
kube_helper.wait_for_unknown_status(
name, cfg.timeout, watch_resource, core_v1
)
kube_helper.wait_for_ready_status(
name, cfg.timeout, watch_resource, core_v1
)
nodes_rebooted[int(time.time_ns())] = Node(name=name)
logging.info(
"Node with instance ID: %s has rebooted successfully" % name
)
logging.info("node_reboot_scenario has been successfully injected!")
else:
logging.error("Failed to find node that matched instances on ibm cloud in region")
return "error", NodeScenarioErrorOutput(
"No matching vpc with node name " + name, kube_helper.Actions.REBOOT
)
except Exception as e:
logging.error("Failed to reboot node instance. Test Failed")
logging.error("node_reboot_scenario injection failed!")
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.REBOOT
)
return "success", NodeScenarioSuccessOutput(
nodes_rebooted, kube_helper.Actions.REBOOT
)
@plugin.step(
id="ibmcloud-node-terminate",
name="Reboot Ibmcloud Instance",
description="Wait for node to be deleted",
outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_terminate(
cfg: NodeScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NodeScenarioSuccessOutput, NodeScenarioErrorOutput]
]:
with kube_helper.setup_kubernetes(None) as cli:
ibmcloud = IbmCloud()
core_v1 = client.CoreV1Api(cli)
node_list = kube_helper.get_node_list(
cfg, kube_helper.Actions.TERMINATE, core_v1
)
node_name_id_list = ibmcloud.list_instances()
nodes_terminated = {}
for name in node_list:
try:
for _ in range(cfg.runs):
logging.info(
"Starting node_termination_scenario injection by first stopping the node"
)
instance_id = ibmcloud.find_id_in_list(name, node_name_id_list)
logging.info("Deleting the node with instance ID: %s " % (name))
if instance_id:
ibmcloud.delete_instance(instance_id)
ibmcloud.wait_until_deleted(instance_id, cfg.timeout)
nodes_terminated[int(time.time_ns())] = Node(name=name)
logging.info("Node with instance ID: %s has been released" % name)
logging.info("node_terminate_scenario has been successfully injected!")
else:
logging.error("Failed to find instances that matched the node specifications on ibm cloud in the set region")
return "error", NodeScenarioErrorOutput(
"No matching vpc with node name " + name, kube_helper.Actions.TERMINATE
)
except Exception as e:
logging.error("Failed to terminate node instance. Test Failed")
logging.error("node_terminate_scenario injection failed!")
return "error", NodeScenarioErrorOutput(
format_exc(), kube_helper.Actions.TERMINATE
)
return "success", NodeScenarioSuccessOutput(
nodes_terminated, kube_helper.Actions.TERMINATE
)
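The IbmCloud wrapper can also be exercised on its own; a minimal sketch, assuming IBMC_APIKEY and IBMC_URL are exported and using a hypothetical node name:

````python
# Sketch only: reboot a single IBM Cloud VPC instance via the wrapper above.
from kraken.plugins.node_scenarios.ibmcloud_plugin import IbmCloud

ibm = IbmCloud()                                     # reads IBMC_APIKEY / IBMC_URL
instances = ibm.list_instances()                     # [{"vpc_name": ..., "vpc_id": ...}, ...]
instance_id = ibm.find_id_in_list("worker-node-0", instances)  # placeholder name
if instance_id:
    ibm.reboot_instances(instance_id)
    ibm.wait_until_rebooted(instance_id, 180)
````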

View File

@@ -7,7 +7,6 @@ import typing
from dataclasses import dataclass, field
from os import environ
from traceback import format_exc
import requests
from arcaflow_plugin_sdk import plugin, validation
from com.vmware.vapi.std.errors_client import (AlreadyInDesiredState,
@@ -17,7 +16,7 @@ from com.vmware.vcenter_client import VM, ResourcePool
from kubernetes import client, watch
from vmware.vapi.vsphere.client import create_vsphere_client
from kraken.plugins.vmware import kubernetes_functions as kube_helper
from kraken.plugins.node_scenarios import kubernetes_functions as kube_helper
class vSphere:
@@ -534,7 +533,7 @@ class NodeScenarioConfig:
@plugin.step(
id="node_start_scenario",
id="vmware-node-start",
name="Start the node",
description="Start the node(s) by starting the VMware VM "
"on which the node is configured",
@@ -593,7 +592,7 @@ def node_start(
@plugin.step(
id="node_stop_scenario",
id="vmware-node-stop",
name="Stop the node",
description="Stop the node(s) by starting the VMware VM "
"on which the node is configured",
@@ -652,7 +651,7 @@ def node_stop(
@plugin.step(
id="node_reboot_scenario",
id="vmware-node-reboot",
name="Reboot VMware VM",
description="Reboot the node(s) by starting the VMware VM "
"on which the node is configured",
@@ -713,13 +712,10 @@ def node_reboot(
@plugin.step(
id="node_terminate_scenario",
id="vmware-node-terminate",
name="Reboot VMware VM",
description="Wait for the specified number of pods to be present",
outputs={
"success": NodeScenarioSuccessOutput,
"error": NodeScenarioErrorOutput
},
description="Wait for the node to be terminated",
outputs={"success": NodeScenarioSuccessOutput, "error": NodeScenarioErrorOutput},
)
def node_terminate(
cfg: NodeScenarioConfig,

View File

@@ -0,0 +1,157 @@
import logging
import requests
import sys
import json
def get_status(config, start_time, end_time):
"""
Function to get Cerberus status
Args:
config
- Kraken config dictionary
start_time
- The time when chaos is injected
end_time
- The time when chaos is removed
Returns:
Cerberus status
"""
cerberus_status = True
check_application_routes = False
application_routes_status = True
if config["cerberus"]["cerberus_enabled"]:
cerberus_url = config["cerberus"]["cerberus_url"]
check_application_routes = config["cerberus"]["check_applicaton_routes"]
if not cerberus_url:
logging.error(
"url where Cerberus publishes True/False signal is not provided.")
sys.exit(1)
cerberus_status = requests.get(cerberus_url, timeout=60).content
cerberus_status = True if cerberus_status == b"True" else False
# Fail if the application routes monitored by cerberus experience
# downtime during the chaos
if check_application_routes:
application_routes_status, unavailable_routes = application_status(
cerberus_url, start_time, end_time)
if not application_routes_status:
logging.error(
"Application routes: %s monitored by cerberus encountered downtime during the run, failing"
% unavailable_routes
)
else:
logging.info(
"Application routes being monitored didn't encounter any downtime during the run!")
if not cerberus_status:
logging.error(
"Received a no-go signal from Cerberus, looks like "
"the cluster is unhealthy. Please check the Cerberus "
"report for more details. Test failed."
)
if not application_routes_status or not cerberus_status:
sys.exit(1)
else:
logging.info(
"Received a go signal from Ceberus, the cluster is healthy. "
"Test passed.")
return cerberus_status
def publish_kraken_status(config, failed_post_scenarios, start_time, end_time):
"""
Function to publish Kraken status to Cerberus
Args:
config
- Kraken config dictionary
failed_post_scenarios
- String containing the failed post scenarios
start_time
- The time when chaos is injected
end_time
- The time when chaos is removed
"""
cerberus_status = get_status(config, start_time, end_time)
if not cerberus_status:
if failed_post_scenarios:
if config["kraken"]["exit_on_failure"]:
logging.info(
"Cerberus status is not healthy and post action scenarios " "are still failing, exiting kraken run"
)
sys.exit(1)
else:
logging.info(
"Cerberus status is not healthy and post action scenarios "
"are still failing")
else:
if failed_post_scenarios:
if config["kraken"]["exit_on_failure"]:
logging.info(
"Cerberus status is healthy but post action scenarios " "are still failing, exiting kraken run"
)
sys.exit(1)
else:
logging.info(
"Cerberus status is healthy but post action scenarios "
"are still failing")
def application_status(cerberus_url, start_time, end_time):
"""
Function to check application availability
Args:
cerberus_url
- url where Cerberus publishes True/False signal
start_time
- The time when chaos is injected
end_time
- The time when chaos is removed
Returns:
Application status and failed routes
"""
if not cerberus_url:
logging.error(
"url where Cerberus publishes True/False signal is not provided.")
sys.exit(1)
else:
duration = (end_time - start_time) / 60
url = cerberus_url + "/" + "history" + \
"?" + "loopback=" + str(duration)
logging.info(
"Scraping the metrics for the test duration from cerberus url: %s" %
url)
try:
failed_routes = []
status = True
metrics = requests.get(url, timeout=60).content
metrics_json = json.loads(metrics)
for entry in metrics_json["history"]["failures"]:
if entry["component"] == "route":
name = entry["name"]
failed_routes.append(name)
status = False
else:
continue
except Exception as e:
logging.error(
"Failed to scrape metrics from cerberus API at %s: %s" %
(url, e))
sys.exit(1)
return status, set(failed_routes)
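The config keys read above live under the cerberus and kraken sections of the Kraken config; a minimal sketch with placeholder values (the key spelling matches what the code reads):

````yaml
# Sketch of the config sections consumed by get_status/publish_kraken_status.
cerberus:
  cerberus_enabled: True
  cerberus_url: http://0.0.0.0:8080        # placeholder Cerberus endpoint
  check_applicaton_routes: False           # key spelled exactly as the code expects
kraken:
  exit_on_failure: False
````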

View File

@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
name: chaos-{{jobname}}
spec:
template:
spec:
nodeName: {{nodename}}
hostNetwork: true
containers:
- name: networkchaos
image: docker.io/fedora/tools
command: ["chroot", "/host", "/bin/sh", "-c", "{{cmd}}"]
securityContext:
privileged: true
volumeMounts:
- name: host
mountPath: /host
volumes:
- name: host
hostPath:
path: /
restartPolicy: Never
backoffLimit: 0
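This job template is rendered by apply_net_policy later in this changeset with jobname, nodename and cmd before being passed to create_job. A minimal rendering sketch with placeholder values:

````python
# Sketch only: render job.j2 the way the pod network outage plugin does.
import yaml
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("."))      # directory containing job.j2
job_body = yaml.safe_load(
    env.get_template("job.j2").render(
        jobname="demo1",                             # placeholder values
        nodename="worker-0",
        cmd="ovs-ofctl -O OpenFlow13 dump-flows br-int",
    )
)
print(job_body["metadata"]["name"])                  # -> chaos-demo1
````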

View File

@@ -0,0 +1,263 @@
from kubernetes import config, client
from kubernetes.client.rest import ApiException
from kubernetes.stream import stream
import sys
import time
import logging
import random
def setup_kubernetes(kubeconfig_path) -> client.ApiClient:
"""
Sets up the Kubernetes client
"""
if kubeconfig_path is None:
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
client_config = config.load_kube_config(kubeconfig_path)
return client.ApiClient(client_config)
def create_job(batch_cli, body, namespace="default"):
"""
Function used to create a job from a YAML config
"""
try:
api_response = batch_cli.create_namespaced_job(
body=body, namespace=namespace)
return api_response
except ApiException as api:
logging.warn(
"Exception when calling \
BatchV1Api->create_job: %s"
% api
)
if api.status == 409:
logging.warn("Job already present")
except Exception as e:
logging.error(
"Exception when calling \
BatchV1Api->create_namespaced_job: %s"
% e
)
raise
def delete_pod(cli, name, namespace):
"""
Function that deletes a pod and waits until deletion is complete
"""
try:
cli.delete_namespaced_pod(name=name, namespace=namespace)
while cli.read_namespaced_pod(name=name, namespace=namespace):
time.sleep(1)
except ApiException as e:
if e.status == 404:
logging.info("Pod deleted")
else:
logging.error("Failed to delete pod %s" % e)
raise e
def create_pod(cli, body, namespace, timeout=120):
"""
Function used to create a pod from a YAML config
"""
try:
pod_stat = None
pod_stat = cli.create_namespaced_pod(body=body, namespace=namespace)
end_time = time.time() + timeout
while True:
pod_stat = cli.read_namespaced_pod(
name=body["metadata"]["name"], namespace=namespace)
if pod_stat.status.phase == "Running":
break
if time.time() > end_time:
raise Exception("Starting pod failed")
time.sleep(1)
except Exception as e:
logging.error("Pod creation failed %s" % e)
if pod_stat:
logging.error(pod_stat.status.container_statuses)
delete_pod(cli, body["metadata"]["name"], namespace)
sys.exit(1)
def exec_cmd_in_pod(cli, command, pod_name, namespace, container=None):
"""
Function used to execute a command in a running pod
"""
exec_command = command
try:
if container:
ret = stream(
cli.connect_get_namespaced_pod_exec,
pod_name,
namespace,
container=container,
command=exec_command,
stderr=True,
stdin=False,
stdout=True,
tty=False,
)
else:
ret = stream(
cli.connect_get_namespaced_pod_exec,
pod_name,
namespace,
command=exec_command,
stderr=True,
stdin=False,
stdout=True,
tty=False,
)
except BaseException:
return False
return ret
def list_pods(cli, namespace, label_selector=None):
"""
Function used to list pods in a given namespace and having a certain label
"""
pods = []
try:
if label_selector:
ret = cli.list_namespaced_pod(
namespace, pretty=True, label_selector=label_selector)
else:
ret = cli.list_namespaced_pod(namespace, pretty=True)
except ApiException as e:
logging.error(
"Exception when calling \
CoreV1Api->list_namespaced_pod: %s\n"
% e
)
raise e
for pod in ret.items:
pods.append(pod.metadata.name)
return pods
def get_job_status(batch_cli, name, namespace="default"):
"""
Function that retrieves the status of a running job in a given namespace
"""
try:
return batch_cli.read_namespaced_job_status(
name=name, namespace=namespace)
except Exception as e:
logging.error(
"Exception when calling \
BatchV1Api->read_namespaced_job_status: %s"
% e
)
raise
def get_pod_log(cli, name, namespace="default"):
"""
Function that retrieves the logs of a running pod in a given namespace
"""
return cli.read_namespaced_pod_log(
name=name, namespace=namespace, _return_http_data_only=True, _preload_content=False
)
def read_pod(cli, name, namespace="default"):
"""
Function that retrieves the info of a running pod in a given namespace
"""
return cli.read_namespaced_pod(name=name, namespace=namespace)
def delete_job(batch_cli, name, namespace="default"):
"""
Deletes a job with the input name and namespace
"""
try:
api_response = batch_cli.delete_namespaced_job(
name=name,
namespace=namespace,
body=client.V1DeleteOptions(
propagation_policy="Foreground", grace_period_seconds=0),
)
logging.debug("Job deleted. status='%s'" % str(api_response.status))
return api_response
except ApiException as api:
logging.warn(
"Exception when calling \
BatchV1Api->create_namespaced_job: %s"
% api
)
logging.warn("Job already deleted\n")
except Exception as e:
logging.error(
"Exception when calling \
BatchV1Api->delete_namespaced_job: %s\n"
% e
)
sys.exit(1)
def list_ready_nodes(cli, label_selector=None):
"""
Returns a list of ready nodes
"""
nodes = []
try:
if label_selector:
ret = cli.list_node(pretty=True, label_selector=label_selector)
else:
ret = cli.list_node(pretty=True)
except ApiException as e:
logging.error("Exception when calling CoreV1Api->list_node: %s\n" % e)
raise e
for node in ret.items:
for cond in node.status.conditions:
if str(cond.type) == "Ready" and str(cond.status) == "True":
nodes.append(node.metadata.name)
return nodes
def get_node(node_name, label_selector, instance_kill_count, cli):
"""
Returns active node(s) on which the scenario can be performed
"""
if node_name in list_ready_nodes(cli):
return [node_name]
elif node_name:
logging.info(
"Node with provided node_name does not exist or the node might "
"be in NotReady state."
)
nodes = list_ready_nodes(cli, label_selector)
if not nodes:
raise Exception(
"Ready nodes with the provided label selector do not exist")
logging.info(
"Ready nodes with the label selector %s: %s" % (label_selector, nodes)
)
number_of_nodes = len(nodes)
if instance_kill_count == number_of_nodes:
return nodes
nodes_to_return = []
for i in range(instance_kill_count):
node_to_add = nodes[random.randint(0, len(nodes) - 1)]
nodes_to_return.append(node_to_add)
nodes.remove(node_to_add)
return nodes_to_return
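A minimal sketch of how these helpers fit together, assuming a reachable cluster, the default kubeconfig, and that this module sits next to the pod network outage plugin (import path and label selector are assumptions):

````python
# Sketch only: pick one ready node with the helpers above.
from kubernetes import client
from kraken.plugins.pod_network_outage import kubernetes_functions as kube_helper

api = kube_helper.setup_kubernetes(None)             # falls back to ~/.kube/config
core_v1 = client.CoreV1Api(api)
nodes = kube_helper.get_node(
    None,                                            # no explicit node name
    "node-role.kubernetes.io/worker",                # placeholder label selector
    1,                                               # select a single node
    core_v1,
)
print("Selected node(s):", nodes)
````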

View File

@@ -0,0 +1,30 @@
apiVersion: v1
kind: Pod
metadata:
name: modtools
spec:
nodeName: {{nodename}}
containers:
- name: modtools
image: docker.io/fedora/tools
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- "trap : TERM INT; sleep infinity & wait"
tty: true
stdin: true
stdinOnce: true
securityContext:
privileged: true
volumeMounts:
- name: host
mountPath: /host
volumes:
- name: host
hostPath:
path: /
hostNetwork: true
hostIPC: true
hostPID: true
restartPolicy: Never

View File

@@ -0,0 +1,731 @@
#!/usr/bin/env python3
import sys
import os
import typing
import yaml
import logging
import time
import random
from dataclasses import dataclass, field
from traceback import format_exc
from jinja2 import Environment, FileSystemLoader
from arcaflow_plugin_sdk import plugin, validation
from kubernetes import client
from kubernetes.client.api.core_v1_api import CoreV1Api
from kubernetes.client.api.batch_v1_api import BatchV1Api
from kubernetes.client.api.apiextensions_v1_api import ApiextensionsV1Api
from kubernetes.client.api.custom_objects_api import CustomObjectsApi
from . import kubernetes_functions as kube_helper
from . import cerberus
def get_test_pods(
pod_name: str,
pod_label: str,
namespace: str,
cli: CoreV1Api
) -> typing.List[str]:
"""
Function that returns a list of pods to apply network policy
Args:
pod_name (string)
- pod on which network policy need to be applied
pod_label (string)
- pods matching the label on which network policy
need to be applied
namespace (string)
- namespace in which the pod is present
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns:
pod names (string) in the namespace
"""
pods_list = []
pods_list = kube_helper.list_pods(
cli,
label_selector=pod_label,
namespace=namespace
)
if pod_name and pod_name not in pods_list:
raise Exception(
"pod name not found in namespace "
)
elif pod_name and pod_name in pods_list:
pods_list.clear()
pods_list.append(pod_name)
return pods_list
else:
return pods_list
def get_job_pods(cli: CoreV1Api, api_response):
"""
Function that gets the pod corresponding to the job
Args:
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
api_response
- The API response for the job status
Returns
Pod corresponding to the job
"""
controllerUid = api_response.metadata.labels["controller-uid"]
pod_label_selector = "controller-uid=" + controllerUid
pods_list = kube_helper.list_pods(
cli,
label_selector=pod_label_selector,
namespace="default"
)
return pods_list[0]
def delete_jobs(
cli: CoreV1Api,
batch_cli: BatchV1Api,
job_list: typing.List[str]
):
"""
Function that deletes jobs
Args:
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
batch_cli (BatchV1Api)
- Object to interact with Kubernetes Python client's BatchV1 API
job_list (List of strings)
- The list of jobs to delete
"""
for job_name in job_list:
try:
api_response = kube_helper.get_job_status(
batch_cli,
job_name,
namespace="default"
)
if api_response.status.failed is not None:
pod_name = get_job_pods(cli, api_response)
pod_stat = kube_helper.read_pod(
cli,
name=pod_name,
namespace="default"
)
logging.error(pod_stat.status.container_statuses)
pod_log_response = kube_helper.get_pod_log(
cli,
name=pod_name,
namespace="default"
)
pod_log = pod_log_response.data.decode("utf-8")
logging.error(pod_log)
except Exception as e:
logging.warn("Exception in getting job status: %s" % str(e))
api_response = kube_helper.delete_job(
batch_cli,
name=job_name,
namespace="default"
)
def wait_for_job(
batch_cli: BatchV1Api,
job_list: typing.List[str],
timeout: int = 300
) -> None:
"""
Function that waits for a list of jobs to finish within a time period
Args:
batch_cli (BatchV1Api)
- Object to interact with Kubernetes Python client's BatchV1 API
job_list (List of strings)
- The list of jobs to check for completion
timeout (int)
- Max duration to wait for checking whether the jobs are completed
"""
wait_time = time.time() + timeout
count = 0
job_len = len(job_list)
while count != job_len:
for job_name in job_list:
try:
api_response = kube_helper.get_job_status(
batch_cli,
job_name,
namespace="default"
)
if (
api_response.status.succeeded is not None or
api_response.status.failed is not None
):
count += 1
job_list.remove(job_name)
except Exception:
logging.warn("Exception in getting job status")
if time.time() > wait_time:
raise Exception(
"Jobs did not complete within "
"the {0}s timeout period".format(timeout)
)
time.sleep(5)
def get_bridge_name(
cli: ApiextensionsV1Api,
custom_obj: CustomObjectsApi
) -> str:
"""
Function that determines which bridge to target based on the cluster's network type
Args:
cli (ApiextensionsV1Api)
- Object to interact with Kubernetes Python client's ApiextensionsV1 API
custom_obj (CustomObjectsApi)
- Object to interact with Kubernetes Python client's CustomObjects API
Returns
The bridge name: 'br0' for OpenShiftSDN or 'br-int' for OVNKubernetes
"""
current_crds = [x['metadata']['name'].lower()
for x in cli.list_custom_resource_definition().to_dict()['items']]
if 'networks.config.openshift.io' not in current_crds:
raise Exception(
"OpenShiftSDN or OVNKubernetes not found in cluster "
)
else:
resource = custom_obj.get_cluster_custom_object(
group="config.openshift.io", version="v1", name="cluster", plural="networks")
network_type = resource["spec"]["networkType"]
if network_type == 'OpenShiftSDN':
bridge = 'br0'
elif network_type == 'OVNKubernetes':
bridge = 'br-int'
else:
raise Exception(
f'OpenShiftSDN or OVNKubernetes not found in cluster {network_type}'
)
return bridge
def apply_net_policy(
node_dict: typing.Dict[str, str],
ports: typing.List[str],
job_template,
pod_template,
direction: str,
duration: str,
bridge_name: str,
cli: CoreV1Api,
batch_cli: BatchV1Api
) -> typing.List[str]:
"""
Function that applies filters(ingress or egress) to block traffic.
Args:
node_dict (Dict)
- node to pod IP mapping
ports (List)
- List of ports to block
job_template (jinja2.environment.Template)
- The YAML template used to instantiate a job to apply and remove
the filters on the interfaces
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interface
direction (string)
- Direction (ingress or egress) in which the traffic is to be blocked
duration (string)
- Duration for which the filters stay applied before being removed
bridge_name (string):
- bridge to which filter rules need to be applied
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
batch_cli (BatchV1Api)
- Object to interact with Kubernetes Python client's BatchV1Api API
Returns:
The name of the job created that executes the commands on a node
for ingress chaos scenario
"""
job_list = []
cookie = random.randint(100, 10000)
net_direction = {'egress': 'nw_src', 'ingress': 'nw_dst'}
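# egress rules match the pod IP as the source (nw_src); ingress rules match it as the destination (nw_dst)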
br = 'br0'
table = 0
if bridge_name == 'br-int':
br = 'br-int'
table = 8
for node, ips in node_dict.items():
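# re-roll the cookie until it is not already in use on this node's bridge, so the final del-flows (matched by cookie) removes only the rules added here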
while len(check_cookie(node, pod_template, br, cookie, cli)) > 2:
cookie = random.randint(100, 10000)
exec_cmd = ''
for ip in ips:
for port in ports:
target_port = port
exec_cmd = f'{exec_cmd}ovs-ofctl -O OpenFlow13 add-flow {br} cookie={cookie},table={table},priority=65535,tcp,{net_direction[direction]}={ip},tp_dst={target_port},actions=drop;'
exec_cmd = f'{exec_cmd}ovs-ofctl -O OpenFlow13 add-flow {br} cookie={cookie},table={table},priority=65535,udp,{net_direction[direction]}={ip},tp_dst={target_port},actions=drop;'
if not ports:
exec_cmd = f'{exec_cmd}ovs-ofctl -O OpenFlow13 add-flow {br} cookie={cookie},table={table},priority=65535,ip,{net_direction[direction]}={ip},actions=drop;'
exec_cmd = f'{exec_cmd}sleep {duration};ovs-ofctl -O OpenFlow13 del-flows {br} cookie={cookie}/-1'
logging.info("Executing %s on node %s" % (exec_cmd, node))
job_body = yaml.safe_load(
job_template.render(
jobname=str(hash(node))[:5] + str(random.randint(0, 10000)),
nodename=node,
cmd=exec_cmd
)
)
api_response = kube_helper.create_job(batch_cli, job_body)
if api_response is None:
raise Exception("Error creating job")
job_list.append(job_body["metadata"]["name"])
return job_list
def list_bridges(
node: str,
pod_template,
cli: CoreV1Api
) -> typing.List[str]:
"""
Function that returns a list of bridges on the node
Args:
node (string)
- Node from which the list of bridges is to be returned
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interface
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns:
List of bridges on the node.
"""
pod_body = yaml.safe_load(pod_template.render(nodename=node))
logging.info("Creating pod to query bridge on node %s" % node)
kube_helper.create_pod(cli, pod_body, "default", 300)
try:
cmd = ["chroot", "/host", "ovs-vsctl", "list-br"]
output = kube_helper.exec_cmd_in_pod(cli, cmd, "modtools", "default")
if not output:
logging.error("Exception occurred while executing command in pod")
sys.exit(1)
bridges = output.split('\n')
finally:
logging.info("Deleting pod to query interface on node")
kube_helper.delete_pod(cli, "modtools", "default")
return bridges
def check_cookie(
node: str,
pod_template,
br_name,
cookie,
cli: CoreV1Api
) -> str:
"""
Function to check for matching flow rules
Args:
node (string):
- node in which to check for the flow rules
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interfaces
br_name (string):
- bridge against which the flows rules need to be checked
cookie (string):
- flows matching the cookie are listed
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns
Returns the matching flow rules
"""
pod_body = yaml.safe_load(pod_template.render(nodename=node))
logging.info("Creating pod to query duplicate rules on node %s" % node)
kube_helper.create_pod(cli, pod_body, "default", 300)
try:
cmd = ["chroot", "/host", "ovs-ofctl", "-O", "OpenFlow13",
"dump-flows", br_name, f'cookie={cookie}/-1']
output = kube_helper.exec_cmd_in_pod(cli, cmd, "modtools", "default")
if not output:
logging.error("Exception occurred while executing command in pod")
sys.exit(1)
flow_list = output.split('\n')
finally:
logging.info("Deleting pod to query interface on node")
kube_helper.delete_pod(cli, "modtools", "default")
return flow_list
def check_bridge_interface(
node_name: str,
pod_template,
bridge_name: str,
cli: CoreV1Api
) -> bool:
"""
Function is used to check if the required OVS or OVN bridge is found
on the node.
Args:
node_name (string):
- node in which to check for the bridge interface
pod_template (jinja2.environment.Template)
- The YAML template used to instantiate a pod to query
the node's interfaces
bridge_name (string):
- bridge name to check for in the node.
cli (CoreV1Api)
- Object to interact with Kubernetes Python client's CoreV1 API
Returns:
Returns True if the bridge is found in the node.
"""
nodes = kube_helper.get_node(node_name, None, 1, cli)
node_bridge = []
for node in nodes:
node_bridge = list_bridges(
node,
pod_template,
cli
)
if bridge_name not in node_bridge:
raise Exception(
f'OVS bridge {bridge_name} not found on the node '
)
return True
@dataclass
class InputParams:
"""
This is the data structure for the input parameters of the step defined below.
"""
namespace: typing.Annotated[str, validation.min(1)] = field(
metadata={
"name": "Namespace",
"description":
"Namespace of the pod to which filter need to be applied"
"for details."
}
)
direction: typing.List[str] = field(
default_factory=lambda: ['ingress', 'egress'],
metadata={
"name": "Direction",
"description":
"List of directions to apply filters"
"Default both egress and ingress."
}
)
ingress_ports: typing.List[int] = field(
default_factory=list,
metadata={
"name": "Ingress ports",
"description":
"List of ports to block traffic on"
"Default [], i.e. all ports"
}
)
egress_ports: typing.List[int] = field(
default_factory=list,
metadata={
"name": "Egress ports",
"description":
"List of ports to block traffic on"
"Default [], i.e. all ports"
}
)
kubeconfig_path: typing.Optional[str] = field(
default=None,
metadata={
"name": "Kubeconfig path",
"description": "Kubeconfig file as string\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for "
"details.",
},
)
pod_name: typing.Annotated[
typing.Optional[str], validation.required_if_not("label_selector"),
] = field(
default=None,
metadata={
"name": "Pod name",
"description":
"When label_selector is not specified, pod matching the name will be"
"selected for the chaos scenario"
}
)
label_selector: typing.Annotated[
typing.Optional[str], validation.required_if_not("pod_name")
] = field(
default=None,
metadata={
"name": "Label selector",
"description":
"Kubernetes label selector for the target pod. "
"When pod_name is not specified, pod with matching label_selector is selected for chaos scenario"
}
)
kraken_config: typing.Optional[str] = field(
default=None,
metadata={
"name": "Kraken Config",
"description":
"Path to the config file of Kraken. "
"Set this field if you wish to publish status onto Cerberus"
}
)
test_duration: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=120,
metadata={
"name": "Test duration",
"description":
"Duration for which each step of the ingress chaos testing "
"is to be performed.",
},
)
wait_duration: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=300,
metadata={
"name": "Wait Duration",
"description":
"Wait duration for finishing a test and its cleanup."
"Ensure that it is significantly greater than wait_duration"
}
)
instance_count: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=1,
metadata={
"name": "Instance Count",
"description":
"Number of pods to perform action/select that match "
"the label selector.",
}
)
@dataclass
class PodOutageSuccessOutput:
"""
This is the output data structure for the success case.
"""
test_pods: typing.List[str] = field(
metadata={
"name": "Test pods",
"description":
"List of test pods where the selected for chaos scenario"
}
)
direction: typing.List[str] = field(
metadata={
"name": "Direction",
"description":
"List of directions to which the filters were applied."
}
)
ingress_ports: typing.List[int] = field(
metadata={
"name": "Ingress ports",
"description":
"List of ports to block traffic on"
}
)
egress_ports: typing.List[int] = field(
metadata={
"name": "Egress ports",
"description": "List of ports to block traffic on"
}
)
@dataclass
class PodOutageErrorOutput:
error: str = field(
metadata={
"name": "Error",
"description":
"Error message when there is a run-time error during "
"the execution of the scenario"
}
)
@plugin.step(
id="pod_network_outage",
name="Pod Outage",
description="Blocks ingress and egress network traffic at pod level",
outputs={"success": PodOutageSuccessOutput, "error": PodOutageErrorOutput},
)
def pod_outage(
params: InputParams,
) -> typing.Tuple[str, typing.Union[PodOutageSuccessOutput, PodOutageErrorOutput]]:
"""
Function that performs pod outage chaos scenario based
on the provided configuration
Args:
params (InputParams,)
- The object containing the configuration for the scenario
Returns
A 'success' or 'error' message along with their details
"""
direction = ['ingress', 'egress']
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader)
job_template = env.get_template("job.j2")
pod_module_template = env.get_template("pod_module.j2")
test_namespace = params.namespace
test_label_selector = params.label_selector
test_pod_name = params.pod_name
filter_dict = {}
job_list = []
publish = False
if params.kraken_config:
failed_post_scenarios = ""
try:
with open(params.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error(
"Error reading Kraken config from %s" % params.kraken_config
)
return "error", PodOutageErrorOutput(
format_exc()
)
publish = True
for i in params.direction:
filter_dict[i] = getattr(params, f"{i}_ports")  # look up ingress_ports/egress_ports by name instead of eval
try:
ip_set = set()
node_dict = {}
label_set = set()
api = kube_helper.setup_kubernetes(params.kubeconfig_path)
cli = client.CoreV1Api(api)
batch_cli = client.BatchV1Api(api)
api_ext = client.ApiextensionsV1Api(api)
custom_obj = client.CustomObjectsApi(api)
br_name = get_bridge_name(api_ext, custom_obj)
pods_list = get_test_pods(
test_pod_name, test_label_selector, test_namespace, cli)
while len(pods_list) > params.instance_count:
pods_list.pop(random.randint(0, len(pods_list) - 1))
for pod_name in pods_list:
pod_stat = kube_helper.read_pod(cli, pod_name, test_namespace)
ip_set.add(pod_stat.status.pod_ip)
node_dict.setdefault(pod_stat.spec.node_name, [])
node_dict[pod_stat.spec.node_name].append(pod_stat.status.pod_ip)
for key, value in pod_stat.metadata.labels.items():
label_set.add("%s=%s" % (key, value))
check_bridge_interface(list(node_dict.keys())[
0], pod_module_template, br_name, cli)
for direction, ports in filter_dict.items():
job_list.extend(apply_net_policy(node_dict, ports, job_template, pod_module_template,
direction, params.test_duration, br_name, cli, batch_cli))
start_time = int(time.time())
logging.info("Waiting for job to finish")
wait_for_job(batch_cli, job_list[:], params.wait_duration + 20)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
)
return "success", PodOutageSuccessOutput(
test_pods=pods_list,
direction=params.direction,
ingress_ports=params.ingress_ports,
egress_ports=params.egress_ports
)
except Exception as e:
logging.error("Pod network outage scenario exiting due to Exception - %s" % e)
return "error", PodOutageErrorOutput(
format_exc()
)
finally:
logging.info("Deleting jobs(if any)")
delete_jobs(cli, batch_cli, job_list[:])

View File

@@ -1,269 +0,0 @@
#!/usr/bin/env python
import re
import sys
import time
import typing
from dataclasses import dataclass, field
import random
from datetime import datetime
from traceback import format_exc
from kubernetes import config, client
from kubernetes.client import V1PodList, V1Pod, ApiException, V1DeleteOptions
from arcaflow_plugin_sdk import validation, plugin, schema
def setup_kubernetes(kubeconfig_path):
if kubeconfig_path is None:
kubeconfig_path = config.KUBE_CONFIG_DEFAULT_LOCATION
kubeconfig = config.kube_config.KubeConfigMerger(kubeconfig_path)
if kubeconfig.config is None:
raise Exception(
'Invalid kube-config file: %s. '
'No configuration found.' % kubeconfig_path
)
loader = config.kube_config.KubeConfigLoader(
config_dict=kubeconfig.config,
)
client_config = client.Configuration()
loader.load_and_set(client_config)
return client.ApiClient(configuration=client_config)
def _find_pods(core_v1, label_selector, name_pattern, namespace_pattern):
pods: typing.List[V1Pod] = []
_continue = None
finished = False
while not finished:
pod_response: V1PodList = core_v1.list_pod_for_all_namespaces(
watch=False,
label_selector=label_selector
)
for pod in pod_response.items:
pod: V1Pod
if (name_pattern is None or name_pattern.match(pod.metadata.name)) and \
namespace_pattern.match(pod.metadata.namespace):
pods.append(pod)
_continue = pod_response.metadata._continue
if _continue is None:
finished = True
return pods
@dataclass
class Pod:
namespace: str
name: str
@dataclass
class PodKillSuccessOutput:
pods: typing.Dict[int, Pod] = field(metadata={
"name": "Pods removed",
"description": "Map between timestamps and the pods removed. The timestamp is provided in nanoseconds."
})
@dataclass
class PodWaitSuccessOutput:
pods: typing.List[Pod] = field(metadata={
"name": "Pods",
"description": "List of pods that have been found to run."
})
@dataclass
class PodErrorOutput:
error: str
@dataclass
class KillPodConfig:
"""
This is a configuration structure specific to pod kill scenario. It describes which pod from which
namespace(s) to select for killing and how many pods to kill.
"""
namespace_pattern: re.Pattern = field(metadata={
"name": "Namespace pattern",
"description": "Regular expression for target pod namespaces."
})
name_pattern: typing.Annotated[
typing.Optional[re.Pattern],
validation.required_if_not("label_selector")
] = field(default=None, metadata={
"name": "Name pattern",
"description": "Regular expression for target pods. Required if label_selector is not set."
})
kill: typing.Annotated[int, validation.min(1)] = field(
default=1,
metadata={"name": "Number of pods to kill", "description": "How many pods should we attempt to kill?"}
)
label_selector: typing.Annotated[
typing.Optional[str],
validation.min(1),
validation.required_if_not("name_pattern")
] = field(default=None, metadata={
"name": "Label selector",
"description": "Kubernetes label selector for the target pods. Required if name_pattern is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for details."
})
kubeconfig_path: typing.Optional[str] = field(default=None, metadata={
"name": "Kubeconfig path",
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ for "
"details."
})
timeout: int = field(default=180, metadata={
"name": "Timeout",
"description": "Timeout to wait for the target pod(s) to be removed in seconds."
})
backoff: int = field(default=1, metadata={
"name": "Backoff",
"description": "How many seconds to wait between checks for the target pod status."
})
@plugin.step(
"kill-pods",
"Kill pods",
"Kill pods as specified by parameters",
{"success": PodKillSuccessOutput, "error": PodErrorOutput}
)
def kill_pods(cfg: KillPodConfig) -> typing.Tuple[str, typing.Union[PodKillSuccessOutput, PodErrorOutput]]:
try:
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
# region Select target pods
pods = _find_pods(core_v1, cfg.label_selector, cfg.name_pattern, cfg.namespace_pattern)
if len(pods) < cfg.kill:
return "error", PodErrorOutput(
"Not enough pods match the criteria, expected {} but found only {} pods".format(cfg.kill, len(pods))
)
random.shuffle(pods)
# endregion
# region Remove pods
killed_pods: typing.Dict[int, Pod] = {}
watch_pods: typing.List[Pod] = []
for i in range(cfg.kill):
pod = pods[i]
core_v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace, body=V1DeleteOptions(
grace_period_seconds=0,
))
p = Pod(
pod.metadata.namespace,
pod.metadata.name
)
killed_pods[int(time.time_ns())] = p
watch_pods.append(p)
# endregion
# region Wait for pods to be removed
start_time = time.time()
while len(watch_pods) > 0:
time.sleep(cfg.backoff)
new_watch_pods: typing.List[Pod] = []
for p in watch_pods:
try:
read_pod = core_v1.read_namespaced_pod(p.name, p.namespace)
new_watch_pods.append(p)
except ApiException as e:
if e.status != 404:
raise
watch_pods = new_watch_pods
current_time = time.time()
if current_time - start_time > cfg.timeout:
return "error", PodErrorOutput("Timeout while waiting for pods to be removed.")
return "success", PodKillSuccessOutput(killed_pods)
# endregion
except Exception:
return "error", PodErrorOutput(
format_exc()
)
@dataclass
class WaitForPodsConfig:
"""
WaitForPodsConfig is a configuration structure for wait-for-pod steps.
"""
namespace_pattern: re.Pattern
name_pattern: typing.Annotated[
typing.Optional[re.Pattern],
validation.required_if_not("label_selector")
] = None
label_selector: typing.Annotated[
typing.Optional[str],
validation.min(1),
validation.required_if_not("name_pattern")
] = None
count: typing.Annotated[int, validation.min(1)] = field(
default=1,
metadata={"name": "Pod count", "description": "Wait for at least this many pods to exist"}
)
timeout: typing.Annotated[int, validation.min(1)] = field(
default=180,
metadata={"name": "Timeout", "description": "How many seconds to wait for?"}
)
backoff: int = field(default=1, metadata={
"name": "Backoff",
"description": "How many seconds to wait between checks for the target pod status."
})
kubeconfig_path: typing.Optional[str] = None
@plugin.step(
"wait-for-pods",
"Wait for pods",
"Wait for the specified number of pods to be present",
{"success": PodWaitSuccessOutput, "error": PodErrorOutput}
)
def wait_for_pods(cfg: WaitForPodsConfig) -> typing.Tuple[str, typing.Union[PodWaitSuccessOutput, PodErrorOutput]]:
try:
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
timeout = False
start_time = datetime.now()
while not timeout:
pods = _find_pods(core_v1, cfg.label_selector, cfg.name_pattern, cfg.namespace_pattern)
if len(pods) >= cfg.count:
return "success", \
PodWaitSuccessOutput(list(map(lambda p: Pod(p.metadata.namespace, p.metadata.name), pods)))
time.sleep(cfg.backoff)
now_time = datetime.now()
time_diff = now_time - start_time
if time_diff.seconds > cfg.timeout:
return "error", PodErrorOutput(
"timeout while waiting for pods to come up"
)
except Exception:
return "error", PodErrorOutput(
format_exc()
)
if __name__ == "__main__":
sys.exit(plugin.run(plugin.build_schema(
kill_pods,
wait_for_pods,
)))

View File

@@ -1,7 +1,7 @@
import logging
from arcaflow_plugin_sdk import serialization
from kraken.plugins import pod_plugin
import arcaflow_plugin_kill_pod
import kraken.cerberus.setup as cerberus
import kraken.post_actions.actions as post_actions
@@ -26,8 +26,8 @@ def run(kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_dur
input = serialization.load_from_file(pod_scenario)
s = pod_plugin.get_schema()
input_data: pod_plugin.KillPodConfig = s.unserialize_input("pod", input)
s = arcaflow_plugin_kill_pod.get_schema()
input_data: arcaflow_plugin_kill_pod.KillPodConfig = s.unserialize_input("pod", input)
if kubeconfig_path is not None:
input_data.kubeconfig_path = kubeconfig_path
@@ -35,10 +35,10 @@ def run(kubeconfig_path, scenarios_list, config, failed_post_scenarios, wait_dur
output_id, output_data = s.call_step("pod", input_data)
if output_id == "error":
data: pod_plugin.PodErrorOutput = output_data
data: arcaflow_plugin_kill_pod.PodErrorOutput = output_data
logging.error("Failed to run pod scenario: {}".format(data.error))
else:
data: pod_plugin.PodSuccessOutput = output_data
data: arcaflow_plugin_kill_pod.PodSuccessOutput = output_data
for pod in data.pods:
print("Deleted pod {} in namespace {}\n".format(pod.pod_name, pod.pod_namespace))
except Exception as e:

View File

@@ -1,5 +1,37 @@
import urllib3
import logging
import prometheus_api_client
import sys
import kraken.invoke.command as runcommand
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Initialize the client
def initialize_prom_client(distribution, prometheus_url, prometheus_bearer_token):
global prom_cli
prometheus_url, prometheus_bearer_token = instance(distribution, prometheus_url, prometheus_bearer_token)
if prometheus_url and prometheus_bearer_token:
bearer = "Bearer " + prometheus_bearer_token
headers = {"Authorization": bearer}
try:
prom_cli = prometheus_api_client.PrometheusConnect(url=prometheus_url, headers=headers, disable_ssl=True)
except Exception as e:
logging.error("Not able to initialize the client %s" % e)
sys.exit(1)
else:
prom_cli = None
# Process custom prometheus query
def process_prom_query(query):
if prom_cli:
try:
return prom_cli.custom_query(query=query, params=None)
except Exception as e:
logging.error("Failed to get the metrics: %s" % e)
sys.exit(1)
else:
logging.info("Skipping the prometheus query as the prometheus client couldn't " "be initilized\n")
# Get prometheus details
def instance(distribution, prometheus_url, prometheus_bearer_token):
@@ -10,7 +42,8 @@ def instance(distribution, prometheus_url, prometheus_bearer_token):
prometheus_url = "https://" + url
if distribution == "openshift" and not prometheus_bearer_token:
prometheus_bearer_token = runcommand.invoke(
"oc -n openshift-monitoring sa get-token prometheus-k8s "
"|| oc create token -n openshift-monitoring prometheus-k8s"
"oc create token -n openshift-monitoring prometheus-k8s --duration=12h "
"|| oc -n openshift-monitoring sa get-token prometheus-k8s "
"|| oc sa new-token -n openshift-monitoring prometheus-k8s"
)
return prometheus_url, prometheus_bearer_token

View File

@@ -13,19 +13,27 @@ oauth2client>=4.1.3
python-openstackclient
gitpython
paramiko
setuptools==63.4.1
setuptools==65.5.1
openshift-client
python-ipmi
podman-compose
docker-compose
docker
jinja2==3.0.3
itsdangerous==2.0.1
werkzeug==2.0.3
werkzeug==2.2.3
lxml >= 4.3.0
pyVmomi >= 6.7
zope.interface==5.4.0
aliyun-python-sdk-core==2.13.36
aliyun-python-sdk-ecs==4.24.25
arcaflow-plugin-sdk
arcaflow-plugin-sdk>=0.9.0
wheel
service_identity
git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.0.0
git+https://github.com/redhat-chaos/arcaflow-plugin-kill-pod.git
arcaflow >= 0.4.1
prometheus_api_client
ibm_cloud_sdk_core
ibm_vpc
pytest

View File

@@ -16,12 +16,15 @@ import kraken.pod_scenarios.setup as pod_scenarios
import kraken.namespace_actions.common_namespace_functions as namespace_actions
import kraken.shut_down.common_shut_down_func as shut_down
import kraken.node_actions.run as nodeaction
import kraken.managedcluster_scenarios.run as managedcluster_scenarios
import kraken.kube_burner.client as kube_burner
import kraken.zone_outage.actions as zone_outages
import kraken.application_outage.actions as application_outage
import kraken.pvc.pvc_scenario as pvc_scenario
import kraken.network_chaos.actions as network_chaos
import kraken.arcaflow_plugin as arcaflow_plugin
import server as server
import kraken.prometheus.client as promcli
from kraken import plugins
KUBE_BURNER_URL = (
@@ -41,14 +44,16 @@ def main(cfg):
if os.path.isfile(cfg):
with open(cfg, "r") as f:
config = yaml.full_load(f)
global kubeconfig_path, wait_duration
global kubeconfig_path, wait_duration, kraken_config
distribution = config["kraken"].get("distribution", "openshift")
kubeconfig_path = os.path.expanduser(config["kraken"].get("kubeconfig_path", ""))
chaos_scenarios = config["kraken"].get("chaos_scenarios", [])
publish_running_status = config["kraken"].get(
"publish_kraken_status", False
kubeconfig_path = os.path.expanduser(
config["kraken"].get("kubeconfig_path", "")
)
port = config["kraken"].get("port", "8081")
kraken_config = cfg
chaos_scenarios = config["kraken"].get("chaos_scenarios", [])
publish_running_status = config["kraken"].get("publish_kraken_status", False)
port = config["kraken"].get("port")
signal_address = config["kraken"].get("signal_address")
run_signal = config["kraken"].get("signal_state", "RUN")
litmus_install = config["kraken"].get("litmus_install", True)
litmus_version = config["kraken"].get("litmus_version", "v1.9.1")
@@ -63,15 +68,12 @@ def main(cfg):
"deploy_dashboards", False
)
dashboard_repo = config["performance_monitoring"].get(
"repo",
"https://github.com/cloud-bulldozer/performance-dashboards.git"
)
capture_metrics = config["performance_monitoring"].get(
"capture_metrics", False
"repo", "https://github.com/cloud-bulldozer/performance-dashboards.git"
)
capture_metrics = config["performance_monitoring"].get("capture_metrics", False)
kube_burner_url = config["performance_monitoring"].get(
"kube_burner_binary_url",
KUBE_BURNER_URL.format(version=KUBE_BURNER_VERSION)
KUBE_BURNER_URL.format(version=KUBE_BURNER_VERSION),
)
config_path = config["performance_monitoring"].get(
"config_path", "config/kube_burner.yaml"
@@ -79,25 +81,19 @@ def main(cfg):
metrics_profile = config["performance_monitoring"].get(
"metrics_profile_path", "config/metrics-aggregated.yaml"
)
prometheus_url = config["performance_monitoring"].get(
"prometheus_url", ""
)
prometheus_url = config["performance_monitoring"].get("prometheus_url", "")
prometheus_bearer_token = config["performance_monitoring"].get(
"prometheus_bearer_token", ""
)
run_uuid = config["performance_monitoring"].get("uuid", "")
enable_alerts = config["performance_monitoring"].get(
"enable_alerts", False
)
alert_profile = config["performance_monitoring"].get(
"alert_profile", ""
)
enable_alerts = config["performance_monitoring"].get("enable_alerts", False)
alert_profile = config["performance_monitoring"].get("alert_profile", "")
check_critical_alerts = config["performance_monitoring"].get("check_critical_alerts", False)
# Initialize clients
if not os.path.isfile(kubeconfig_path):
logging.error(
"Cannot read the kubeconfig file at %s, please check" %
kubeconfig_path
"Cannot read the kubeconfig file at %s, please check" % kubeconfig_path
)
sys.exit(1)
logging.info("Initializing client to talk to the Kubernetes cluster")
@@ -109,11 +105,12 @@ def main(cfg):
# Set up kraken url to track signal
if not 0 <= int(port) <= 65535:
logging.info(
"Using port 8081 as %s isn't a valid port number" % (port)
)
port = 8081
address = ("0.0.0.0", port)
logging.error("%s isn't a valid port number, please check" % (port))
sys.exit(1)
if not signal_address:
logging.error("Please set the signal address in the config")
sys.exit(1)
address = (signal_address, port)
# If publish_running_status is False this should keep us going
# in our loop below
@@ -121,12 +118,11 @@ def main(cfg):
server_address = address[0]
port = address[1]
logging.info(
"Publishing kraken status at http://%s:%s" % (
server_address,
port
)
"Publishing kraken status at http://%s:%s" % (server_address, port)
)
logging.info(
"Publishing kraken status at http://%s:%s" % (server_address, port)
)
logging.info("Publishing kraken status at http://%s:%s" % (server_address, port))
server.start_server(address, run_signal)
# Cluster info
@@ -158,15 +154,13 @@ def main(cfg):
# Set the number of iterations to loop to infinity if daemon mode is
# enabled or else set it to the provided iterations count in the config
if daemon_mode:
logging.info(
"Daemon mode enabled, kraken will cause chaos forever\n"
)
logging.info("Daemon mode enabled, kraken will cause chaos forever\n")
logging.info("Ignoring the iterations set")
iterations = float("inf")
else:
logging.info(
"Daemon mode not enabled, will run through %s iterations\n" %
str(iterations)
"Daemon mode not enabled, will run through %s iterations\n"
% str(iterations)
)
iterations = int(iterations)
@@ -188,8 +182,7 @@ def main(cfg):
while publish_running_status and run_signal == "PAUSE":
logging.info(
"Pausing Kraken run, waiting for %s seconds"
" and will re-poll signal"
% str(wait_duration)
" and will re-poll signal" % str(wait_duration)
)
time.sleep(wait_duration)
run_signal = server.get_status(address)
@@ -207,30 +200,39 @@ def main(cfg):
"kill-pods configuration instead."
)
sys.exit(1)
elif scenario_type == "arcaflow_scenarios":
failed_post_scenarios = arcaflow_plugin.run(
scenarios_list, kubeconfig_path
)
elif scenario_type == "plugin_scenarios":
failed_post_scenarios = plugins.run(
scenarios_list,
kubeconfig_path,
failed_post_scenarios
kraken_config,
failed_post_scenarios,
wait_duration,
)
elif scenario_type == "container_scenarios":
logging.info("Running container scenarios")
failed_post_scenarios = \
pod_scenarios.container_run(
kubeconfig_path,
scenarios_list,
config,
failed_post_scenarios,
wait_duration
)
failed_post_scenarios = pod_scenarios.container_run(
kubeconfig_path,
scenarios_list,
config,
failed_post_scenarios,
wait_duration,
)
# Inject node chaos scenarios specified in the config
elif scenario_type == "node_scenarios":
logging.info("Running node scenarios")
nodeaction.run(
scenarios_list,
config,
wait_duration
nodeaction.run(scenarios_list, config, wait_duration)
# Inject managedcluster chaos scenarios specified in the config
elif scenario_type == "managedcluster_scenarios":
logging.info("Running managedcluster scenarios")
managedcluster_scenarios.run(
scenarios_list, config, wait_duration
)
# Inject time skew chaos scenarios specified
@@ -238,11 +240,7 @@ def main(cfg):
elif scenario_type == "time_scenarios":
if distribution == "openshift":
logging.info("Running time skew scenarios")
time_actions.run(
scenarios_list,
config,
wait_duration
)
time_actions.run(scenarios_list, config, wait_duration)
else:
logging.error(
"Litmus scenarios are currently "
@@ -258,24 +256,19 @@ def main(cfg):
if litmus_install:
# Remove Litmus resources
# before running the scenarios
common_litmus.delete_chaos(
litmus_namespace
)
common_litmus.delete_chaos(litmus_namespace)
common_litmus.delete_chaos_experiments(
litmus_namespace
)
if litmus_uninstall_before_run:
common_litmus.uninstall_litmus(
litmus_version,
litmus_namespace
litmus_version, litmus_namespace
)
common_litmus.install_litmus(
litmus_version,
litmus_namespace
litmus_version, litmus_namespace
)
common_litmus.deploy_all_experiments(
litmus_version,
litmus_namespace
litmus_version, litmus_namespace
)
litmus_installed = True
common_litmus.run(
@@ -294,11 +287,7 @@ def main(cfg):
# Inject cluster shutdown scenarios
elif scenario_type == "cluster_shut_down_scenarios":
shut_down.run(
scenarios_list,
config,
wait_duration
)
shut_down.run(scenarios_list, config, wait_duration)
# Inject namespace chaos scenarios
elif scenario_type == "namespace_scenarios":
@@ -308,25 +297,19 @@ def main(cfg):
config,
wait_duration,
failed_post_scenarios,
kubeconfig_path
kubeconfig_path,
)
# Inject zone failures
elif scenario_type == "zone_outages":
logging.info("Inject zone outages")
zone_outages.run(
scenarios_list,
config,
wait_duration
)
zone_outages.run(scenarios_list, config, wait_duration)
# Application outages
elif scenario_type == "application_outages":
logging.info("Injecting application outage")
application_outage.run(
scenarios_list,
config,
wait_duration
scenarios_list, config, wait_duration
)
# PVC scenarios
@@ -337,11 +320,21 @@ def main(cfg):
# Network scenarios
elif scenario_type == "network_chaos":
logging.info("Running Network Chaos")
network_chaos.run(
scenarios_list,
config,
wait_duration
)
network_chaos.run(scenarios_list, config, wait_duration)
# Check for critical alerts when enabled
if check_critical_alerts:
logging.info("Checking for critical alerts firing post choas")
promcli.initialize_prom_client(distribution, prometheus_url, prometheus_bearer_token)
query = r"""ALERTS{severity="critical"}"""
critical_alerts = promcli.process_prom_query(query)
critical_alerts_count = len(critical_alerts)
if critical_alerts_count > 0:
logging.error("Critical alerts are firing: %s", critical_alerts)
logging.error("Please check, exiting")
sys.exit(1)
else:
logging.info("No critical alerts are firing!!")
iteration += 1
logging.info("")
@@ -380,7 +373,7 @@ def main(cfg):
else:
logging.error("Alert profile is not defined")
sys.exit(1)
if litmus_uninstall and litmus_installed:
common_litmus.delete_chaos(litmus_namespace)
common_litmus.delete_chaos_experiments(litmus_namespace)
@@ -395,8 +388,7 @@ def main(cfg):
run_dir = os.getcwd() + "/kraken.report"
logging.info(
"Successfully finished running Kraken. UUID for the run: "
"%s. Report generated at %s. Exiting"
% (run_uuid, run_dir)
"%s. Report generated at %s. Exiting" % (run_uuid, run_dir)
)
else:
logging.error("Cannot find a config at %s, please check" % (cfg))
@@ -419,7 +411,7 @@ if __name__ == "__main__":
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[
logging.FileHandler("kraken.report", mode="w"),
logging.StreamHandler()
logging.StreamHandler(),
],
)
if options.cfg is None:

View File

@@ -0,0 +1,11 @@
---
deployer:
connection: {}
type: kubernetes
log:
level: debug
logged_outputs:
error:
level: error
success:
level: debug

View File

@@ -0,0 +1,14 @@
input_list:
- cpu_count: 1
cpu_load_percentage: 80
cpu_method: all
duration: 30s
node_selector: {}
# node selector example
# node_selector:
# kubernetes.io/hostname: master
kubeconfig: ""
namespace: default
# duplicate this section to run simultaneous stressors in the same run
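As the comment above suggests, input_list accepts several entries; a hedged sketch of two parallel items pinned to two hypothetical worker nodes (hostnames are placeholders, all other values mirror the example above):
````
input_list:
  - cpu_count: 1
    cpu_load_percentage: 80
    cpu_method: all
    duration: 30s
    node_selector:
      kubernetes.io/hostname: worker-0   # hypothetical node
    kubeconfig: ""
    namespace: default
  - cpu_count: 1
    cpu_load_percentage: 80
    cpu_method: all
    duration: 30s
    node_selector:
      kubernetes.io/hostname: worker-1   # hypothetical node
    kubeconfig: ""
    namespace: default
````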

View File

@@ -0,0 +1,93 @@
input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
kubeconfig:
display:
description: The complete kubeconfig file as a string
name: Kubeconfig file contents
type:
type_id: string
required: true
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
duration:
display:
name: duration of the scenario expressed in seconds
description: stop stress test after T seconds. One can also specify the units of time in
seconds, minutes, hours, days or years with the suffix s, m, h, d or y
type:
type_id: string
required: true
cpu_count:
display:
description: Number of CPU cores to be used (0 means all)
name: number of CPUs
type:
type_id: integer
required: true
cpu_method:
display:
description: CPU stress method
name: fine grained control of which cpu stressors to use (ackermann, cfloat etc.)
type:
type_id: string
required: true
cpu_load_percentage:
display:
description: load CPU by percentage
name: CPU load
type:
type_id: integer
required: true
steps:
kubeconfig:
plugin: quay.io/arcalot/arcaflow-plugin-kubeconfig:latest
input:
kubeconfig: !expr $.input.kubeconfig
stressng:
plugin: quay.io/arcalot/arcaflow-plugin-stressng:latest
step: workload
input:
StressNGParams:
timeout: !expr $.input.duration
cleanup: "true"
items:
- stressor: cpu
cpu_count: !expr $.input.cpu_count
cpu_method: !expr $.input.cpu_method
cpu_load: !expr $.input.cpu_load_percentage
deploy:
type: kubernetes
connection: !expr $.steps.kubeconfig.outputs.success.connection
pod:
metadata:
namespace: !expr $.input.namespace
labels:
arcaflow: stressng
spec:
nodeSelector: !expr $.input.node_selector
pluginContainer:
imagePullPolicy: Always
outputs:
success:
stressng: !expr $.steps.stressng.outputs.success

View File

@@ -0,0 +1,76 @@
input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
input_list:
type:
type_id: list
items:
id: input_item
type_id: object
properties:
kubeconfig:
display:
description: The complete kubeconfig file as a string
name: Kubeconfig file contents
type:
type_id: string
required: true
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
duration:
display:
name: duration of the scenario expressed in seconds
description: stop stress test after T seconds. One can also specify the units of time in
seconds, minutes, hours, days or years with the suffix s, m, h, d or y
type:
type_id: string
required: true
cpu_count:
display:
description: Number of CPU cores to be used (0 means all)
name: number of CPUs
type:
type_id: integer
required: true
cpu_method:
display:
description: CPU stress method
name: fine grained control of which cpu stressors to use (ackermann, cfloat etc.)
type:
type_id: string
required: true
cpu_load_percentage:
display:
description: load CPU by percentage
name: CPU load
type:
type_id: integer
required: true
steps:
workload_loop:
kind: foreach
items: !expr $.input.input_list
workflow: sub-workflow.yaml
parallelism: 1000
outputs:
success:
workloads: !expr $.steps.workload_loop.outputs.success.data

View File

@@ -0,0 +1,11 @@
---
deployer:
connection: {}
type: kubernetes
log:
level: debug
logged_outputs:
error:
level: error
success:
level: debug

View File

@@ -0,0 +1,19 @@
input_list:
- duration: 30s
io_block_size: 1m
io_workers: 1
io_write_bytes: 10m
target_pod_folder: /data
target_pod_volume:
hostPath:
path: /
name: node-volume
node_selector: { }
# node selector example
# node_selector:
# kubernetes.io/hostname: master
kubeconfig: ""
namespace: default
# duplicate this section to run simultaneous stressors in the same run

View File

@@ -0,0 +1,136 @@
input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
kubeconfig:
display:
description: The complete kubeconfig file as a string
name: Kubeconfig file contents
type:
type_id: string
required: true
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
duration:
display:
name: duration of the scenario expressed in seconds
description: stop stress test after T seconds. One can also specify the units of time in
seconds, minutes, hours, days or years with the suffix s, m, h, d or y
type:
type_id: string
required: true
io_workers:
display:
description: number of workers
name: start N workers continually writing, reading and removing temporary files
type:
type_id: integer
required: true
io_block_size:
display:
description: single write size
name: specify size of each write in bytes. Size can be from 1 byte to 4MB.
type:
type_id: string
required: true
io_write_bytes:
display:
description: Total number of bytes written
name: write N bytes for each hdd process, the default is 1 GB. One can specify the size
as % of free space on the file system or in units of Bytes, KBytes, MBytes and
GBytes using the suffix b, k, m or g
type:
type_id: string
required: true
target_pod_folder:
display:
description: Target Folder
name: Folder in the pod where the test will be executed and the test files will be written
type:
type_id: string
required: true
target_pod_volume:
display:
name: kubernetes volume definition
description: the volume that will be attached to the pod. In order to stress
the node storage, only hostPath mode is currently supported
type:
type_id: object
id: k8s_volume
properties:
name:
display:
description: name of the volume (must match the name in pod definition)
type:
type_id: string
required: true
hostPath:
display:
description: hostPath options expressed as string map (key-value)
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
required: true
steps:
kubeconfig:
plugin: quay.io/arcalot/arcaflow-plugin-kubeconfig:latest
input:
kubeconfig: !expr $.input.kubeconfig
stressng:
plugin: quay.io/arcalot/arcaflow-plugin-stressng:latest
step: workload
input:
StressNGParams:
timeout: !expr $.input.duration
cleanup: "true"
workdir: !expr $.input.target_pod_folder
items:
- stressor: hdd
hdd: !expr $.input.io_workers
hdd_bytes: !expr $.input.io_write_bytes
hdd_write_size: !expr $.input.io_block_size
deploy:
type: kubernetes
connection: !expr $.steps.kubeconfig.outputs.success.connection
pod:
metadata:
namespace: !expr $.input.namespace
labels:
arcaflow: stressng
spec:
nodeSelector: !expr $.input.node_selector
pluginContainer:
imagePullPolicy: Always
volumeMounts:
- mountPath: /data
name: node-volume
volumes:
- !expr $.input.target_pod_volume
outputs:
success:
stressng: !expr $.steps.stressng.outputs.success

View File

@@ -0,0 +1,113 @@
input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
input_list:
type:
type_id: list
items:
id: input_item
type_id: object
properties:
kubeconfig:
display:
description: The complete kubeconfig file as a string
name: Kubeconfig file contents
type:
type_id: string
required: true
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
duration:
display:
name: duration of the scenario expressed in seconds
description: stop stress test after T seconds. One can also specify the units of time in
seconds, minutes, hours, days or years with the suffix s, m, h, d or y
type:
type_id: string
required: true
io_workers:
display:
description: number of workers
name: start N workers continually writing, reading and removing temporary files
type:
type_id: integer
required: true
io_block_size:
display:
description: single write size
name: specify size of each write in bytes. Size can be from 1 byte to 4MB.
type:
type_id: string
required: true
io_write_bytes:
display:
description: Total number of bytes written
name: write N bytes for each hdd process, the default is 1 GB. One can specify the size
as % of free space on the file system or in units of Bytes, KBytes, MBytes and
GBytes using the suffix b, k, m or g
type:
type_id: string
required: true
target_pod_folder:
display:
description: Target Folder
name: Folder in the pod where the test will be executed and the test files will be written
type:
type_id: string
required: true
target_pod_volume:
display:
name: kubernetes volume definition
description: the volume that will be attached to the pod. In order to stress
the node storage, only hostPath mode is currently supported
type:
type_id: object
id: k8s_volume
properties:
name:
display:
description: name of the volume (must match the name in pod definition)
type:
type_id: string
required: true
hostPath:
display:
description: hostPath options expressed as string map (key-value)
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
required: true
steps:
workload_loop:
kind: foreach
items: !expr $.input.input_list
workflow: sub-workflow.yaml
parallelism: 1000
outputs:
success:
workloads: !expr $.steps.workload_loop.outputs.success.data

View File

@@ -0,0 +1,11 @@
---
deployer:
connection: {}
type: kubernetes
log:
level: debug
logged_outputs:
error:
level: error
success:
level: debug

View File

@@ -0,0 +1,13 @@
input_list:
- duration: 30s
vm_bytes: 10%
vm_workers: 2
node_selector: { }
# node selector example
# node_selector:
# kubernetes.io/hostname: master
kubeconfig: ""
namespace: default
# duplicate this section to run simultaneous stressors in the same run

View File

@@ -0,0 +1,85 @@
input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
kubeconfig:
display:
description: The complete kubeconfig file as a string
name: Kubeconfig file contents
type:
type_id: string
required: true
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
duration:
display:
name: duration of the scenario expressed in seconds
description: stop stress test after T seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y
type:
type_id: string
required: true
vm_workers:
display:
description: Number of VM stressors to be run (0 means 1 stressor per CPU)
name: Number of VM stressors
type:
type_id: integer
required: true
vm_bytes:
display:
description: N bytes per vm process, the default is 256MB. The size can be expressed in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
name: VM stressor bytes
type:
type_id: string
required: true
steps:
kubeconfig:
plugin: quay.io/arcalot/arcaflow-plugin-kubeconfig:latest
input:
kubeconfig: !expr $.input.kubeconfig
stressng:
plugin: quay.io/arcalot/arcaflow-plugin-stressng:latest
step: workload
input:
StressNGParams:
timeout: !expr $.input.duration
cleanup: "true"
items:
- stressor: vm
vm: !expr $.input.vm_workers
vm_bytes: !expr $.input.vm_bytes
deploy:
type: kubernetes
connection: !expr $.steps.kubeconfig.outputs.success.connection
pod:
metadata:
namespace: !expr $.input.namespace
labels:
arcaflow: stressng
spec:
nodeSelector: !expr $.input.node_selector
pluginContainer:
imagePullPolicy: Always
outputs:
success:
stressng: !expr $.steps.stressng.outputs.success

View File

@@ -0,0 +1,72 @@
input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
input_list:
type:
type_id: list
items:
id: input_item
type_id: object
properties:
kubeconfig:
display:
description: The complete kubeconfig file as a string
name: Kubeconfig file contents
type:
type_id: string
required: true
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
duration:
display:
name: duration of the scenario expressed in seconds
description: stop stress test after T seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y
type:
type_id: string
required: true
vm_workers:
display:
description: Number of VM stressors to be run (0 means 1 stressor per CPU)
name: Number of VM stressors
type:
type_id: integer
required: true
vm_bytes:
display:
description: N bytes per vm process, the default is 256MB. The size can be expressed in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
name: VM stressor bytes
type:
type_id: string
required: true
steps:
workload_loop:
kind: foreach
items: !expr $.input.input_list
workflow: sub-workflow.yaml
parallelism: 1000
outputs:
success:
workloads: !expr $.steps.workload_loop.outputs.success.data

View File

@@ -0,0 +1,16 @@
node_scenarios:
- actions: # node chaos scenarios to be injected
- node_stop_start_scenario
node_name: kind-worker # node on which scenario has to be injected; can set multiple names separated by comma
# label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
runs: 1 # number of times to inject each scenario under actions (will perform on same node each time)
timeout: 120 # duration to wait for completion of node scenario injection
cloud_type: docker # cloud type on which Kubernetes/OpenShift runs
- actions:
- node_reboot_scenario
node_name: kind-worker
# label_selector: node-role.kubernetes.io/infra
instance_count: 1
timeout: 120
cloud_type: docker

scenarios/kind/scheduler.yml Executable file
View File

@@ -0,0 +1,10 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^kube-system$
label_selector: component=kube-scheduler
- id: wait-for-pods
config:
namespace_pattern: ^kube-system$
label_selector: component=kube-scheduler
count: 3

View File

@@ -0,0 +1,17 @@
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, one has to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60

View File

@@ -5,4 +5,4 @@ scenarios:
container_name: "etcd"
action: "kill 1"
count: 1
retry_wait: 60
expected_recovery_time: 60

View File

@@ -0,0 +1,9 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: <ibmcloud-node-terminate/ibmcloud-node-reboot/ibmcloud-node-stop/ibmcloud-node-start>
config:
name: ""
label_selector: "node-role.kubernetes.io/worker" # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
runs: 1 # Number of times to inject each scenario under actions (will perform on same node each time)
instance_count: 1 # Number of nodes to perform action/select that match the label selector
timeout: 30 # Duration to wait for completion of node scenario injection
skip_openshift_checks: False # Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario
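Purely as an illustration of the placeholders above, one hypothetical choice of action and values (not defaults) could be:
````
- id: ibmcloud-node-reboot
  config:
    name: ""                                         # empty, so the label selector picks the node
    label_selector: "node-role.kubernetes.io/worker"
    runs: 1
    instance_count: 1
    timeout: 30
    skip_openshift_checks: False
````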

View File

@@ -1,37 +1,10 @@
# yaml-language-server: $schema=../pod.schema.json
namespace_pattern: openshift-kube-apiserver
kill: 1
config:
runStrategy:
runs: 1
maxSecondsBetweenRuns: 30
minSecondsBetweenRuns: 1
scenarios:
- name: "delete openshift-kube-apiserver pods"
steps:
- podAction:
matches:
- labels:
namespace: "openshift-kube-apiserver"
selector: "app=openshift-kube-apiserver"
filters:
- randomSample:
size: 1
# The actions will be executed in the order specified
actions:
- kill:
probability: 1
force: true
- podAction:
matches:
- labels:
namespace: "openshift-kube-apiserver"
selector: "app=openshift-kube-apiserver"
retries:
retriesTimeout:
timeout: 180
actions:
- checkPodCount:
count: 3
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^openshift-kube-apiserver$
label_selector: app=openshift-kube-apiserver
- id: wait-for-pods
config:
namespace_pattern: ^openshift-kube-apiserver$
label_selector: app=openshift-kube-apiserver
count: 3

View File

@@ -0,0 +1,15 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: pod_network_outage
config:
namespace: <namespace> # Required - Namespace of the pod to which the filter needs to be applied
direction: # Optional - List of directions to apply filters
- <egress/ingress> # Default: both egress and ingress
ingress_ports: # Optional - List of ports to block ingress traffic on
- <port number> # Default [], i.e. all ports
egress_ports: # Optional - List of ports to block egress traffic on
- <port number> # Default [], i.e. all ports
pod_name: <pod name> # When label_selector is not specified, the pod matching this name is selected for the chaos scenario
label_selector: <label_selector> # When pod_name is not specified, pods with a matching label_selector are selected for the chaos scenario
instance_count: <number> # Number of pods to act on that match the label selector
wait_duration: <time_duration> # Default is 300. Ensure that it is at least about twice the test_duration
test_duration: <time_duration> # Default is 120
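As a hypothetical filled-in instance of the template above, blocking outbound traffic on two example ports for pods selected by label (namespace, ports and label are illustrative, not defaults):
````
- id: pod_network_outage
  config:
    namespace: test-workloads
    direction:
      - egress
    egress_ports:
      - 443
      - 8443
    label_selector: 'app=webserver'
    test_duration: 120
````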

View File

@@ -20,7 +20,7 @@ def run(cmd):
# Get cluster operators and return yaml
def get_cluster_operators():
operators_status = run("kubectl get co -o yaml")
status_yaml = yaml.load(operators_status, Loader=yaml.FullLoader)
status_yaml = yaml.safe_load(operators_status)
return status_yaml

View File

@@ -1,5 +1,5 @@
# yaml-language-server: $schema=../plugin.schema.json
- id: <node_stop_scenario/node_start_scenario/node_reboot_scenario/node_terminate_scenario>
- id: <vmware-node-stop/vmware-node-start/vmware-node-reboot/vmware-node-terminate>
config:
name: <node_name> # Node on which scenario has to be injected; can set multiple names separated by comma
label_selector: <label_selector> # When node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection

File diff suppressed because it is too large

View File

@@ -1,175 +0,0 @@
import random
import re
import string
import threading
import unittest
from arcaflow_plugin_sdk import plugin
from kubernetes.client import V1Pod, V1ObjectMeta, V1PodSpec, V1Container, ApiException
from kraken.plugins import pod_plugin
from kraken.plugins.pod_plugin import setup_kubernetes, KillPodConfig, PodKillSuccessOutput
from kubernetes import client
class KillPodTest(unittest.TestCase):
def test_serialization(self):
plugin.test_object_serialization(
pod_plugin.KillPodConfig(
namespace_pattern=re.compile(".*"),
name_pattern=re.compile(".*")
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodKillSuccessOutput(
pods={}
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodErrorOutput(
error="Hello world!"
),
self.fail,
)
def test_not_enough_pods(self):
name = ''.join(random.choices(string.ascii_lowercase, k=8))
output_id, output_data = pod_plugin.kill_pods(KillPodConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^unit-test-" + re.escape(name) + "$"),
))
if output_id != "error":
self.fail("Not enough pods did not result in an error.")
print(output_data.error)
def test_kill_pod(self):
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
pod = core_v1.create_namespaced_pod("default", V1Pod(
metadata=V1ObjectMeta(
generate_name="test-",
),
spec=V1PodSpec(
containers=[
V1Container(
name="test",
image="alpine",
tty=True,
)
]
),
))
def remove_test_pod():
try:
core_v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
except ApiException as e:
if e.status != 404:
raise
self.addCleanup(remove_test_pod)
output_id, output_data = pod_plugin.kill_pods(KillPodConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^" + re.escape(pod.metadata.name) + "$"),
))
if output_id == "error":
self.fail(output_data.error)
self.assertIsInstance(output_data, PodKillSuccessOutput)
out: PodKillSuccessOutput = output_data
self.assertEqual(1, len(out.pods))
pod_list = list(out.pods.values())
self.assertEqual(pod.metadata.name, pod_list[0].name)
try:
core_v1.read_namespaced_pod(pod_list[0].name, pod_list[0].namespace)
self.fail("Killed pod is still present.")
except ApiException as e:
if e.status != 404:
self.fail("Incorrect API exception encountered: {}".format(e))
class WaitForPodTest(unittest.TestCase):
def test_serialization(self):
plugin.test_object_serialization(
pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile(".*"),
name_pattern=re.compile(".*")
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile(".*"),
label_selector="app=nginx"
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodWaitSuccessOutput(
pods=[]
),
self.fail,
)
plugin.test_object_serialization(
pod_plugin.PodErrorOutput(
error="Hello world!"
),
self.fail,
)
def test_timeout(self):
name = "watch-test-" + ''.join(random.choices(string.ascii_lowercase, k=8))
output_id, output_data = pod_plugin.wait_for_pods(pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^" + re.escape(name) + "$"),
timeout=1
))
self.assertEqual("error", output_id)
def test_watch(self):
with setup_kubernetes(None) as cli:
core_v1 = client.CoreV1Api(cli)
name = "watch-test-" + ''.join(random.choices(string.ascii_lowercase, k=8))
def create_test_pod():
core_v1.create_namespaced_pod("default", V1Pod(
metadata=V1ObjectMeta(
name=name,
),
spec=V1PodSpec(
containers=[
V1Container(
name="test",
image="alpine",
tty=True,
)
]
),
))
def remove_test_pod():
try:
core_v1.delete_namespaced_pod(name, "default")
except ApiException as e:
if e.status != 404:
raise
self.addCleanup(remove_test_pod)
t = threading.Timer(10, create_test_pod)
t.start()
output_id, output_data = pod_plugin.wait_for_pods(pod_plugin.WaitForPodsConfig(
namespace_pattern=re.compile("^default$"),
name_pattern=re.compile("^" + re.escape(name) + "$"),
timeout=60
))
self.assertEqual("success", output_id)
if __name__ == '__main__':
unittest.main()

View File

@@ -2,8 +2,8 @@ import unittest
import os
import logging
from arcaflow_plugin_sdk import plugin
from kraken.plugins.vmware.kubernetes_functions import Actions
from kraken.plugins.vmware import vmware_plugin
from kraken.plugins.node_scenarios.kubernetes_functions import Actions
from kraken.plugins.node_scenarios import vmware_plugin
class NodeScenariosTest(unittest.TestCase):