diff --git a/README.md b/README.md index bae3e446..44ff2e77 100644 --- a/README.md +++ b/README.md @@ -1,113 +1,43 @@ # Kraken -Chaos and resiliency testing tool for Kubernetes and OpenShift +Chaos and resiliency testing tool for Kubernetes and OpenShift. +Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient to turbulent conditions. -Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check if it is resilient to failures. ### Workflow ![Kraken workflow](media/kraken-workflow.png) -### Install the dependencies -``` -$ pip3 install -r requirements.txt -``` -### Usage +### Installation and usage +Instructions on how to set up, configure, and run Kraken can be found at [Installation](docs/installation.md). -#### Config -Set the scenarios to inject and the tunings like duration to wait between each scenario in the config file located at config/config.yaml. Kraken uses [powerfulseal](https://github.com/bloomberg/powerfulseal) tool for pod based scenarios, a sample config looks like: -``` -kraken: - kubeconfig_path: /root/.kube/config # Path to kubeconfig - scenarios: # List of policies/chaos scenarios to load - - scenarios/etcd.yml - - scenarios/openshift-kube-apiserver.yml - - scenarios/openshift-apiserver.yml - node_scenarios: # List of chaos node scenarios to load - - scenarios/node_scenarios_example.yml +### Config +Instructions on how to set up the config, along with the supported options, can be found at [Config](docs/config.md).
-tunings: - wait_duration: 60 # Duration to wait between each chaos scenario -``` - -#### Run -``` -$ python3 run_kraken.py --config -``` - -#### Run containerized version -Assuming that the latest docker ( 17.05 or greater with multi-build support ) is intalled on the host, run: -``` -$ docker pull quay.io/openshift-scale/kraken:latest -$ docker run --name=kraken --net=host -v :/root/.kube/config -v :/root/kraken/config/config.yaml -d quay.io/openshift-scale/kraken:latest -$ docker logs -f kraken -``` - -Similarly, podman can be used to achieve the same: -``` -$ podman pull quay.io/openshift-scale/kraken -$ podman run --name=kraken --net=host -v :/root/.kube/config:Z -v :/root/kraken/config/config.yaml:Z -d quay.io/openshift-scale/kraken:latest -$ podman logs -f kraken -``` - -If you want to build your own kraken image see [here](https://github.com/openshift-scale/kraken/tree/master/containers/build_own_image-README.md) - -#### Report -The report is generated in the run directory and it contains the information about each chaos scenario injection along with timestamps. - -#### Cerberus to help with cluster health checks -[Cerberus](https://github.com/openshift-scale/cerberus) can be used to monitor the cluster under test and the aggregated go/no-go signal generated by it can be consumed by Kraken to determine pass/fail. This is to make sure the Kubernetes/OpenShift environments are healthy on a cluster level instead of just the targeted components level. It is highly recommended to turn on the Cerberus health check feature avaliable in Kraken after installing and setting up Cerberus. To do that, set cerberus_enabled to True and cerberus_url to the url where Cerberus publishes go/no-go signal in the config file. ### Kubernetes/OpenShift chaos scenarios supported -Kraken currently just supports pod and node based scenarios, we will be adding more soon. +Kraken supports pod and node based scenarios. 
-#### Node chaos scenarios -Following node chaos scenarios are supported: +- [Pod Scenarios](docs/pod_scenarios.md) -1. **node_start_scenario**: scenario to stop the node instance. -2. **node_stop_scenario**: scenario to stop the node instance. -3. **node_stop_start_scenario**: scenario to stop and then start the node instance. -4. **node_termination_scenario**: scenario to terminate the node instance. -5. **node_reboot_scenario**: scenario to reboot the node instance. -6. **stop_kubelet_scenario**: scenario to stop the kubelet of the node instance. -7. **stop_start_kubelet_scenario**: scenario to stop and start the kubelet of the node instance. -8. **node_crash_scenario**: scenario to crash the node instance. - -**NOTE**: If the node doesn't recover from the node_crash_scenario injection, reboot the node to get it back to Ready state. - -**NOTE**: node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported only on AWS as of now. - -**NOTE**: With AWS as the cloud type, make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) is installed. - -Node scenarios can be injected by placing the node scenarios config files under node_scenarios option in the kraken config. Refer to [node_scenarios_example](https://github.com/openshift-scale/kraken/blob/master/scenarios/node_scenarios_example.yml) config file. 
+- [Node Scenarios](docs/node_scenarios.md) -``` -node_scenarios: - - actions: # node chaos scenarios to be injected - - node_stop_start_scenario - - stop_start_kubelet_scenario - - node_crash_scenario - node_name: # node on which scenario has to be injected - label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection - instance_kill_count: 1 # number of times to inject each scenario under actions - timeout: 120 # duration to wait for completion of node scenario injection - cloud_type: aws # cloud type on which Kubernetes/OpenShift runs - - actions: - - node_reboot_scenario - node_name: - label_selector: node-role.kubernetes.io/infra - instance_kill_count: 1 - timeout: 120 - cloud_type: aws -``` +### Kraken scenario pass/fail criteria and report +It is important to check whether the targeted component recovered from the chaos injection and whether the Kubernetes/OpenShift cluster is healthy, as failures in one component can have an adverse impact on other components. Kraken does this by: +- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks. +- Leveraging [Cerberus](https://github.com/openshift-scale/cerberus) to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found [here](https://github.com/openshift-scale/cerberus#installation). Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the Kraken config file. -#### Pod chaos scenarios -Following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today. 
Adding a new pod based scenario is as simple as adding a new config under scenarios directory and defining it in the config. -Component | Description | Working ------------------------- | ---------------------------------------------------------------------------------------------------| ------------------------- | -Etcd | Kills a single/multiple etcd replicas for the specified number of times in a loop | :heavy_check_mark: | -Kube ApiServer | Kills a single/multiple kube-apiserver replicas for the specified number of times in a loop | :heavy_check_mark: | -ApiServer | Kills a single/multiple apiserver replicas for the specified number of times in a loop | :heavy_check_mark: | -Prometheus | Kills a single/multiple prometheus replicas for the specified number of times in a loop | :heavy_check_mark: | +### Blogs and other useful resources +- https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes + + +### Contributions +We are always looking for enhancements and fixes to make Kraken better, and any contributions are most welcome. Feel free to report or work on the issues filed on GitHub. + + +### Community +Key Members (slack_usernames): paigerube14, rook, mffiedler, mohit, dry923, rsevilla, ravi +* [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com) +* [**#forum-perfscale on CoreOS Slack**](https://coreos.slack.com) diff --git a/config/config.yaml b/config/config.yaml index fee5aac1..1e1f58af 100644 --- a/config/config.yaml +++ b/config/config.yaml @@ -1,4 +1,3 @@ - kraken: kubeconfig_path: /root/.kube/config # Path to kubeconfig exit_on_failure: False # Exit when a post action scenario fails diff --git a/docs/config.md b/docs/config.md new file mode 100644 index 00000000..3b35a3cb --- /dev/null +++ b/docs/config.md @@ -0,0 +1,21 @@ +### Config +Set the scenarios to inject and the tunings, such as the duration to wait between each scenario, in the config file located at config/config.yaml. 
A sample config looks like: + +``` +kraken: + kubeconfig_path: /root/.kube/config # Path to kubeconfig + scenarios: # List of policies/chaos scenarios to load + - scenarios/etcd.yml + - scenarios/openshift-kube-apiserver.yml + - scenarios/openshift-apiserver.yml + node_scenarios: # List of chaos node scenarios to load + - scenarios/node_scenarios_example.yml + +cerberus: + cerberus_enabled: False # Set to True only after Cerberus is installed and running + cerberus_url: # When cerberus_enabled is set to True, provide the URL where Cerberus publishes the go/no-go signal + +tunings: + wait_duration: 60 # Duration to wait between each chaos scenario + iterations: 1 # Number of times to execute the scenarios + daemon_mode: False # When True, iterations are set to infinity, which means that Kraken will cause chaos forever diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 00000000..90f67899 --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,48 @@ +## Installation + +The following ways of running Kraken are supported: + +- Standalone Python program through Git +- Containerized version using either Podman or Docker as the runtime +- Kubernetes or OpenShift deployment + +**NOTE**: It is recommended to run Kraken external to the cluster (standalone or containerized), hitting the Kubernetes/OpenShift API, as running it inside the cluster might disrupt Kraken itself and might prevent it from reporting results if the chaos destabilizes the cluster's API server. 
+ +### Git + +#### Clone the repository +``` +$ git clone https://github.com/openshift-scale/kraken.git +$ cd kraken +``` + +#### Install the dependencies +``` +$ pip3 install -r requirements.txt +``` + +#### Run +``` +$ python3 run_kraken.py --config +``` + +### Run containerized version +Assuming that Docker (17.05 or greater, with multi-stage build support) is installed on the host, run: +``` +$ docker pull quay.io/openshift-scale/kraken:latest +$ docker run --name=kraken --net=host -v :/root/.kube/config -v :/root/kraken/config/config.yaml -d quay.io/openshift-scale/kraken:latest +$ docker logs -f kraken +``` + +Similarly, Podman can be used to achieve the same: +``` +$ podman pull quay.io/openshift-scale/kraken +$ podman run --name=kraken --net=host -v :/root/.kube/config:Z -v :/root/kraken/config/config.yaml:Z -d quay.io/openshift-scale/kraken:latest +$ podman logs -f kraken +``` + +If you want to build your own Kraken image, see [here](https://github.com/openshift-scale/kraken/tree/master/containers/build_own_image-README.md). + + +### Run Kraken as a Kubernetes deployment +Refer to the [instructions](https://github.com/openshift-scale/kraken/blob/master/containers/README.md) on how to deploy and run Kraken as a Kubernetes/OpenShift deployment. diff --git a/docs/node_scenarios.md b/docs/node_scenarios.md new file mode 100644 index 00000000..3371fd5a --- /dev/null +++ b/docs/node_scenarios.md @@ -0,0 +1,43 @@ +### Node Scenarios + +The following node chaos scenarios are supported: + +1. **node_start_scenario**: scenario to start the node instance. +2. **node_stop_scenario**: scenario to stop the node instance. +3. **node_stop_start_scenario**: scenario to stop and then start the node instance. +4. **node_termination_scenario**: scenario to terminate the node instance. +5. **node_reboot_scenario**: scenario to reboot the node instance. +6. **stop_kubelet_scenario**: scenario to stop the kubelet of the node instance. +7. 
**stop_start_kubelet_scenario**: scenario to stop and start the kubelet of the node instance. +8. **node_crash_scenario**: scenario to crash the node instance. + +**NOTE**: If the node doesn't recover from the node_crash_scenario injection, reboot the node to get it back to the Ready state. + +**NOTE**: node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported only on AWS as of now. + +**NOTE**: AWS is the only cloud platform supported as of today, but we are looking into adding more. Make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) is installed. + +**NOTE**: The `stop_kubelet_scenario` and `node_crash_scenario` scenarios are supported on any platform, as they are independent of the cloud platform. + + +Node scenarios can be injected by placing the node scenarios config files under the node_scenarios option in the Kraken config. Refer to the [node_scenarios_example](https://github.com/openshift-scale/kraken/blob/master/scenarios/node_scenarios_example.yml) config file. 
+ +``` +node_scenarios: + - actions: # node chaos scenarios to be injected + - node_stop_start_scenario + - stop_start_kubelet_scenario + - node_crash_scenario + node_name: # node on which scenario has to be injected + label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection + instance_kill_count: 1 # number of times to inject each scenario under actions + timeout: 120 # duration to wait for completion of node scenario injection + cloud_type: aws # cloud type on which Kubernetes/OpenShift runs + - actions: + - node_reboot_scenario + node_name: + label_selector: node-role.kubernetes.io/infra + instance_kill_count: 1 + timeout: 120 + cloud_type: aws +``` diff --git a/docs/pod_scenarios.md b/docs/pod_scenarios.md new file mode 100644 index 00000000..a7efffb4 --- /dev/null +++ b/docs/pod_scenarios.md @@ -0,0 +1,16 @@ +### Pod Scenarios +Kraken uses [PowerfulSeal](https://github.com/powerfulseal/powerfulseal) under the hood to run the pod scenarios. + + +#### Pod chaos scenarios +The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today. Adding a new pod based scenario is as simple as adding a new config under the scenarios directory and defining it in the config. 
+ +Component | Description | Working +------------------------ | ---------------------------------------------------------------------------------------------------| ------------------------- | +Etcd | Kills a single/multiple etcd replicas for the specified number of times in a loop | :heavy_check_mark: | +Kube ApiServer | Kills a single/multiple kube-apiserver replicas for the specified number of times in a loop | :heavy_check_mark: | +ApiServer | Kills a single/multiple apiserver replicas for the specified number of times in a loop | :heavy_check_mark: | +Prometheus | Kills a single/multiple prometheus replicas for the specified number of times in a loop | :heavy_check_mark: | +OpenShift System Pods | Kills random pods running in the OpenShift system namespaces | :heavy_check_mark: | + +**NOTE**: Refer to [Writing policies](https://powerfulseal.github.io/powerfulseal/policies) for more information on how to write new scenarios. diff --git a/media/kraken-workflow.png b/media/kraken-workflow.png index 72f52509..af43ea60 100644 Binary files a/media/kraken-workflow.png and b/media/kraken-workflow.png differ
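
The node scenario rules documented in docs/node_scenarios.md in this change can be sketched as a small pre-flight check. This is a minimal, hypothetical sketch and not part of Kraken: the function name `validate_node_scenarios` and the two sets are made up for illustration. It assumes the config has already been parsed into a Python dict, takes the AWS-only list from the corresponding NOTE, and treats a scenario with neither `node_name` nor `label_selector` as a problem, which is an inference from the config comments rather than a documented rule.

```python
# Hypothetical helper, not part of Kraken: checks a parsed node_scenarios
# config against the rules stated in docs/node_scenarios.md.

# The eight node chaos scenarios listed in the doc.
SUPPORTED_ACTIONS = {
    "node_start_scenario",
    "node_stop_scenario",
    "node_stop_start_scenario",
    "node_termination_scenario",
    "node_reboot_scenario",
    "stop_kubelet_scenario",
    "stop_start_kubelet_scenario",
    "node_crash_scenario",
}

# Scenarios the doc's NOTE lists as supported only on AWS.
AWS_ONLY_ACTIONS = {
    "node_start_scenario",
    "node_stop_scenario",
    "node_stop_start_scenario",
    "node_termination_scenario",
    "node_reboot_scenario",
    "stop_start_kubelet_scenario",
}


def validate_node_scenarios(config):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for index, scenario in enumerate(config.get("node_scenarios") or []):
        for action in scenario.get("actions") or []:
            if action not in SUPPORTED_ACTIONS:
                problems.append(f"scenario {index}: unknown action {action!r}")
            elif action in AWS_ONLY_ACTIONS and scenario.get("cloud_type") != "aws":
                problems.append(f"scenario {index}: {action} requires cloud_type: aws")
        # Assumption: at least one way of selecting the target node must be set.
        if not scenario.get("node_name") and not scenario.get("label_selector"):
            problems.append(f"scenario {index}: set node_name or label_selector")
    return problems


# Dict equivalent of the first entry in the sample node_scenarios config.
example = {
    "node_scenarios": [
        {
            "actions": [
                "node_stop_start_scenario",
                "stop_start_kubelet_scenario",
                "node_crash_scenario",
            ],
            "node_name": None,
            "label_selector": "node-role.kubernetes.io/worker",
            "instance_kill_count": 1,
            "timeout": 120,
            "cloud_type": "aws",
        }
    ]
}

print(validate_node_scenarios(example))
```

Returning a list of messages instead of raising on the first error lets a caller report every problem in a scenario file in one pass.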