Compare commits

...

20 Commits

Author SHA1 Message Date
Pablo Méndez Hernández
667798d588 Change API from 'Google API Client' to 'Google Cloud Python Client' (#723)
* Document how to use Google's credentials associated with a user account

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>

* Change API from 'Google API Client' to 'Google Cloud Python Client'

According to the 'Google API Client' GH page:

```
This library is considered complete and is in maintenance mode. This means
that we will address critical bugs and security issues but will not add any
new features.

This library is officially supported by Google. However, the maintainers of
this repository recommend using Cloud Client Libraries for Python, where
possible, for new code development.
```

So change the code accordingly to adapt it to 'Google Cloud Python Client'.

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>

---------

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
2024-12-12 22:34:45 -05:00
jtydlack
0c30d89a1b Add node_disk_detach_attach_scenario for aws under node scenarios
Resolves #678

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>

Add functions for aws detach disk scenario

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>

Add detach disk scenario in node scenario

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>

Add disk_detach_attach_scenario in docs

Signed-off-by: jtydlack <139967002+jtydlack@users.noreply.github.com>
2024-12-10 09:21:05 -05:00
Paige Patton
2ba20fa483 adding code block 2024-12-05 12:37:43 -05:00
Paige Patton
97035a765c adding get node name list changes
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-26 10:34:25 -05:00
Paige Patton
10ba53574e not equal to gcp
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-15 09:31:09 -07:00
Paige Patton
0ecba41082 adding multi label comment 2024-11-12 10:34:09 -07:00
Paige Patton
491f59d152 few small changes
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-12 10:34:09 -07:00
Tullio Sebastiani
2549c9a146 bump werkzeug to 3.0.6 to fix cve on krkn-hub baseimage 2024-11-12 09:42:50 -07:00
Henrick Goldwurm
949f1f09e0 Add support for user-provided default network ACL (#731)
* Add support for user-provided default network ACL

Signed-off-by: henrick <self@thehenrick.com>

* Add logs to notify user when their provided acl is used

Signed-off-by: henrick <self@thehenrick.com>

* Update docs to include optional default_acl_id parameter in zone_outage

Signed-off-by: henrick <self@thehenrick.com>

---------

Signed-off-by: henrick <self@thehenrick.com>
Co-authored-by: henrick <self@thehenrick.com>
2024-11-06 12:58:25 -05:00
Naga Ravi Chaitanya Elluri
959766254d Update status of the relevant work items under roadmap
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-11-04 08:36:11 -05:00
Paige Patton
0e68dedb12 adding ibm shut down scenario (#697)
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: Auto User <auto@users.noreply.github.com>
Signed-off-by: Paige Patton <prubenda@redhat.com>
2024-11-01 15:16:07 -04:00
Tullio Sebastiani
34a676a795 block_size parameter for dd (#719)
removed log

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-28 11:45:33 -04:00
Naga Ravi Chaitanya Elluri
e5c5b35db3 Update kube-burner references to krkn
Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-10-28 11:03:52 -04:00
Pablo Méndez Hernández
93d2e60386 Fix typo in docs index
Replace "oraganization" with "organization" in table of contents.

Signed-off-by: Pablo Méndez Hernández <pablomh@redhat.com>
2024-10-24 15:10:55 -04:00
Naga Ravi Chaitanya Elluri
462c9ac67e Rename test suite name to chaos-krkn
This is needed for the TRT/component readiness integration to improve
dashboard readability and tie results back to chaos.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-10-21 14:38:37 -04:00
Tullio Sebastiani
04e44738d9 updated deprecated upload artifact action (#717)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-11 17:03:24 +02:00
Tullio Sebastiani
f810cadad2 Fixes the Plugin scenario schema error (#718)
* reformatting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* schema refactoring

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

* plugin refactoring

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

---------

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-10 09:59:53 -04:00
Tullio Sebastiani
4b869bad83 added fallback on dd if fallocate is not in the $PATH (#716)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-10 11:15:03 +02:00
Matt Leader
a36b0c76b2 OCP Chaos Arcaflow Workflow (#699)
* add workflows

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* update readme

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* rm my kubeconfig path

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* add workflow details to readme

Signed-off-by: Matthew F Leader <mleader@redhat.com>

* mv arcaflow to utils

Signed-off-by: Matthew F Leader <mleader@redhat.com>

---------

Signed-off-by: Matthew F Leader <mleader@redhat.com>
2024-10-09 14:46:08 -04:00
Tullio Sebastiani
a17e16390c cluster events check removed from funtest (deprecated krkn-lib v4.0.0)
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-10-09 10:19:24 -04:00
34 changed files with 1440 additions and 739 deletions

View File

@@ -126,7 +126,7 @@ jobs:
cat ./CI/results.markdown >> $GITHUB_STEP_SUMMARY
echo >> $GITHUB_STEP_SUMMARY
- name: Upload CI logs
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: ci-logs
path: CI/out
@@ -140,13 +140,13 @@ jobs:
pip install html2text
html2text --ignore-images --ignore-links -b 0 htmlcov/index.html >> $GITHUB_STEP_SUMMARY
- name: Upload coverage data
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: coverage
path: htmlcov
if-no-files-found: error
- name: Upload json coverage
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: coverage.json
path: coverage.json
@@ -169,7 +169,7 @@ jobs:
path: krkn-lib-docs
ssh-key: ${{ secrets.KRKN_LIB_DOCS_PRIV_KEY }}
- name: Download json coverage
uses: actions/download-artifact@v4.1.7
uses: actions/download-artifact@v4
with:
name: coverage.json
- name: Set up Python

View File

@@ -26,7 +26,6 @@ function functional_test_telemetry {
RUN_FOLDER=`cat CI/out/test_telemetry.out | grep amazonaws.com | sed -rn "s#.*https:\/\/.*\/files/(.*)#\1#p"`
$AWS_CLI s3 ls "s3://$AWS_BUCKET/$RUN_FOLDER/" | awk '{ print $4 }' > s3_remote_files
echo "checking if telemetry files are uploaded on s3"
cat s3_remote_files | grep events-00.json || ( echo "FAILED: events-00.json not uploaded" && exit 1 )
cat s3_remote_files | grep critical-alerts-00.log || ( echo "FAILED: critical-alerts-00.log not uploaded" && exit 1 )
cat s3_remote_files | grep prometheus-00.tar || ( echo "FAILED: prometheus backup not uploaded" && exit 1 )
cat s3_remote_files | grep telemetry.json || ( echo "FAILED: telemetry.json not uploaded" && exit 1 )

View File

@@ -6,10 +6,11 @@ Following are a list of enhancements that we are planning to work on adding supp
- [x] [Centralized storage for chaos experiments artifacts](https://github.com/krkn-chaos/krkn/issues/423)
- [ ] [Support for causing DNS outages](https://github.com/krkn-chaos/krkn/issues/394)
- [x] [Chaos recommender](https://github.com/krkn-chaos/krkn/tree/main/utils/chaos-recommender) to suggest scenarios having probability of impacting the service under test using profiling results
- [ ] Chaos AI integration to improve and automate test coverage
- [ ] Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time
- [x] [Support for pod level network traffic shaping](https://github.com/krkn-chaos/krkn/issues/393)
- [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/krkn-chaos/krkn/issues/124)
- [ ] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [ ] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- [ ] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [ ] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [x] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- [x] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [x] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)

View File

@@ -38,11 +38,11 @@ A couple of [alert profiles](https://github.com/redhat-chaos/krkn/tree/main/conf
severity: critical
```
Kube-burner supports setting the severity for the alerts with each one having different effects:
Krkn supports setting the severity for the alerts with each one having different effects:
```
info: Prints an info message with the alarm description to stdout. By default all expressions have this severity.
warning: Prints a warning message with the alarm description to stdout.
error: Prints an error message with the alarm description to stdout and makes kube-burner rc = 1
error: Prints an error message with the alarm description to stdout and sets Krkn rc = 1
critical: Prints a fatal message with the alarm description to stdout and exits execution immediately with rc != 0
```
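For illustration only, here is a minimal Python sketch of how those four severities could map to behaviour; the function and its name are hypothetical and do not come from Krkn's code, they simply mirror the descriptions above.

```
import logging
import sys


def handle_alert(severity: str, description: str) -> int:
    """Illustrative sketch: return the exit-code contribution of a fired alert."""
    if severity == "info":
        logging.info(description)
        return 0
    if severity == "warning":
        logging.warning(description)
        return 0
    if severity == "error":
        logging.error(description)
        return 1  # run continues, but the overall rc becomes 1
    if severity == "critical":
        logging.critical(description)
        sys.exit(1)  # execution stops immediately with rc != 0
    raise ValueError("unknown severity: %s" % severity)
```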

View File

@@ -13,13 +13,26 @@ Supported Cloud Providers:
**NOTE**: For clusters with AWS make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) is installed and properly [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html) using an AWS account
## GCP
**NOTE**: For clusters with GCP make sure [GCP CLI](https://cloud.google.com/sdk/docs/install#linux) is installed.
A Google service account is required to give proper authentication to GCP for node actions. See [here](https://cloud.google.com/docs/authentication/getting-started) for how to create a service account.
In order to set up Application Default Credentials (ADC) for use by Cloud Client Libraries, you can provide either service account credentials or the credentials associated with your user account (a usage sketch follows these steps):
**NOTE**: A user with 'resourcemanager.projects.setIamPolicy' permission is required to grant project-level permissions to the service account.
- Using service account credentials:
After creating the service account you will need to enable the account using the following: ```export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"```
A Google service account is required to give proper authentication to GCP for node actions. See [here](https://cloud.google.com/docs/authentication/getting-started) for how to create a service account.
**NOTE**: A user with 'resourcemanager.projects.setIamPolicy' permission is required to grant project-level permissions to the service account.
After creating the service account you will need to enable the account using the following: ```export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"```
- Using the credentials associated with your user account:
1. Make sure that the [GCP CLI](https://cloud.google.com/sdk/docs/install#linux) is installed and [initialized](https://cloud.google.com/sdk/docs/initializing) by running:
```gcloud init```
2. Create local authentication credentials for your user account:
```gcloud auth application-default login```
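With ADC configured through either path above, the Google Cloud Python Client picks the credentials up automatically. The snippet below is a rough usage sketch only; the project, zone and instance names are placeholders, not values used by this repository.

```
from google.cloud import compute_v1


def restart_instance(project: str, zone: str, instance: str) -> None:
    # InstancesClient() resolves credentials through Application Default Credentials.
    client = compute_v1.InstancesClient()
    client.stop(project=project, zone=zone, instance=instance).result()
    client.start(project=project, zone=zone, instance=instance).result()


restart_instance("my-project", "us-central1-a", "my-node")
```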
## Openstack
@@ -32,6 +45,7 @@ After creating the service account you will need to enable the account using the
To run properly, the service principal requires the “Azure Active Directory Graph/Application.ReadWrite.OwnedBy” API permission and “User Access Administrator” to be granted.
Before running you will need to set the following:
1. ```export AZURE_SUBSCRIPTION_ID=<subscription_id>```
2. ```export AZURE_TENANT_ID=<tenant_id>```
@@ -66,9 +80,10 @@ Set the following environment variables
These are the credentials that you would normally use to access the vSphere client.
## IBMCloud
If no api key is set up with proper VPC resource permissions, use the following to create:
If no API key is set up with proper VPC resource permissions, use the following to create it:
* Access group
* Service id with the following access
* With policy **VPC Infrastructure Services**
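Once the API key exists, a quick way to confirm it carries the required VPC access is to list instances with the IBM VPC Python SDK, which is what the IbmCloud class changed in this compare builds on. The snippet is a hedged sketch; the environment variable name and region URL are examples, not krkn configuration keys.

```
import os

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_vpc import VpcV1

authenticator = IAMAuthenticator(os.environ["IBMC_APIKEY"])  # example variable name
service = VpcV1(authenticator=authenticator)
service.set_service_url("https://us-south.iaas.cloud.ibm.com/v1")  # example region URL

# Listing instances succeeds only if the key has VPC Infrastructure Services access.
for instance in service.list_instances().get_result()["instances"]:
    print(instance["name"], instance["id"])
```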

View File

@@ -8,6 +8,7 @@ Current accepted cloud types:
* [GCP](cloud_setup.md#gcp)
* [AWS](cloud_setup.md#aws)
* [Openstack](cloud_setup.md#openstack)
* [IBMCloud](cloud_setup.md#ibmcloud)
```

View File

@@ -11,7 +11,7 @@
* [Scenarios](#scenarios)
* [Test Environment Recommendations - how and where to run chaos tests](#test-environment-recommendations---how-and-where-to-run-chaos-tests)
* [Chaos testing in Practice](#chaos-testing-in-practice)
* [OpenShift oraganization](#openshift-organization)
* [OpenShift organization](#openshift-organization)
* [startx-lab](#startx-lab)

View File

@@ -18,7 +18,7 @@ network_chaos: # Scenario to create an outage
```
##### Sample scenario config for ingress traffic shaping (using a plugin)
'''
```
- id: network_chaos
config:
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
@@ -35,7 +35,7 @@ network_chaos: # Scenario to create an outage
bandwidth: 10mbit
wait_duration: 120
test_duration: 60
'''
```
Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.

View File

@@ -4,7 +4,7 @@ The following node chaos scenarios are supported:
1. **node_start_scenario**: Scenario to start the node instance.
2. **node_stop_scenario**: Scenario to stop the node instance.
3. **node_stop_start_scenario**: Scenario to stop and then start the node instance. Not supported on VMware.
3. **node_stop_start_scenario**: Scenario to stop the node instance for specified duration and then start the node instance. Not supported on VMware.
4. **node_termination_scenario**: Scenario to terminate the node instance.
5. **node_reboot_scenario**: Scenario to reboot the node instance.
6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
@@ -12,6 +12,7 @@ The following node chaos scenarios are supported:
8. **restart_kubelet_scenario**: Scenario to restart the kubelet of the node instance.
9. **node_crash_scenario**: Scenario to crash the node instance.
10. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
11. **node_disk_detach_attach_scenario**: Scenario to detach the node's disk for a specified duration.
**NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
@@ -20,6 +21,8 @@ The following node chaos scenarios are supported:
, node_reboot_scenario and stop_start_kubelet_scenario are supported on AWS, Azure, OpenStack, BareMetal, GCP
, VMware and Alibaba.
**NOTE**: node_disk_detach_attach_scenario is supported only on AWS and cannot detach the root disk.
#### AWS
@@ -57,6 +60,8 @@ kind was primarily designed for testing Kubernetes itself, but may be used for l
#### GCP
Cloud setup instructions can be found [here](cloud_setup.md#gcp). Sample scenario config can be found [here](https://github.com/krkn-chaos/krkn/blob/main/scenarios/openshift/gcp_node_scenarios.yml).
NOTE: The parallel option is not available for GCP; the API does not run these operations concurrently.
#### Openstack

View File

@@ -13,10 +13,12 @@ zone_outage: # Scenario to create an out
duration: 600 # Duration in seconds after which the zone will be back online.
vpc_id: # Cluster virtual private network to target.
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic.
default_acl_id: acl-xxxxxxxx # (Optional) ID of an existing network ACL to use instead of creating a new one. If provided, this ACL will not be deleted after the scenario.
```
**NOTE**: vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).
**NOTE**: Targeting multiple subnets will cause downtime in multiple zones, which might impact cluster health, especially if those zones host control plane components.
**NOTE**: default_acl_id can be obtained from the AWS VPC Console by selecting "Network ACLs" from the left sidebar ( the ID will be in the format 'acl-xxxxxxxx' ). Make sure the selected ACL has the desired ingress/egress rules for your outage scenario ( i.e., deny all ).
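If you prefer to look the ACL up programmatically rather than through the console, a small boto3 query does the same job; the helper below is hypothetical (not part of krkn) and the VPC ID is a placeholder.

```
import boto3


def list_network_acls(vpc_id: str, region: str = "us-west-2") -> list:
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_network_acls(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    # Each ID comes back in the 'acl-xxxxxxxx' format expected by default_acl_id.
    return [acl["NetworkAclId"] for acl in response["NetworkAcls"]]


print(list_network_acls("vpc-0123456789abcdef0"))
```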
##### Debugging steps in case of failures
If the steps that revert the network ACL to allow traffic and bring the zone's cluster nodes back online fail, the nodes in that zone will be left in the `NotReady` condition. Here is how to fix it:

View File

@@ -18,17 +18,14 @@ from kubernetes.client.api.batch_v1_api import BatchV1Api as BatchV1Api
@dataclass
class NetworkScenarioConfig:
node_interface_name: typing.Dict[
str, typing.List[str]
] = field(
node_interface_name: typing.Dict[str, typing.List[str]] = field(
default=None,
metadata={
"name": "Node Interface Name",
"description":
"Dictionary with node names as key and values as a list of "
"their test interfaces. "
"Required if label_selector is not set.",
}
"description": "Dictionary with node names as key and values as a list of "
"their test interfaces. "
"Required if label_selector is not set.",
},
)
label_selector: typing.Annotated[
@@ -37,93 +34,76 @@ class NetworkScenarioConfig:
default=None,
metadata={
"name": "Label selector",
"description":
"Kubernetes label selector for the target nodes. "
"Required if node_interface_name is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
"for details.",
}
)
test_duration: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
default=120,
metadata={
"name": "Test duration",
"description":
"Duration for which each step of the ingress chaos testing "
"is to be performed.",
"description": "Kubernetes label selector for the target nodes. "
"Required if node_interface_name is not set.\n"
"See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ " # noqa
"for details.",
},
)
wait_duration: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
test_duration: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=120,
metadata={
"name": "Test duration",
"description": "Duration for which each step of the ingress chaos testing "
"is to be performed.",
},
)
wait_duration: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=30,
metadata={
"name": "Wait Duration",
"description":
"Wait duration for finishing a test and its cleanup."
"Ensure that it is significantly greater than wait_duration"
}
"description": "Wait duration for finishing a test and its cleanup."
"Ensure that it is significantly greater than wait_duration",
},
)
instance_count: typing.Annotated[
typing.Optional[int],
validation.min(1)
] = field(
instance_count: typing.Annotated[typing.Optional[int], validation.min(1)] = field(
default=1,
metadata={
"name": "Instance Count",
"description":
"Number of nodes to perform action/select that match "
"the label selector.",
}
"description": "Number of nodes to perform action/select that match "
"the label selector.",
},
)
kubeconfig_path: typing.Optional[str] = field(
default=None,
metadata={
"name": "Kubeconfig path",
"description":
"Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
"for details.",
}
"description": "Path to your Kubeconfig file. Defaults to ~/.kube/config.\n"
"See https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/ " # noqa
"for details.",
},
)
execution_type: typing.Optional[str] = field(
default='parallel',
default="parallel",
metadata={
"name": "Execution Type",
"description":
"The order in which the ingress filters are applied. "
"Execution type can be 'serial' or 'parallel'"
}
"description": "The order in which the ingress filters are applied. "
"Execution type can be 'serial' or 'parallel'",
},
)
network_params: typing.Dict[str, str] = field(
default=None,
metadata={
"name": "Network Parameters",
"description":
"The network filters that are applied on the interface. "
"The currently supported filters are latency, "
"loss and bandwidth"
}
"description": "The network filters that are applied on the interface. "
"The currently supported filters are latency, "
"loss and bandwidth",
},
)
kraken_config: typing.Optional[str] = field(
default='',
default="",
metadata={
"name": "Kraken Config",
"description":
"Path to the config file of Kraken. "
"Set this field if you wish to publish status onto Cerberus"
}
"description": "Path to the config file of Kraken. "
"Set this field if you wish to publish status onto Cerberus",
},
)
@@ -132,33 +112,30 @@ class NetworkScenarioSuccessOutput:
filter_direction: str = field(
metadata={
"name": "Filter Direction",
"description":
"Direction in which the traffic control filters are applied "
"on the test interfaces"
"description": "Direction in which the traffic control filters are applied "
"on the test interfaces",
}
)
test_interfaces: typing.Dict[str, typing.List[str]] = field(
metadata={
"name": "Test Interfaces",
"description":
"Dictionary of nodes and their interfaces on which "
"the chaos experiment was performed"
"description": "Dictionary of nodes and their interfaces on which "
"the chaos experiment was performed",
}
)
network_parameters: typing.Dict[str, str] = field(
metadata={
"name": "Network Parameters",
"description":
"The network filters that are applied on the interfaces"
"description": "The network filters that are applied on the interfaces",
}
)
execution_type: str = field(
metadata={
"name": "Execution Type",
"description": "The order in which the filters are applied"
"description": "The order in which the filters are applied",
}
)
@@ -168,18 +145,13 @@ class NetworkScenarioErrorOutput:
error: str = field(
metadata={
"name": "Error",
"description":
"Error message when there is a run-time error during "
"the execution of the scenario"
"description": "Error message when there is a run-time error during "
"the execution of the scenario",
}
)
def get_default_interface(
node: str,
pod_template,
cli: CoreV1Api
) -> str:
def get_default_interface(node: str, pod_template, cli: CoreV1Api) -> str:
"""
Function that returns a random interface from a node
@@ -210,9 +182,9 @@ def get_default_interface(
logging.error("Exception occurred while executing command in pod")
sys.exit(1)
routes = output.split('\n')
routes = output.split("\n")
for route in routes:
if 'default' in route:
if "default" in route:
default_route = route
break
@@ -226,10 +198,7 @@ def get_default_interface(
def verify_interface(
input_interface_list: typing.List[str],
node: str,
pod_template,
cli: CoreV1Api
input_interface_list: typing.List[str], node: str, pod_template, cli: CoreV1Api
) -> typing.List[str]:
"""
Function that verifies whether a list of interfaces is present in the node.
@@ -258,22 +227,15 @@ def verify_interface(
try:
if input_interface_list == []:
cmd = ["ip", "r"]
output = kube_helper.exec_cmd_in_pod(
cli,
cmd,
"fedtools",
"default"
)
output = kube_helper.exec_cmd_in_pod(cli, cmd, "fedtools", "default")
if not output:
logging.error(
"Exception occurred while executing command in pod"
)
logging.error("Exception occurred while executing command in pod")
sys.exit(1)
routes = output.split('\n')
routes = output.split("\n")
for route in routes:
if 'default' in route:
if "default" in route:
default_route = route
break
@@ -281,20 +243,13 @@ def verify_interface(
else:
cmd = ["ip", "-br", "addr", "show"]
output = kube_helper.exec_cmd_in_pod(
cli,
cmd,
"fedtools",
"default"
)
output = kube_helper.exec_cmd_in_pod(cli, cmd, "fedtools", "default")
if not output:
logging.error(
"Exception occurred while executing command in pod"
)
logging.error("Exception occurred while executing command in pod")
sys.exit(1)
interface_ip = output.split('\n')
interface_ip = output.split("\n")
node_interface_list = [
interface.split()[0] for interface in interface_ip[:-1]
]
@@ -302,12 +257,12 @@ def verify_interface(
for interface in input_interface_list:
if interface not in node_interface_list:
logging.error(
"Interface %s not found in node %s interface list %s" %
(interface, node, node_interface_list)
"Interface %s not found in node %s interface list %s"
% (interface, node, node_interface_list)
)
raise Exception(
"Interface %s not found in node %s interface list %s" %
(interface, node, node_interface_list)
"Interface %s not found in node %s interface list %s"
% (interface, node, node_interface_list)
)
finally:
logging.info("Deleteing pod to query interface on node")
@@ -321,9 +276,8 @@ def get_node_interfaces(
label_selector: str,
instance_count: int,
pod_template,
cli: CoreV1Api
cli: CoreV1Api,
) -> typing.Dict[str, typing.List[str]]:
"""
Function that is used to process the input dictionary with the nodes and
its test interfaces.
@@ -364,11 +318,7 @@ def get_node_interfaces(
nodes = kube_helper.get_node(None, label_selector, instance_count, cli)
node_interface_dict = {}
for node in nodes:
node_interface_dict[node] = get_default_interface(
node,
pod_template,
cli
)
node_interface_dict[node] = get_default_interface(node, pod_template, cli)
else:
node_name_list = node_interface_dict.keys()
filtered_node_list = []
@@ -395,9 +345,8 @@ def apply_ingress_filter(
batch_cli: BatchV1Api,
cli: CoreV1Api,
create_interfaces: bool = True,
param_selector: str = 'all'
param_selector: str = "all",
) -> str:
"""
Function that applies the filters to shape incoming traffic to
the provided node's interfaces.
@@ -438,22 +387,18 @@ def apply_ingress_filter(
"""
network_params = cfg.network_params
if param_selector != 'all':
if param_selector != "all":
network_params = {param_selector: cfg.network_params[param_selector]}
if create_interfaces:
create_virtual_interfaces(cli, interface_list, node, pod_template)
exec_cmd = get_ingress_cmd(
interface_list, network_params, duration=cfg.test_duration
)
interface_list, network_params, duration=cfg.test_duration
)
logging.info("Executing %s on node %s" % (exec_cmd, node))
job_body = yaml.safe_load(
job_template.render(
jobname=str(hash(node))[:5],
nodename=node,
cmd=exec_cmd
)
job_template.render(jobname=str(hash(node))[:5], nodename=node, cmd=exec_cmd)
)
api_response = kube_helper.create_job(batch_cli, job_body)
@@ -464,10 +409,7 @@ def apply_ingress_filter(
def create_virtual_interfaces(
cli: CoreV1Api,
interface_list: typing.List[str],
node: str,
pod_template
cli: CoreV1Api, interface_list: typing.List[str], node: str, pod_template
) -> None:
"""
Function that creates a privileged pod and uses it to create
@@ -488,25 +430,20 @@ def create_virtual_interfaces(
- The YAML template used to instantiate a pod to create
virtual interfaces on the node
"""
pod_body = yaml.safe_load(
pod_template.render(nodename=node)
)
pod_body = yaml.safe_load(pod_template.render(nodename=node))
kube_helper.create_pod(cli, pod_body, "default", 300)
logging.info(
"Creating {0} virtual interfaces on node {1} using a pod".format(
len(interface_list),
node
len(interface_list), node
)
)
create_ifb(cli, len(interface_list), 'modtools')
create_ifb(cli, len(interface_list), "modtools")
logging.info("Deleting pod used to create virtual interfaces")
kube_helper.delete_pod(cli, "modtools", "default")
def delete_virtual_interfaces(
cli: CoreV1Api,
node_list: typing.List[str],
pod_template
cli: CoreV1Api, node_list: typing.List[str], pod_template
):
"""
Function that creates a privileged pod and uses it to delete all
@@ -529,14 +466,10 @@ def delete_virtual_interfaces(
"""
for node in node_list:
pod_body = yaml.safe_load(
pod_template.render(nodename=node)
)
pod_body = yaml.safe_load(pod_template.render(nodename=node))
kube_helper.create_pod(cli, pod_body, "default", 300)
logging.info(
"Deleting all virtual interfaces on node {0}".format(node)
)
delete_ifb(cli, 'modtools')
logging.info("Deleting all virtual interfaces on node {0}".format(node))
delete_ifb(cli, "modtools")
kube_helper.delete_pod(cli, "modtools", "default")
@@ -546,21 +479,13 @@ def create_ifb(cli: CoreV1Api, number: int, pod_name: str):
Makes use of modprobe commands
"""
exec_command = [
'chroot', '/host',
'modprobe', 'ifb', 'numifbs=' + str(number)
]
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
exec_command = ["chroot", "/host", "modprobe", "ifb", "numifbs=" + str(number)]
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, "default")
for i in range(0, number):
exec_command = ['chroot', '/host', 'ip', 'link', 'set', 'dev']
exec_command += ['ifb' + str(i), 'up']
kube_helper.exec_cmd_in_pod(
cli,
exec_command,
pod_name,
'default'
)
exec_command = ["chroot", "/host", "ip", "link", "set", "dev"]
exec_command += ["ifb" + str(i), "up"]
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, "default")
def delete_ifb(cli: CoreV1Api, pod_name: str):
@@ -569,8 +494,8 @@ def delete_ifb(cli: CoreV1Api, pod_name: str):
Makes use of modprobe command
"""
exec_command = ['chroot', '/host', 'modprobe', '-r', 'ifb']
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, 'default')
exec_command = ["chroot", "/host", "modprobe", "-r", "ifb"]
kube_helper.exec_cmd_in_pod(cli, exec_command, pod_name, "default")
def get_job_pods(cli: CoreV1Api, api_response):
@@ -591,18 +516,14 @@ def get_job_pods(cli: CoreV1Api, api_response):
controllerUid = api_response.metadata.labels["controller-uid"]
pod_label_selector = "controller-uid=" + controllerUid
pods_list = kube_helper.list_pods(
cli,
label_selector=pod_label_selector,
namespace="default"
cli, label_selector=pod_label_selector, namespace="default"
)
return pods_list[0]
def wait_for_job(
batch_cli: BatchV1Api,
job_list: typing.List[str],
timeout: int = 300
batch_cli: BatchV1Api, job_list: typing.List[str], timeout: int = 300
) -> None:
"""
Function that waits for a list of jobs to finish within a time period
@@ -625,13 +546,11 @@ def wait_for_job(
for job_name in job_list:
try:
api_response = kube_helper.get_job_status(
batch_cli,
job_name,
namespace="default"
batch_cli, job_name, namespace="default"
)
if (
api_response.status.succeeded is not None or
api_response.status.failed is not None
api_response.status.succeeded is not None
or api_response.status.failed is not None
):
count += 1
job_list.remove(job_name)
@@ -645,11 +564,7 @@ def wait_for_job(
time.sleep(5)
def delete_jobs(
cli: CoreV1Api,
batch_cli: BatchV1Api,
job_list: typing.List[str]
):
def delete_jobs(cli: CoreV1Api, batch_cli: BatchV1Api, job_list: typing.List[str]):
"""
Function that deletes jobs
@@ -667,38 +582,28 @@ def delete_jobs(
for job_name in job_list:
try:
api_response = kube_helper.get_job_status(
batch_cli,
job_name,
namespace="default"
batch_cli, job_name, namespace="default"
)
if api_response.status.failed is not None:
pod_name = get_job_pods(cli, api_response)
pod_stat = kube_helper.read_pod(
cli,
name=pod_name,
namespace="default"
)
pod_stat = kube_helper.read_pod(cli, name=pod_name, namespace="default")
logging.error(pod_stat.status.container_statuses)
pod_log_response = kube_helper.get_pod_log(
cli,
name=pod_name,
namespace="default"
cli, name=pod_name, namespace="default"
)
pod_log = pod_log_response.data.decode("utf-8")
logging.error(pod_log)
except Exception as e:
logging.warn("Exception in getting job status: %s" % str(e))
api_response = kube_helper.delete_job(
batch_cli,
name=job_name,
namespace="default"
batch_cli, name=job_name, namespace="default"
)
def get_ingress_cmd(
interface_list: typing.List[str],
network_parameters: typing.Dict[str, str],
duration: int = 300
duration: int = 300,
):
"""
Function that returns the commands to the ingress traffic shaping on
@@ -736,9 +641,7 @@ def get_ingress_cmd(
for i, interface in enumerate(interface_list):
if not interface_pattern.match(interface):
logging.error(
"Interface name can only consist of alphanumeric characters"
)
logging.error("Interface name can only consist of alphanumeric characters")
raise Exception(
"Interface '{0}' does not match the required regex pattern :"
r" ^[a-z0-9\-\@\_]+$".format(interface)
@@ -752,33 +655,23 @@ def get_ingress_cmd(
"follow the regex pattern ^ifb[0-9]+$".format(ifb_name)
)
tc_set += "tc qdisc add dev {0} handle ffff: ingress;".format(
interface
)
tc_set += "tc qdisc add dev {0} handle ffff: ingress;".format(interface)
tc_set += "tc filter add dev {0} parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev {1};".format( # noqa
interface,
ifb_name
interface, ifb_name
)
tc_set = "{0} tc qdisc add dev {1} root netem".format(tc_set, ifb_name)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, ifb_name)
tc_unset += "tc qdisc del dev {0} handle ffff: ingress;".format(
interface
)
tc_unset += "tc qdisc del dev {0} handle ffff: ingress;".format(interface)
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, ifb_name)
for parameter in network_parameters.keys():
tc_set += " {0} {1} ".format(
param_map[parameter],
network_parameters[parameter]
param_map[parameter], network_parameters[parameter]
)
tc_set += ";"
exec_cmd = "{0} {1} sleep {2};{3} sleep 20;{4}".format(
tc_set,
tc_ls,
duration,
tc_unset,
tc_ls
tc_set, tc_ls, duration, tc_unset, tc_ls
)
return exec_cmd
@@ -790,17 +683,14 @@ def get_ingress_cmd(
description="Applies filters to ihe ingress side of node(s) interfaces",
outputs={
"success": NetworkScenarioSuccessOutput,
"error": NetworkScenarioErrorOutput
"error": NetworkScenarioErrorOutput,
},
)
def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
str,
typing.Union[
NetworkScenarioSuccessOutput,
NetworkScenarioErrorOutput
]
def network_chaos(
cfg: NetworkScenarioConfig,
) -> typing.Tuple[
str, typing.Union[NetworkScenarioSuccessOutput, NetworkScenarioErrorOutput]
]:
"""
Function that performs the ingress network chaos scenario based
on the provided configuration
@@ -826,12 +716,10 @@ def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
cfg.label_selector,
cfg.instance_count,
pod_interface_template,
cli
cli,
)
except Exception:
return "error", NetworkScenarioErrorOutput(
format_exc()
)
return "error", NetworkScenarioErrorOutput(format_exc())
job_list = []
publish = False
if cfg.kraken_config:
@@ -840,16 +728,12 @@ def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
with open(cfg.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error(
"Error reading Kraken config from %s" % cfg.kraken_config
)
return "error", NetworkScenarioErrorOutput(
format_exc()
)
logging.error("Error reading Kraken config from %s" % cfg.kraken_config)
return "error", NetworkScenarioErrorOutput(format_exc())
publish = True
try:
if cfg.execution_type == 'parallel':
if cfg.execution_type == "parallel":
for node in node_interface_dict:
job_list.append(
apply_ingress_filter(
@@ -859,22 +743,19 @@ def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
pod_module_template,
job_template,
batch_cli,
cli
cli,
)
)
logging.info("Waiting for parallel job to finish")
start_time = int(time.time())
wait_for_job(batch_cli, job_list[:], cfg.test_duration+100)
wait_for_job(batch_cli, job_list[:], cfg.test_duration + 100)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
config, failed_post_scenarios, start_time, end_time
)
elif cfg.execution_type == 'serial':
elif cfg.execution_type == "serial":
create_interfaces = True
for param in cfg.network_params:
for node in node_interface_dict:
@@ -888,50 +769,39 @@ def network_chaos(cfg: NetworkScenarioConfig) -> typing.Tuple[
batch_cli,
cli,
create_interfaces=create_interfaces,
param_selector=param
param_selector=param,
)
)
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
wait_for_job(batch_cli, job_list[:], cfg.test_duration+100)
wait_for_job(batch_cli, job_list[:], cfg.test_duration + 100)
logging.info("Deleting jobs")
delete_jobs(cli, batch_cli, job_list[:])
job_list = []
logging.info(
"Waiting for wait_duration : %ss" % cfg.wait_duration
)
logging.info("Waiting for wait_duration : %ss" % cfg.wait_duration)
time.sleep(cfg.wait_duration)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config,
failed_post_scenarios,
start_time,
end_time
config, failed_post_scenarios, start_time, end_time
)
create_interfaces = False
else:
return "error", NetworkScenarioErrorOutput(
"Invalid execution type - serial and parallel are "
"the only accepted types"
)
"Invalid execution type - serial and parallel are "
"the only accepted types"
)
return "success", NetworkScenarioSuccessOutput(
filter_direction="ingress",
test_interfaces=node_interface_dict,
network_parameters=cfg.network_params,
execution_type=cfg.execution_type
execution_type=cfg.execution_type,
)
except Exception as e:
logging.error("Network Chaos exiting due to Exception - %s" % e)
return "error", NetworkScenarioErrorOutput(
format_exc()
)
return "error", NetworkScenarioErrorOutput(format_exc())
finally:
delete_virtual_interfaces(
cli,
node_interface_dict.keys(),
pod_module_template
)
delete_virtual_interfaces(cli, node_interface_dict.keys(), pod_module_template)
logging.info("Deleting jobs(if any)")
delete_jobs(cli, batch_cli, job_list[:])

View File

@@ -34,7 +34,16 @@ class IbmCloud:
self.service.set_service_url(service_url)
except Exception as e:
logging.error("error authenticating" + str(e))
sys.exit(1)
# Get the instance ID of the node
def get_instance_id(self, node_name):
node_list = self.list_instances()
for node in node_list:
if node_name == node["vpc_name"]:
return node["vpc_id"]
logging.error("Couldn't find node with name " + str(node_name) + ", you could try another region")
sys.exit(1)
def delete_instance(self, instance_id):
"""

View File

@@ -42,8 +42,7 @@ def get_test_pods(
pod names (string) in the namespace
"""
pods_list = []
pods_list = kubecli.list_pods(
label_selector=pod_label, namespace=namespace)
pods_list = kubecli.list_pods(label_selector=pod_label, namespace=namespace)
if pod_name and pod_name not in pods_list:
raise Exception("pod name not found in namespace ")
elif pod_name and pod_name in pods_list:
@@ -92,8 +91,7 @@ def delete_jobs(kubecli: KrknKubernetes, job_list: typing.List[str]):
for job_name in job_list:
try:
api_response = kubecli.get_job_status(
job_name, namespace="default")
api_response = kubecli.get_job_status(job_name, namespace="default")
if api_response.status.failed is not None:
pod_name = get_job_pods(kubecli, api_response)
pod_stat = kubecli.read_pod(name=pod_name, namespace="default")
@@ -131,8 +129,7 @@ def wait_for_job(
while count != job_len:
for job_name in job_list:
try:
api_response = kubecli.get_job_status(
job_name, namespace="default")
api_response = kubecli.get_job_status(job_name, namespace="default")
if (
api_response.status.succeeded is not None
or api_response.status.failed is not None
@@ -149,8 +146,7 @@ def wait_for_job(
time.sleep(5)
def get_bridge_name(cli: ApiextensionsV1Api,
custom_obj: CustomObjectsApi) -> str:
def get_bridge_name(cli: ApiextensionsV1Api, custom_obj: CustomObjectsApi) -> str:
"""
Function that gets OVS bridge present in node.
@@ -328,16 +324,13 @@ def apply_ingress_policy(
create_virtual_interfaces(kubecli, len(ips), node, pod_template)
for count, pod_ip in enumerate(set(ips)):
pod_inf = get_pod_interface(
node, pod_ip, pod_template, bridge_name, kubecli)
pod_inf = get_pod_interface(node, pod_ip, pod_template, bridge_name, kubecli)
exec_cmd = get_ingress_cmd(
test_execution, pod_inf, mod, count, network_params, duration
)
logging.info("Executing %s on pod %s in node %s" %
(exec_cmd, pod_ip, node))
logging.info("Executing %s on pod %s in node %s" % (exec_cmd, pod_ip, node))
job_body = yaml.safe_load(
job_template.render(jobname=mod + str(pod_ip),
nodename=node, cmd=exec_cmd)
job_template.render(jobname=mod + str(pod_ip), nodename=node, cmd=exec_cmd)
)
job_list.append(job_body["metadata"]["name"])
api_response = kubecli.create_job(job_body)
@@ -405,16 +398,13 @@ def apply_net_policy(
job_list = []
for pod_ip in set(ips):
pod_inf = get_pod_interface(
node, pod_ip, pod_template, bridge_name, kubecli)
pod_inf = get_pod_interface(node, pod_ip, pod_template, bridge_name, kubecli)
exec_cmd = get_egress_cmd(
test_execution, pod_inf, mod, network_params, duration
)
logging.info("Executing %s on pod %s in node %s" %
(exec_cmd, pod_ip, node))
logging.info("Executing %s on pod %s in node %s" % (exec_cmd, pod_ip, node))
job_body = yaml.safe_load(
job_template.render(jobname=mod + str(pod_ip),
nodename=node, cmd=exec_cmd)
job_template.render(jobname=mod + str(pod_ip), nodename=node, cmd=exec_cmd)
)
job_list.append(job_body["metadata"]["name"])
api_response = kubecli.create_job(job_body)
@@ -456,18 +446,16 @@ def get_ingress_cmd(
Returns:
str: ingress filter
"""
ifb_dev = 'ifb{0}'.format(count)
ifb_dev = "ifb{0}".format(count)
tc_set = tc_unset = tc_ls = ""
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
tc_set = "tc qdisc add dev {0} ingress ;".format(test_interface)
tc_set = "{0} tc filter add dev {1} ingress matchall action mirred egress redirect dev {2} ;".format(
tc_set, test_interface, ifb_dev)
tc_set = "{0} tc qdisc replace dev {1} root netem".format(
tc_set, ifb_dev)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(
tc_unset, ifb_dev)
tc_unset = "{0} tc qdisc del dev {1} ingress".format(
tc_unset, test_interface)
tc_set, test_interface, ifb_dev
)
tc_set = "{0} tc qdisc replace dev {1} root netem".format(tc_set, ifb_dev)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, ifb_dev)
tc_unset = "{0} tc qdisc del dev {1} ingress".format(tc_unset, test_interface)
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, ifb_dev)
if execution == "parallel":
for val in vallst.keys():
@@ -475,8 +463,7 @@ def get_ingress_cmd(
tc_set += ";"
else:
tc_set += " {0} {1} ;".format(param_map[mod], vallst[mod])
exec_cmd = "{0} {1} sleep {2};{3}".format(
tc_set, tc_ls, duration, tc_unset)
exec_cmd = "{0} {1} sleep {2};{3}".format(tc_set, tc_ls, duration, tc_unset)
return exec_cmd
@@ -512,10 +499,8 @@ def get_egress_cmd(
"""
tc_set = tc_unset = tc_ls = ""
param_map = {"latency": "delay", "loss": "loss", "bandwidth": "rate"}
tc_set = "{0} tc qdisc replace dev {1} root netem".format(
tc_set, test_interface)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(
tc_unset, test_interface)
tc_set = "{0} tc qdisc replace dev {1} root netem".format(tc_set, test_interface)
tc_unset = "{0} tc qdisc del dev {1} root ;".format(tc_unset, test_interface)
tc_ls = "{0} tc qdisc ls dev {1} ;".format(tc_ls, test_interface)
if execution == "parallel":
for val in vallst.keys():
@@ -523,17 +508,13 @@ def get_egress_cmd(
tc_set += ";"
else:
tc_set += " {0} {1} ;".format(param_map[mod], vallst[mod])
exec_cmd = "{0} {1} sleep {2};{3}".format(
tc_set, tc_ls, duration, tc_unset)
exec_cmd = "{0} {1} sleep {2};{3}".format(tc_set, tc_ls, duration, tc_unset)
return exec_cmd
def create_virtual_interfaces(
kubecli: KrknKubernetes,
nummber: int,
node: str,
pod_template
kubecli: KrknKubernetes, nummber: int, node: str, pod_template
) -> None:
"""
Function that creates a privileged pod and uses it to create
@@ -554,25 +535,18 @@ def create_virtual_interfaces(
- The YAML template used to instantiate a pod to create
virtual interfaces on the node
"""
pod_body = yaml.safe_load(
pod_template.render(nodename=node)
)
pod_body = yaml.safe_load(pod_template.render(nodename=node))
kubecli.create_pod(pod_body, "default", 300)
logging.info(
"Creating {0} virtual interfaces on node {1} using a pod".format(
nummber,
node
)
"Creating {0} virtual interfaces on node {1} using a pod".format(nummber, node)
)
create_ifb(kubecli, nummber, 'modtools')
create_ifb(kubecli, nummber, "modtools")
logging.info("Deleting pod used to create virtual interfaces")
kubecli.delete_pod("modtools", "default")
def delete_virtual_interfaces(
kubecli: KrknKubernetes,
node_list: typing.List[str],
pod_template
kubecli: KrknKubernetes, node_list: typing.List[str], pod_template
):
"""
Function that creates a privileged pod and uses it to delete all
@@ -595,14 +569,10 @@ def delete_virtual_interfaces(
"""
for node in node_list:
pod_body = yaml.safe_load(
pod_template.render(nodename=node)
)
pod_body = yaml.safe_load(pod_template.render(nodename=node))
kubecli.create_pod(pod_body, "default", 300)
logging.info(
"Deleting all virtual interfaces on node {0}".format(node)
)
delete_ifb(kubecli, 'modtools')
logging.info("Deleting all virtual interfaces on node {0}".format(node))
delete_ifb(kubecli, "modtools")
kubecli.delete_pod("modtools", "default")
@@ -612,24 +582,14 @@ def create_ifb(kubecli: KrknKubernetes, number: int, pod_name: str):
Makes use of modprobe commands
"""
exec_command = [
'/host',
'modprobe', 'ifb', 'numifbs=' + str(number)
]
kubecli.exec_cmd_in_pod(
exec_command,
pod_name,
'default',
base_command="chroot")
exec_command = ["/host", "modprobe", "ifb", "numifbs=" + str(number)]
kubecli.exec_cmd_in_pod(exec_command, pod_name, "default", base_command="chroot")
for i in range(0, number):
exec_command = ['/host', 'ip', 'link', 'set', 'dev']
exec_command += ['ifb' + str(i), 'up']
exec_command = ["/host", "ip", "link", "set", "dev"]
exec_command += ["ifb" + str(i), "up"]
kubecli.exec_cmd_in_pod(
exec_command,
pod_name,
'default',
base_command="chroot"
exec_command, pod_name, "default", base_command="chroot"
)
@@ -639,17 +599,11 @@ def delete_ifb(kubecli: KrknKubernetes, pod_name: str):
Makes use of modprobe command
"""
exec_command = ['/host', 'modprobe', '-r', 'ifb']
kubecli.exec_cmd_in_pod(
exec_command,
pod_name,
'default',
base_command="chroot")
exec_command = ["/host", "modprobe", "-r", "ifb"]
kubecli.exec_cmd_in_pod(exec_command, pod_name, "default", base_command="chroot")
def list_bridges(
node: str, pod_template, kubecli: KrknKubernetes
) -> typing.List[str]:
def list_bridges(node: str, pod_template, kubecli: KrknKubernetes) -> typing.List[str]:
"""
Function that returns a list of bridges on the node
@@ -787,7 +741,7 @@ def get_pod_interface(
find_ip = f"external-ids:ip_addresses={ip}/23"
else:
find_ip = f"external-ids:ip={ip}"
cmd = [
"/host",
"ovs-vsctl",
@@ -797,24 +751,20 @@ def get_pod_interface(
"interface",
find_ip,
]
output = kubecli.exec_cmd_in_pod(
cmd, "modtools", "default", base_command="chroot"
)
if not output:
cmd= [
"/host",
"ip",
"addr",
"show"
]
cmd = ["/host", "ip", "addr", "show"]
output = kubecli.exec_cmd_in_pod(
cmd, "modtools", "default", base_command="chroot")
cmd, "modtools", "default", base_command="chroot"
)
for if_str in output.split("\n"):
if re.search(ip,if_str):
inf = if_str.split(' ')[-1]
if re.search(ip, if_str):
inf = if_str.split(" ")[-1]
else:
inf = output
inf = output
finally:
logging.info("Deleting pod to query interface on node")
kubecli.delete_pod("modtools", "default")
@@ -927,11 +877,11 @@ class InputParams:
},
)
kraken_config: typing.Optional[str] = field(
kraken_config: typing.Dict[str, typing.Any] = field(
default=None,
metadata={
"name": "Kraken Config",
"description": "Path to the config file of Kraken. "
"description": "Kraken config file dictionary "
"Set this field if you wish to publish status onto Cerberus",
},
)
@@ -1043,14 +993,6 @@ def pod_outage(
publish = False
if params.kraken_config:
failed_post_scenarios = ""
try:
with open(params.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error("Error reading Kraken config from %s" %
params.kraken_config)
return "error", PodOutageErrorOutput(format_exc())
publish = True
for i in params.direction:
@@ -1106,7 +1048,7 @@ def pod_outage(
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
params.kraken_config, "", start_time, end_time
)
return "success", PodOutageSuccessOutput(
@@ -1116,8 +1058,7 @@ def pod_outage(
egress_ports=params.egress_ports,
)
except Exception as e:
logging.error(
"Pod network outage scenario exiting due to Exception - %s" % e)
logging.error("Pod network outage scenario exiting due to Exception - %s" % e)
return "error", PodOutageErrorOutput(format_exc())
finally:
logging.info("Deleting jobs(if any)")
@@ -1179,11 +1120,11 @@ class EgressParams:
},
)
kraken_config: typing.Optional[str] = field(
kraken_config: typing.Dict[str, typing.Any] = field(
default=None,
metadata={
"name": "Kraken Config",
"description": "Path to the config file of Kraken. "
"description": "Krkn config file dictionary "
"Set this field if you wish to publish status onto Cerberus",
},
)
@@ -1276,8 +1217,7 @@ class PodEgressNetShapingErrorOutput:
def pod_egress_shaping(
params: EgressParams,
) -> typing.Tuple[
str, typing.Union[PodEgressNetShapingSuccessOutput,
PodEgressNetShapingErrorOutput]
str, typing.Union[PodEgressNetShapingSuccessOutput, PodEgressNetShapingErrorOutput]
]:
"""
Function that performs egress pod traffic shaping based
@@ -1302,14 +1242,6 @@ def pod_egress_shaping(
publish = False
if params.kraken_config:
failed_post_scenarios = ""
try:
with open(params.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error("Error reading Kraken config from %s" %
params.kraken_config)
return "error", PodEgressNetShapingErrorOutput(format_exc())
publish = True
try:
@@ -1344,30 +1276,30 @@ def pod_egress_shaping(
for mod in mod_lst:
for node, ips in node_dict.items():
job_list.extend( apply_net_policy(
mod,
node,
ips,
job_template,
pod_module_template,
params.network_params,
params.test_duration,
br_name,
kubecli,
params.execution_type,
))
job_list.extend(
apply_net_policy(
mod,
node,
ips,
job_template,
pod_module_template,
params.network_params,
params.test_duration,
br_name,
kubecli,
params.execution_type,
)
)
if params.execution_type == "serial":
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
wait_for_job(job_list[:], kubecli,
params.test_duration + 20)
logging.info("Waiting for wait_duration %s" %
params.test_duration)
wait_for_job(job_list[:], kubecli, params.test_duration + 20)
logging.info("Waiting for wait_duration %s" % params.test_duration)
time.sleep(params.test_duration)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
params.kraken_config, "", start_time, end_time
)
if params.execution_type == "parallel":
break
@@ -1380,7 +1312,7 @@ def pod_egress_shaping(
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
params.kraken_config, "", start_time, end_time
)
return "success", PodEgressNetShapingSuccessOutput(
@@ -1389,8 +1321,7 @@ def pod_egress_shaping(
execution_type=params.execution_type,
)
except Exception as e:
logging.error(
"Pod network Shaping scenario exiting due to Exception - %s" % e)
logging.error("Pod network Shaping scenario exiting due to Exception - %s" % e)
return "error", PodEgressNetShapingErrorOutput(format_exc())
finally:
logging.info("Deleting jobs(if any)")
@@ -1452,7 +1383,7 @@ class IngressParams:
},
)
kraken_config: typing.Optional[str] = field(
kraken_config: typing.Dict[str, typing.Any] = field(
default=None,
metadata={
"name": "Kraken Config",
@@ -1549,8 +1480,8 @@ class PodIngressNetShapingErrorOutput:
def pod_ingress_shaping(
params: IngressParams,
) -> typing.Tuple[
str, typing.Union[PodIngressNetShapingSuccessOutput,
PodIngressNetShapingErrorOutput]
str,
typing.Union[PodIngressNetShapingSuccessOutput, PodIngressNetShapingErrorOutput],
]:
"""
Function that performs ingress pod traffic shaping based
@@ -1575,14 +1506,6 @@ def pod_ingress_shaping(
publish = False
if params.kraken_config:
failed_post_scenarios = ""
try:
with open(params.kraken_config, "r") as f:
config = yaml.full_load(f)
except Exception:
logging.error("Error reading Kraken config from %s" %
params.kraken_config)
return "error", PodIngressNetShapingErrorOutput(format_exc())
publish = True
try:
@@ -1617,30 +1540,30 @@ def pod_ingress_shaping(
for mod in mod_lst:
for node, ips in node_dict.items():
job_list.extend(apply_ingress_policy(
mod,
node,
ips,
job_template,
pod_module_template,
params.network_params,
params.test_duration,
br_name,
kubecli,
params.execution_type,
))
job_list.extend(
apply_ingress_policy(
mod,
node,
ips,
job_template,
pod_module_template,
params.network_params,
params.test_duration,
br_name,
kubecli,
params.execution_type,
)
)
if params.execution_type == "serial":
logging.info("Waiting for serial job to finish")
start_time = int(time.time())
wait_for_job(job_list[:], kubecli,
params.test_duration + 20)
logging.info("Waiting for wait_duration %s" %
params.test_duration)
wait_for_job(job_list[:], kubecli, params.test_duration + 20)
logging.info("Waiting for wait_duration %s" % params.test_duration)
time.sleep(params.test_duration)
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
params.kraken_config, "", start_time, end_time
)
if params.execution_type == "parallel":
break
@@ -1653,7 +1576,7 @@ def pod_ingress_shaping(
end_time = int(time.time())
if publish:
cerberus.publish_kraken_status(
config, failed_post_scenarios, start_time, end_time
params.kraken_config, "", start_time, end_time
)
return "success", PodIngressNetShapingSuccessOutput(
@@ -1662,14 +1585,9 @@ def pod_ingress_shaping(
execution_type=params.execution_type,
)
except Exception as e:
logging.error(
"Pod network Shaping scenario exiting due to Exception - %s" % e)
logging.error("Pod network Shaping scenario exiting due to Exception - %s" % e)
return "error", PodIngressNetShapingErrorOutput(format_exc())
finally:
delete_virtual_interfaces(
kubecli,
node_dict.keys(),
pod_module_template
)
delete_virtual_interfaces(kubecli, node_dict.keys(), pod_module_template)
logging.info("Deleting jobs(if any)")
delete_jobs(kubecli, job_list[:])

View File

@@ -42,19 +42,13 @@ class NetworkChaosScenarioPlugin(AbstractScenarioPlugin):
test_egress = get_yaml_item_value(
test_dict, "egress", {"bandwidth": "100mbit"}
)
if test_node:
node_name_list = test_node.split(",")
nodelst = common_node_functions.get_node_by_name(node_name_list, lib_telemetry.get_lib_kubernetes())
else:
node_name_list = [test_node]
nodelst = []
for single_node_name in node_name_list:
nodelst.extend(
common_node_functions.get_node(
single_node_name,
test_node_label,
test_instance_count,
lib_telemetry.get_lib_kubernetes(),
)
nodelst = common_node_functions.get_node(
test_node_label, test_instance_count, lib_telemetry.get_lib_kubernetes()
)
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__))
@@ -149,7 +143,10 @@ class NetworkChaosScenarioPlugin(AbstractScenarioPlugin):
finally:
logging.info("Deleting jobs")
self.delete_job(joblst[:], lib_telemetry.get_lib_kubernetes())
except (RuntimeError, Exception):
except (RuntimeError, Exception) as e:
logging.error(
"NetworkChaosScenarioPlugin exiting due to Exception %s" % e
)
scenario_telemetry.exit_status = 1
return 1
else:

View File

@@ -36,6 +36,20 @@ class abstract_node_scenarios:
self.helper_node_start_scenario(instance_kill_count, node, timeout)
logging.info("helper_node_stop_start_scenario has been successfully injected!")
# Node scenario to detach and attach the disk
def node_disk_detach_attach_scenario(self, instance_kill_count, node, timeout, duration):
logging.info("Starting disk_detach_attach_scenario injection")
disk_attachment_details = self.get_disk_attachment_info(instance_kill_count, node)
if disk_attachment_details:
self.disk_detach_scenario(instance_kill_count, node, timeout)
logging.info("Waiting for %s seconds before attaching the disk" % (duration))
time.sleep(duration)
self.disk_attach_scenario(instance_kill_count, disk_attachment_details, timeout)
logging.info("node_disk_detach_attach_scenario has been successfully injected!")
else:
logging.error("Node %s has only root disk attached" % (node))
logging.error("node_disk_detach_attach_scenario failed!")
# Node scenario to terminate the node
def node_termination_scenario(self, instance_kill_count, node, timeout):
pass

View File

@@ -12,7 +12,8 @@ from krkn_lib.k8s import KrknKubernetes
class AWS:
def __init__(self):
self.boto_client = boto3.client("ec2")
self.boto_instance = boto3.resource("ec2").Instance("id")
self.boto_resource = boto3.resource("ec2")
self.boto_instance = self.boto_resource.Instance("id")
# Get the instance ID of the node
def get_instance_id(self, node):
@@ -179,6 +180,72 @@ class AWS:
raise RuntimeError()
# Detach volume
def detach_volumes(self, volumes_ids: list):
for volume in volumes_ids:
try:
self.boto_client.detach_volume(VolumeId=volume, Force=True)
except Exception as e:
logging.error(
"Detaching volume %s failed with exception: %s"
% (volume, e)
)
# Attach volume
def attach_volume(self, attachment: dict):
try:
if self.get_volume_state(attachment["VolumeId"]) == "in-use":
logging.info(
"Volume %s is already in use." % attachment["VolumeId"]
)
return
logging.info(
"Attaching the %s volumes to instance %s."
% (attachment["VolumeId"], attachment["InstanceId"])
)
self.boto_client.attach_volume(
InstanceId=attachment["InstanceId"],
Device=attachment["Device"],
VolumeId=attachment["VolumeId"]
)
except Exception as e:
logging.error(
"Failed attaching disk %s to the %s instance. "
"Encountered following exception: %s"
% (attachment['VolumeId'], attachment['InstanceId'], e)
)
raise RuntimeError()
# Get IDs of node volumes
def get_volumes_ids(self, instance_id: list):
response = self.boto_client.describe_instances(InstanceIds=instance_id)
instance_attachment_details = response["Reservations"][0]["Instances"][0]["BlockDeviceMappings"]
root_volume_device_name = self.get_root_volume_id(instance_id)
volume_ids = []
for device in instance_attachment_details:
if device["DeviceName"] != root_volume_device_name:
volume_id = device["Ebs"]["VolumeId"]
volume_ids.append(volume_id)
return volume_ids
# Get volumes attachment details
def get_volume_attachment_details(self, volume_ids: list):
response = self.boto_client.describe_volumes(VolumeIds=volume_ids)
volumes_details = response["Volumes"]
return volumes_details
# Get root volume
def get_root_volume_id(self, instance_id):
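# Note: despite the name, this returns the instance's root *device name*
# (boto3 Instance.root_device_name); get_volumes_ids compares it against
# each block device mapping's DeviceName to skip the root disk.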
instance_id = instance_id[0]
instance = self.boto_resource.Instance(instance_id)
root_volume_id = instance.root_device_name
return root_volume_id
# Get volume state
def get_volume_state(self, volume_id: str):
volume = self.boto_resource.Volume(volume_id)
state = volume.state
return state
# krkn_lib
class aws_node_scenarios(abstract_node_scenarios):
@@ -290,3 +357,49 @@ class aws_node_scenarios(abstract_node_scenarios):
logging.error("node_reboot_scenario injection failed!")
raise RuntimeError()
# Get volume attachment info
def get_disk_attachment_info(self, instance_kill_count, node):
for _ in range(instance_kill_count):
try:
logging.info("Obtaining disk attachment information")
instance_id = (self.aws.get_instance_id(node)).split()
volumes_ids = self.aws.get_volumes_ids(instance_id)
if volumes_ids:
vol_attachment_details = self.aws.get_volume_attachment_details(
volumes_ids
)
return vol_attachment_details
return
except Exception as e:
logging.error(
"Failed to obtain disk attachment information of %s node. "
"Encounteres following exception: %s." % (node, e)
)
raise RuntimeError()
# Node scenario to detach the volume
def disk_detach_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting disk_detach_scenario injection")
instance_id = (self.aws.get_instance_id(node)).split()
volumes_ids = self.aws.get_volumes_ids(instance_id)
logging.info(
"Detaching the %s volumes from instance %s "
% (volumes_ids, node)
)
self.aws.detach_volumes(volumes_ids)
except Exception as e:
logging.error(
"Failed to detach disk from %s node. Encountered following"
"exception: %s." % (node, e)
)
logging.debug("")
raise RuntimeError()
# Node scenario to attach the volume
def disk_attach_scenario(self, instance_kill_count, attachment_details, timeout):
for _ in range(instance_kill_count):
for attachment in attachment_details:
self.aws.attach_volume(attachment["Attachments"][0])

View File

@@ -8,19 +8,28 @@ from krkn_lib.k8s import KrknKubernetes
node_general = False
def get_node_by_name(node_name_list, kubecli: KrknKubernetes):
killable_nodes = kubecli.list_killable_nodes()
for node_name in node_name_list:
if node_name not in killable_nodes:
logging.info(
f"Node with provided ${node_name} does not exist or the node might "
"be in NotReady state."
)
return
return node_name_list
# Pick a random node with specified label selector
def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubernetes):
if node_name in kubecli.list_killable_nodes():
return [node_name]
elif node_name:
logging.info(
"Node with provided node_name does not exist or the node might "
"be in NotReady state."
)
nodes = kubecli.list_killable_nodes(label_selector)
def get_node(label_selector, instance_kill_count, kubecli: KrknKubernetes):
label_selector_list = label_selector.split(",")
nodes = []
for label_selector in label_selector_list:
nodes.extend(kubecli.list_killable_nodes(label_selector))
if not nodes:
raise Exception("Ready nodes with the provided label selector do not exist")
logging.info("Ready nodes with the label selector %s: %s" % (label_selector, nodes))
logging.info("Ready nodes with the label selector %s: %s" % (label_selector_list, nodes))
number_of_nodes = len(nodes)
if instance_kill_count == number_of_nodes:
return nodes
@@ -35,22 +44,19 @@ def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubern
# krkn_lib
# Wait until the node status becomes Ready
def wait_for_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "True", timeout, resource_version)
kubecli.watch_node_status(node, "True", timeout)
# krkn_lib
# Wait until the node status becomes Not Ready
def wait_for_not_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "False", timeout, resource_version)
kubecli.watch_node_status(node, "False", timeout)
# krkn_lib
# Wait until the node status becomes Unknown
def wait_for_unknown_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "Unknown", timeout, resource_version)
kubecli.watch_node_status(node, "Unknown", timeout)
# Get the ip of the cluster node

View File

@@ -1,66 +1,78 @@
import os
import sys
import time
import logging
import json
import google.auth
import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
abstract_node_scenarios,
)
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
from google.cloud import compute_v1
from krkn_lib.k8s import KrknKubernetes
class GCP:
def __init__(self):
try:
gapp_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
with open(gapp_creds, "r") as f:
f_str = f.read()
self.project = json.loads(f_str)["project_id"]
# self.project = runcommand.invoke("gcloud config get-value project").split("/n")[0].strip()
logging.info("project " + str(self.project) + "!")
credentials = GoogleCredentials.get_application_default()
self.client = discovery.build(
"compute", "v1", credentials=credentials, cache_discovery=False
)
_, self.project_id = google.auth.default()
self.instance_client = compute_v1.InstancesClient()
except Exception as e:
logging.error("Error on setting up GCP connection: " + str(e))
raise e
# Get the instance ID of the node
def get_instance_id(self, node):
zone_request = self.client.zones().list(project=self.project)
while zone_request is not None:
zone_response = zone_request.execute()
for zone in zone_response["items"]:
instances_request = self.client.instances().list(
project=self.project, zone=zone["name"]
)
while instances_request is not None:
instance_response = instances_request.execute()
if "items" in instance_response.keys():
for instance in instance_response["items"]:
if instance["name"] in node:
return instance["name"], zone["name"]
instances_request = self.client.zones().list_next(
previous_request=instances_request,
previous_response=instance_response,
)
zone_request = self.client.zones().list_next(
previous_request=zone_request, previous_response=zone_response
# Get the instance of the node
def get_node_instance(self, node):
try:
request = compute_v1.AggregatedListInstancesRequest(
project=self.project_id
)
logging.info("no instances ")
agg_list = self.instance_client.aggregated_list(request=request)
for _, response in agg_list:
if response.instances:
for instance in response.instances:
if instance.name in node:
return instance
logging.info("no instances ")
except Exception as e:
logging.error("Error getting the instance of the node: " + str(e))
raise e
# Get the instance name
def get_instance_name(self, instance):
if instance.name:
return instance.name
# Get the instance zone
def get_instance_zone(self, instance):
if instance.zone:
return instance.zone.split("/")[-1]
# Get the instance zone of the node
def get_node_instance_zone(self, node):
instance = self.get_node_instance(node)
if instance:
return self.get_instance_zone(instance)
# Get the instance name of the node
def get_node_instance_name(self, node):
instance = self.get_node_instance(node)
if instance:
return self.get_instance_name(instance)
# Get the instance name of the node
def get_instance_id(self, node):
return self.get_node_instance_name(node)
# Start the node instance
def start_instances(self, zone, instance_id):
def start_instances(self, instance_id):
try:
self.client.instances().start(
project=self.project, zone=zone, instance=instance_id
).execute()
logging.info("vm name " + str(instance_id) + " started")
request = compute_v1.StartInstanceRequest(
instance=instance_id,
project=self.project_id,
zone=self.get_node_instance_zone(instance_id),
)
self.instance_client.start(request=request)
logging.info("Instance: " + str(instance_id) + " started")
except Exception as e:
logging.error(
"Failed to start node instance %s. Encountered following "
@@ -70,12 +82,15 @@ class GCP:
raise RuntimeError()
# Stop the node instance
def stop_instances(self, zone, instance_id):
def stop_instances(self, instance_id):
try:
self.client.instances().stop(
project=self.project, zone=zone, instance=instance_id
).execute()
logging.info("vm name " + str(instance_id) + " stopped")
request = compute_v1.StopInstanceRequest(
instance=instance_id,
project=self.project_id,
zone=self.get_node_instance_zone(instance_id),
)
self.instance_client.stop(request=request)
logging.info("Instance: " + str(instance_id) + " stopped")
except Exception as e:
logging.error(
"Failed to stop node instance %s. Encountered following "
@@ -84,13 +99,16 @@ class GCP:
raise RuntimeError()
# Start the node instance
def suspend_instances(self, zone, instance_id):
# Suspend the node instance
def suspend_instances(self, instance_id):
try:
self.client.instances().suspend(
project=self.project, zone=zone, instance=instance_id
).execute()
logging.info("vm name " + str(instance_id) + " suspended")
request = compute_v1.SuspendInstanceRequest(
instance=instance_id,
project=self.project_id,
zone=self.get_node_instance_zone(instance_id),
)
self.instance_client.suspend(request=request)
logging.info("Instance: " + str(instance_id) + " suspended")
except Exception as e:
logging.error(
"Failed to suspend node instance %s. Encountered following "
@@ -100,49 +118,65 @@ class GCP:
raise RuntimeError()
# Terminate the node instance
def terminate_instances(self, zone, instance_id):
def terminate_instances(self, instance_id):
try:
self.client.instances().delete(
project=self.project, zone=zone, instance=instance_id
).execute()
logging.info("vm name " + str(instance_id) + " terminated")
request = compute_v1.DeleteInstanceRequest(
instance=instance_id,
project=self.project_id,
zone=self.get_node_instance_zone(instance_id),
)
self.instance_client.delete(request=request)
logging.info("Instance: " + str(instance_id) + " terminated")
except Exception as e:
logging.error(
"Failed to start node instance %s. Encountered following "
"Failed to terminate node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
raise RuntimeError()
# Reboot the node instance
def reboot_instances(self, zone, instance_id):
def reboot_instances(self, instance_id):
try:
self.client.instances().reset(
project=self.project, zone=zone, instance=instance_id
).execute()
logging.info("vm name " + str(instance_id) + " rebooted")
request = compute_v1.ResetInstanceRequest(
instance=instance_id,
project=self.project_id,
zone=self.get_node_instance_zone(instance_id),
)
self.instance_client.reset(request=request)
logging.info("Instance: " + str(instance_id) + " rebooted")
except Exception as e:
logging.error(
"Failed to start node instance %s. Encountered following "
"Failed to reboot node instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
raise RuntimeError()
# Get instance status
def get_instance_status(self, zone, instance_id, expected_status, timeout):
# statuses: PROVISIONING, STAGING, RUNNING, STOPPING, SUSPENDING, SUSPENDED, REPAIRING,
def get_instance_status(self, instance_id, expected_status, timeout):
# states: PROVISIONING, STAGING, RUNNING, STOPPING, SUSPENDING, SUSPENDED, REPAIRING,
# and TERMINATED.
i = 0
sleeper = 5
while i <= timeout:
instStatus = (
self.client.instances()
.get(project=self.project, zone=zone, instance=instance_id)
.execute()
)
logging.info("Status of vm " + str(instStatus["status"]))
if instStatus["status"] == expected_status:
try:
request = compute_v1.GetInstanceRequest(
instance=instance_id,
project=self.project_id,
zone=self.get_node_instance_zone(instance_id),
)
instance_status = self.instance_client.get(request=request).status
logging.info("Status of instance " + str(instance_id) + ": " + instance_status)
except Exception as e:
logging.error(
"Failed to get status of instance %s. Encountered following "
"exception: %s." % (instance_id, e)
)
raise RuntimeError()
if instance_status == expected_status:
return True
time.sleep(sleeper)
i += sleeper
@@ -153,33 +187,21 @@ class GCP:
return False
# Wait until the node instance is suspended
def wait_until_suspended(self, zone, instance_id, timeout):
return self.get_instance_status(zone, instance_id, "SUSPENDED", timeout)
def wait_until_suspended(self, instance_id, timeout):
return self.get_instance_status(instance_id, "SUSPENDED", timeout)
# Wait until the node instance is running
def wait_until_running(self, zone, instance_id, timeout):
return self.get_instance_status(zone, instance_id, "RUNNING", timeout)
def wait_until_running(self, instance_id, timeout):
return self.get_instance_status(instance_id, "RUNNING", timeout)
# Wait until the node instance is stopped
def wait_until_stopped(self, zone, instance_id, timeout):
return self.get_instance_status(zone, instance_id, "TERMINATED", timeout)
def wait_until_stopped(self, instance_id, timeout):
# In GCP, the next state after STOPPING is TERMINATED
return self.get_instance_status(instance_id, "TERMINATED", timeout)
# Wait until the node instance is terminated
def wait_until_terminated(self, zone, instance_id, timeout):
try:
i = 0
sleeper = 5
while i <= timeout:
instStatus = (
self.client.instances()
.get(project=self.project, zone=zone, instance=instance_id)
.execute()
)
logging.info("Status of vm " + str(instStatus["status"]))
time.sleep(sleeper)
except Exception as e:
logging.info("here " + str(e))
return True
def wait_until_terminated(self, instance_id, timeout):
return self.get_instance_status(instance_id, "TERMINATED", timeout)
# krkn_lib
@@ -193,12 +215,13 @@ class gcp_node_scenarios(abstract_node_scenarios):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_start_scenario injection")
instance_id, zone = self.gcp.get_instance_id(node)
instance = self.gcp.get_node_instance(node)
instance_id = self.gcp.get_instance_name(instance)
logging.info(
"Starting the node %s with instance ID: %s " % (node, instance_id)
)
self.gcp.start_instances(zone, instance_id)
self.gcp.wait_until_running(zone, instance_id, timeout)
self.gcp.start_instances(instance_id)
self.gcp.wait_until_running(instance_id, timeout)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
logging.info(
"Node with instance ID: %s is in running state" % instance_id
@@ -215,16 +238,16 @@ class gcp_node_scenarios(abstract_node_scenarios):
# Node scenario to stop the node
def node_stop_scenario(self, instance_kill_count, node, timeout):
logging.info("stop scenario")
for _ in range(instance_kill_count):
try:
logging.info("Starting node_stop_scenario injection")
instance_id, zone = self.gcp.get_instance_id(node)
instance = self.gcp.get_node_instance(node)
instance_id = self.gcp.get_instance_name(instance)
logging.info(
"Stopping the node %s with instance ID: %s " % (node, instance_id)
)
self.gcp.stop_instances(zone, instance_id)
self.gcp.wait_until_stopped(zone, instance_id, timeout)
self.gcp.stop_instances(instance_id)
self.gcp.wait_until_stopped(instance_id, timeout)
logging.info(
"Node with instance ID: %s is in stopped state" % instance_id
)
@@ -243,13 +266,14 @@ class gcp_node_scenarios(abstract_node_scenarios):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_termination_scenario injection")
instance_id, zone = self.gcp.get_instance_id(node)
instance = self.gcp.get_node_instance(node)
instance_id = self.gcp.get_instance_name(instance)
logging.info(
"Terminating the node %s with instance ID: %s "
% (node, instance_id)
)
self.gcp.terminate_instances(zone, instance_id)
self.gcp.wait_until_terminated(zone, instance_id, timeout)
self.gcp.terminate_instances(instance_id)
self.gcp.wait_until_terminated(instance_id, timeout)
for _ in range(timeout):
if node not in self.kubecli.list_nodes():
break
@@ -267,19 +291,20 @@ class gcp_node_scenarios(abstract_node_scenarios):
)
logging.error("node_termination_scenario injection failed!")
raise e
raise RuntimeError()
# Node scenario to reboot the node
def node_reboot_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting node_reboot_scenario injection")
instance_id, zone = self.gcp.get_instance_id(node)
instance = self.gcp.get_node_instance(node)
instance_id = self.gcp.get_instance_name(instance)
logging.info(
"Rebooting the node %s with instance ID: %s " % (node, instance_id)
)
self.gcp.reboot_instances(zone, instance_id)
self.gcp.reboot_instances(instance_id)
self.gcp.wait_until_running(instance_id, timeout)
nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
logging.info(
"Node with instance ID: %s has been rebooted" % instance_id

View File

@@ -1,5 +1,7 @@
import logging
import time
from multiprocessing.pool import ThreadPool
from itertools import repeat
import yaml
from krkn_lib.k8s import KrknKubernetes
@@ -64,23 +66,23 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
global node_general
node_general = True
return general_node_scenarios(kubecli)
if node_scenario["cloud_type"] == "aws":
if node_scenario["cloud_type"].lower() == "aws":
return aws_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "gcp":
elif node_scenario["cloud_type"].lower() == "gcp":
return gcp_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "openstack":
elif node_scenario["cloud_type"].lower() == "openstack":
from krkn.scenario_plugins.node_actions.openstack_node_scenarios import (
openstack_node_scenarios,
)
return openstack_node_scenarios(kubecli)
elif (
node_scenario["cloud_type"] == "azure"
node_scenario["cloud_type"].lower() == "azure"
or node_scenario["cloud_type"] == "az"
):
return azure_node_scenarios(kubecli)
elif (
node_scenario["cloud_type"] == "alibaba"
node_scenario["cloud_type"].lower() == "alibaba"
or node_scenario["cloud_type"] == "alicloud"
):
from krkn.scenario_plugins.node_actions.alibaba_node_scenarios import (
@@ -88,7 +90,7 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
)
return alibaba_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "bm":
elif node_scenario["cloud_type"].lower() == "bm":
from krkn.scenario_plugins.node_actions.bm_node_scenarios import (
bm_node_scenarios,
)
@@ -99,7 +101,7 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
node_scenario.get("bmc_password", None),
kubecli,
)
elif node_scenario["cloud_type"] == "docker":
elif node_scenario["cloud_type"].lower() == "docker":
return docker_node_scenarios(kubecli)
else:
logging.error(
@@ -120,100 +122,131 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
def inject_node_scenario(
self, action, node_scenario, node_scenario_object, kubecli: KrknKubernetes
):
generic_cloud_scenarios = ("stop_kubelet_scenario", "node_crash_scenario")
# Get the node scenario configurations
run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)
# Get the node scenario configurations for setting nodes
instance_kill_count = get_yaml_item_value(node_scenario, "instance_count", 1)
node_name = get_yaml_item_value(node_scenario, "node_name", "")
label_selector = get_yaml_item_value(node_scenario, "label_selector", "")
if action == "node_stop_start_scenario":
parallel_nodes = get_yaml_item_value(node_scenario, "parallel", False)
# Get the node to apply the scenario
if node_name:
node_name_list = node_name.split(",")
nodes = common_node_functions.get_node_by_name(node_name_list, kubecli)
else:
nodes = common_node_functions.get_node(
label_selector, instance_kill_count, kubecli
)
# The GCP API doesn't support multiprocessing calls; running in parallel would only act on one node
if parallel_nodes and node_scenario['cloud_type'].lower() != "gcp":
self.multiprocess_nodes(nodes, node_scenario_object, action, node_scenario)
else:
for single_node in nodes:
self.run_node(single_node, node_scenario_object, action, node_scenario)
def multiprocess_nodes(self, nodes, node_scenario_object, action, node_scenario):
try:
logging.info("parallely call to nodes")
# pool object with number of element
pool = ThreadPool(processes=len(nodes))
pool.starmap(self.run_node,zip(nodes, repeat(node_scenario_object), repeat(action), repeat(node_scenario)))
pool.close()
except Exception as e:
logging.info("Error on pool multiprocessing: " + str(e))
def run_node(self, single_node, node_scenario_object, action, node_scenario):
logging.info("action" + str(action))
# Get the scenario specifics for running action nodes
run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)
if action in ("node_stop_start_scenario", "node_disk_detach_attach_scenario"):
duration = get_yaml_item_value(node_scenario, "duration", 120)
timeout = get_yaml_item_value(node_scenario, "timeout", 120)
service = get_yaml_item_value(node_scenario, "service", "")
ssh_private_key = get_yaml_item_value(
node_scenario, "ssh_private_key", "~/.ssh/id_rsa"
)
# Get the node to apply the scenario
if node_name:
node_name_list = node_name.split(",")
else:
node_name_list = [node_name]
for single_node_name in node_name_list:
nodes = common_node_functions.get_node(
single_node_name, label_selector, instance_kill_count, kubecli
generic_cloud_scenarios = ("stop_kubelet_scenario", "node_crash_scenario")
if node_general and action not in generic_cloud_scenarios:
logging.info(
"Scenario: "
+ action
+ " is not set up for generic cloud type, skipping action"
)
for single_node in nodes:
if node_general and action not in generic_cloud_scenarios:
logging.info(
"Scenario: "
+ action
+ " is not set up for generic cloud type, skipping action"
else:
if action == "node_start_scenario":
node_scenario_object.node_start_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_scenario":
node_scenario_object.node_stop_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_start_scenario":
node_scenario_object.node_stop_start_scenario(
run_kill_count, single_node, timeout, duration
)
elif action == "node_termination_scenario":
node_scenario_object.node_termination_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_reboot_scenario":
node_scenario_object.node_reboot_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_disk_detach_attach_scenario":
node_scenario_object.node_disk_detach_attach_scenario(
run_kill_count, single_node, timeout, duration)
elif action == "stop_start_kubelet_scenario":
node_scenario_object.stop_start_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "restart_kubelet_scenario":
node_scenario_object.restart_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_kubelet_scenario":
node_scenario_object.stop_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_crash_scenario":
node_scenario_object.node_crash_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_helper_node_scenario":
if node_scenario["cloud_type"] != "openstack":
logging.error(
"Scenario: " + action + " is not supported for "
"cloud type "
+ node_scenario["cloud_type"]
+ ", skipping action"
)
else:
if action == "node_start_scenario":
node_scenario_object.node_start_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_scenario":
node_scenario_object.node_stop_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_start_scenario":
node_scenario_object.node_stop_start_scenario(
run_kill_count, single_node, timeout, duration
)
elif action == "node_termination_scenario":
node_scenario_object.node_termination_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_reboot_scenario":
node_scenario_object.node_reboot_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_kubelet_scenario":
node_scenario_object.stop_start_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "restart_kubelet_scenario":
node_scenario_object.restart_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_kubelet_scenario":
node_scenario_object.stop_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_crash_scenario":
node_scenario_object.node_crash_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_helper_node_scenario":
if node_scenario["cloud_type"] != "openstack":
logging.error(
"Scenario: " + action + " is not supported for "
"cloud type "
+ node_scenario["cloud_type"]
+ ", skipping action"
)
else:
if not node_scenario["helper_node_ip"]:
logging.error("Helper node IP address is not provided")
raise Exception(
"Helper node IP address is not provided"
)
node_scenario_object.helper_node_stop_start_scenario(
run_kill_count, node_scenario["helper_node_ip"], timeout
)
node_scenario_object.helper_node_service_status(
node_scenario["helper_node_ip"],
service,
ssh_private_key,
timeout,
)
else:
logging.info(
"There is no node action that matches %s, skipping scenario"
% action
if not node_scenario["helper_node_ip"]:
logging.error("Helper node IP address is not provided")
raise Exception(
"Helper node IP address is not provided"
)
node_scenario_object.helper_node_stop_start_scenario(
run_kill_count, node_scenario["helper_node_ip"], timeout
)
node_scenario_object.helper_node_service_status(
node_scenario["helper_node_ip"],
service,
ssh_private_key,
timeout,
)
else:
logging.info(
"There is no node action that matches %s, skipping scenario"
% action
)
def get_scenario_types(self) -> list[str]:
return ["node_scenarios"]

View File

@@ -29,6 +29,9 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
pvc_name = get_yaml_item_value(scenario_config, "pvc_name", "")
pod_name = get_yaml_item_value(scenario_config, "pod_name", "")
namespace = get_yaml_item_value(scenario_config, "namespace", "")
block_size = get_yaml_item_value(
scenario_config, "block_size", "102400"
)
target_fill_percentage = get_yaml_item_value(
scenario_config, "fill_percentage", "50"
)
@@ -176,10 +179,39 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
start_time = int(time.time())
# Create temp file in the PVC
full_path = "%s/%s" % (str(mount_path), str(file_name))
command = "fallocate -l $((%s*1024)) %s" % (
str(file_size_kb),
str(full_path),
fallocate = lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
["command -v fallocate"],
pod_name,
namespace,
container_name,
)
dd = lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
["command -v dd"],
pod_name,
namespace,
container_name,
)
if fallocate:
command = "fallocate -l $((%s*1024)) %s" % (
str(file_size_kb),
str(full_path),
)
elif dd is not None:
block_size = int(block_size)
blocks = int(file_size_kb / int(block_size / 1024))
logging.warning(
"fallocate not found, using dd, it may take longer based on the amount of data, please wait..."
)
command = f"dd if=/dev/urandom of={str(full_path)} bs={str(block_size)} count={str(blocks)} oflag=direct"
else:
logging.error(
"failed to locate required binaries fallocate or dd to execute the scenario"
)
return 1
logging.debug("Create temp file in the PVC command:\n %s" % command)
lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
[command],
@@ -214,45 +246,6 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
)
return 1
# Calculate file size
file_size_kb = int(
(float(target_fill_percentage / 100) * float(pvc_capacity_kb))
- float(pvc_used_kb)
)
logging.debug("File size: %s KB" % file_size_kb)
file_name = "kraken.tmp"
logging.info(
"Creating %s file, %s KB size, in pod %s at %s (ns %s)"
% (
str(file_name),
str(file_size_kb),
str(pod_name),
str(mount_path),
str(namespace),
)
)
start_time = int(time.time())
# Create temp file in the PVC
full_path = "%s/%s" % (str(mount_path), str(file_name))
command = "fallocate -l $((%s*1024)) %s" % (
str(file_size_kb),
str(full_path),
)
logging.debug("Create temp file in the PVC command:\n %s" % command)
lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
[command], pod_name, namespace, container_name
)
# Check if file is created
command = "ls -lh %s" % (str(mount_path))
logging.debug("Check file is created command:\n %s" % command)
response = lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
[command], pod_name, namespace, container_name
)
logging.info("\n" + str(response))
if str(file_name).lower() in str(response).lower():
logging.info(
"Waiting for the specified duration in the config: %ss" % duration
)

View File

@@ -13,6 +13,7 @@ from krkn.scenario_plugins.node_actions.aws_node_scenarios import AWS
from krkn.scenario_plugins.node_actions.az_node_scenarios import Azure
from krkn.scenario_plugins.node_actions.gcp_node_scenarios import GCP
from krkn.scenario_plugins.node_actions.openstack_node_scenarios import OPENSTACKCLOUD
from krkn.scenario_plugins.native.node_scenarios.ibmcloud_plugin import IbmCloud
class ShutDownScenarioPlugin(AbstractScenarioPlugin):
@@ -86,6 +87,8 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
cloud_object = OPENSTACKCLOUD()
elif cloud_type.lower() in ["azure", "az"]:
cloud_object = Azure()
elif cloud_type.lower() in ["ibm", "ibmcloud"]:
cloud_object = IbmCloud()
else:
logging.error(
"Cloud type %s is not currently supported for cluster shut down"

View File

@@ -29,6 +29,8 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
subnet_ids = scenario_config["subnet_id"]
duration = scenario_config["duration"]
cloud_type = scenario_config["cloud_type"]
# Add support for user-provided default network ACL
default_acl_id = scenario_config.get("default_acl_id")
ids = {}
acl_ids_created = []
@@ -58,7 +60,20 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
"Network association ids associated with "
"the subnet %s: %s" % (subnet_id, network_association_ids)
)
acl_id = cloud_object.create_default_network_acl(vpc_id)
# Use provided default ACL if available, otherwise create a new one
if default_acl_id:
acl_id = default_acl_id
logging.info(
"Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
default_acl_id
)
# Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
else:
acl_id = cloud_object.create_default_network_acl(vpc_id)
logging.info("Created new default ACL %s", acl_id)
acl_ids_created.append(acl_id)
new_association_id = cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
@@ -66,7 +81,6 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
# capture the original_acl_id, created_acl_id and
# new association_id to use during the recovery
ids[new_association_id] = original_acl_id
acl_ids_created.append(acl_id)
# wait for the specified duration
logging.info(

View File

@@ -11,15 +11,15 @@ coverage==7.4.1
datetime==5.4
docker==7.0.0
gitpython==3.1.41
google-api-python-client==2.116.0
google-auth==2.37.0
google-cloud-compute==1.22.0
ibm_cloud_sdk_core==3.18.0
ibm_vpc==0.20.0
jinja2==3.1.4
krkn-lib==4.0.3
krkn-lib==4.0.4
lxml==5.1.0
kubernetes==28.1.0
numpy==1.26.4
oauth2client==4.1.3
pandas==2.2.0
openshift-client==1.0.21
paramiko==3.4.0
@@ -32,7 +32,7 @@ requests==2.32.2
service_identity==24.1.0
PyYAML==6.0.1
setuptools==70.0.0
werkzeug==3.0.3
werkzeug==3.0.6
wheel==0.42.0
zope.interface==5.4.0

View File

@@ -627,7 +627,7 @@ if __name__ == "__main__":
junit_testcase_xml = get_junit_test_case(
success=True if retval == 0 else False,
time=int(junit_endtime - junit_start_time),
test_suite_name="krkn-test-suite",
test_suite_name="chaos-krkn",
test_case_description=options.junit_testcase,
test_stdout=tee_handler.get_output(),
test_version=options.junit_testcase_version,

View File

@@ -1,13 +1,14 @@
node_scenarios:
- actions: # node chaos scenarios to be injected
- actions: # node chaos scenarios to be injected
- node_stop_start_scenario
node_name: # node on which scenario has to be injected; can set multiple names separated by comma
label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
runs: 1 # number of times to inject each scenario under actions (will perform on same node each time)
timeout: 360 # duration to wait for completion of node scenario injection
duration: 120 # duration to stop the node before running the start action
cloud_type: aws # cloud type on which Kubernetes/OpenShift runs
node_name: # node on which scenario has to be injected; can set multiple names separated by comma
label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection; can specify multiple by a comma separated list
instance_count: 2 # Number of nodes to perform action/select that match the label selector
runs: 1 # number of times to inject each scenario under actions (will perform on same node each time)
timeout: 360 # duration to wait for completion of node scenario injection
duration: 20 # duration to stop the node before running the start action
cloud_type: aws # cloud type on which Kubernetes/OpenShift runs
parallel: true # Run action on label or node name in parallel or sequentially, defaults to sequential
- actions:
- node_reboot_scenario
node_name:
@@ -15,3 +16,10 @@ node_scenarios:
instance_count: 1
timeout: 120
cloud_type: aws
- actions:
- node_disk_detach_attach_scenario
node_name:
label_selector:
instance_count: 1
timeout: 120
cloud_type: aws

View File

@@ -4,3 +4,4 @@ pvc_scenario:
namespace: <namespace_name> # Namespace where the PVC is
fill_percentage: 50 # Target percentage to fill up the cluster, value must be higher than current percentage, valid values are between 0 and 99
duration: 60 # Duration in seconds for the fault
block_size: 102400 # used only by dd if fallocate not present in the container

View File

@@ -3,3 +3,4 @@ zone_outage: # Scenario to create an out
duration: 600 # duration in seconds after which the zone will be back online
vpc_id: # cluster virtual private network to target
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic
default_acl_id: acl-xxxxxxxx # (Optional) ID of an existing network ACL to use instead of creating a new one. If provided, this ACL will not be deleted after the scenario.

View File

@@ -0,0 +1,304 @@
# OpenShift Shenanigans
## Workflow Description
Given a target OpenShift cluster, this workflow executes a
[kube-burner plugin](https://github.com/redhat-performance/arcaflow-plugin-kube-burner)
workflow to place a load on the cluster, repeatedly removes a targeted pod at a given time frequency with the [kill-pod plugin](https://github.com/krkn-chaos/arcaflow-plugin-kill-pod),
and runs a [stress-ng](https://github.com/ColinIanKing/stress-ng) CPU workload on the cluster.
Target your OpenShift cluster with the appropriate `kubeconfig` file, and add its file path as
the value for `kubernetes_target.kubeconfig_path` in the input file. Any combination of subworkflows can be disabled in the input file by setting `cpu_hog_enabled`, `pod_chaos_enabled`, or `kubeburner_enabled` to `false`, as in the fragment below.
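For example, a minimal `input.yaml` fragment toggling the subworkflows could look like the following; the keys match the input schema in [`workflow.yaml`](workflow.yaml), and the values are only illustrative:
```yaml
# Enable or disable individual subworkflows (illustrative values)
cpu_hog_enabled: true
pod_chaos_enabled: true
kubeburner_enabled: false   # skip the kube-burner load on this run
```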
## Files
- [`workflow.yaml`](workflow.yaml) -- Defines the workflow input schema, the plugins to run
and their data relationships, and the output to present to the user
- [`input.yaml`](input.yaml) -- The input parameters that the user provides for running
the workflow
- [`config.yaml`](config.yaml) -- Global config parameters that are passed to the Arcaflow
engine
- [`cpu-hog.yaml`](subworkflows/cpu-hog.yaml) -- The StressNG workload on the CPU.
- [`kubeburner.yaml`](subworkflows/kubeburner.yaml) -- The KubeBurner workload for the Kubernetes Cluster API.
- [`pod-chaos.yaml`](subworkflows/pod-chaos.yaml) -- The Kill Pod workflow for the Kubernetes infrastructure pods.
## Running the Workflow
### Workflow Dependencies
Install Python `3.9` or later.
First, add the path to your Python interpreter to `config.yaml` as the value
for `pythonPath`, as shown here. A common choice on Linux distributions is
`/usr/bin/python`. Second, add a directory to which your Arcaflow process
has write access as the value for `workdir`; `/tmp` is a common choice because your process can usually write to it.
```yaml
deployers:
python:
pythonPath: /usr/bin/python
workdir: /tmp
```
To use this Python interpreter with our `kill-pod` plugin, go to the `deploy` section of the `kill_pod` step in [`pod-chaos.yaml`](subworkflows/pod-chaos.yaml). You can use the same `pythonPath` and `workdir` that you used in
your `config.yaml`.
```yaml
deploy:
deployer_name: python
modulePullPolicy: Always
pythonPath: /usr/bin/python
workdir: /tmp
```
Download a Go binary of the latest version of the Arcaflow engine from: https://github.com/arcalot/arcaflow-engine/releases.
#### OpenShift Target
Target your desired OpenShift cluster by setting the `kubeconfig_path` variable for each subworkflow's parameter list in [`input.yaml`](input.yaml).
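A minimal sketch of that setting in [`input.yaml`](input.yaml), with a placeholder path you would replace with your own `kubeconfig` location:
```yaml
kubernetes_target:
  kubeconfig_path: /home/user/.kube/config   # placeholder path
```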
#### Kube-Burner Plugin
The `kube-burner` plugin generates and reports the UUID with which the
`kube-burner` data is associated in your search database. The `uuidgen`
workflow step uses the `uuid` step of the `arcaflow-plugin-utilities` plugin
to generate a random UUID for you, as shown below.
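For reference, the `uuidgen` step as defined in [`kubeburner.yaml`](subworkflows/kubeburner.yaml) looks roughly like this, and its output is consumed by the `kube-burner` step as `$.steps.uuidgen.outputs.success.uuid`:
```yaml
uuidgen:
  plugin:
    deployment_type: image
    src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.0
    step: uuid
  input: {}
```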
### Workflow Execution
Run the workflow:
```
$ export WFPATH=<path to this workflow directory>
$ arcaflow --context ${WFPATH} --input input.yaml --config config.yaml --workflow workflow.yaml
```
## Workflow Diagram
This diagram shows the complete end-to-end workflow logic.
### Main Workflow
```mermaid
%% Mermaid markdown workflow
flowchart LR
%% Success path
input-->steps.cpu_hog_wf.enabling
input-->steps.cpu_hog_wf.execute
input-->steps.kubeburner_wf.enabling
input-->steps.kubeburner_wf.execute
input-->steps.pod_chaos_wf.enabling
input-->steps.pod_chaos_wf.execute
outputs.workflow_success.cpu_hog-->outputs.workflow_success
outputs.workflow_success.cpu_hog.disabled-->outputs.workflow_success.cpu_hog
outputs.workflow_success.cpu_hog.enabled-->outputs.workflow_success.cpu_hog
outputs.workflow_success.kubeburner-->outputs.workflow_success
outputs.workflow_success.kubeburner.disabled-->outputs.workflow_success.kubeburner
outputs.workflow_success.kubeburner.enabled-->outputs.workflow_success.kubeburner
outputs.workflow_success.pod_chaos-->outputs.workflow_success
outputs.workflow_success.pod_chaos.disabled-->outputs.workflow_success.pod_chaos
outputs.workflow_success.pod_chaos.enabled-->outputs.workflow_success.pod_chaos
steps.cpu_hog_wf.closed-->steps.cpu_hog_wf.closed.result
steps.cpu_hog_wf.disabled-->steps.cpu_hog_wf.disabled.output
steps.cpu_hog_wf.disabled.output-->outputs.workflow_success.cpu_hog.disabled
steps.cpu_hog_wf.enabling-->steps.cpu_hog_wf.closed
steps.cpu_hog_wf.enabling-->steps.cpu_hog_wf.disabled
steps.cpu_hog_wf.enabling-->steps.cpu_hog_wf.enabling.resolved
steps.cpu_hog_wf.enabling-->steps.cpu_hog_wf.execute
steps.cpu_hog_wf.execute-->steps.cpu_hog_wf.outputs
steps.cpu_hog_wf.outputs-->steps.cpu_hog_wf.outputs.success
steps.cpu_hog_wf.outputs.success-->outputs.workflow_success.cpu_hog.enabled
steps.kubeburner_wf.closed-->steps.kubeburner_wf.closed.result
steps.kubeburner_wf.disabled-->steps.kubeburner_wf.disabled.output
steps.kubeburner_wf.disabled.output-->outputs.workflow_success.kubeburner.disabled
steps.kubeburner_wf.enabling-->steps.kubeburner_wf.closed
steps.kubeburner_wf.enabling-->steps.kubeburner_wf.disabled
steps.kubeburner_wf.enabling-->steps.kubeburner_wf.enabling.resolved
steps.kubeburner_wf.enabling-->steps.kubeburner_wf.execute
steps.kubeburner_wf.execute-->steps.kubeburner_wf.outputs
steps.kubeburner_wf.outputs-->steps.kubeburner_wf.outputs.success
steps.kubeburner_wf.outputs.success-->outputs.workflow_success.kubeburner.enabled
steps.pod_chaos_wf.closed-->steps.pod_chaos_wf.closed.result
steps.pod_chaos_wf.disabled-->steps.pod_chaos_wf.disabled.output
steps.pod_chaos_wf.disabled.output-->outputs.workflow_success.pod_chaos.disabled
steps.pod_chaos_wf.enabling-->steps.pod_chaos_wf.closed
steps.pod_chaos_wf.enabling-->steps.pod_chaos_wf.disabled
steps.pod_chaos_wf.enabling-->steps.pod_chaos_wf.enabling.resolved
steps.pod_chaos_wf.enabling-->steps.pod_chaos_wf.execute
steps.pod_chaos_wf.execute-->steps.pod_chaos_wf.outputs
steps.pod_chaos_wf.outputs-->steps.pod_chaos_wf.outputs.success
steps.pod_chaos_wf.outputs.success-->outputs.workflow_success.pod_chaos.enabled
%% Error path
steps.cpu_hog_wf.execute-->steps.cpu_hog_wf.failed
steps.cpu_hog_wf.failed-->steps.cpu_hog_wf.failed.error
steps.kubeburner_wf.execute-->steps.kubeburner_wf.failed
steps.kubeburner_wf.failed-->steps.kubeburner_wf.failed.error
steps.pod_chaos_wf.execute-->steps.pod_chaos_wf.failed
steps.pod_chaos_wf.failed-->steps.pod_chaos_wf.failed.error
%% Mermaid end
```
### Pod Chaos Workflow
```mermaid
%% Mermaid markdown workflow
flowchart LR
%% Success path
input-->steps.kill_pod.starting
steps.kill_pod.cancelled-->steps.kill_pod.closed
steps.kill_pod.cancelled-->steps.kill_pod.outputs
steps.kill_pod.closed-->steps.kill_pod.closed.result
steps.kill_pod.deploy-->steps.kill_pod.closed
steps.kill_pod.deploy-->steps.kill_pod.starting
steps.kill_pod.disabled-->steps.kill_pod.disabled.output
steps.kill_pod.enabling-->steps.kill_pod.closed
steps.kill_pod.enabling-->steps.kill_pod.disabled
steps.kill_pod.enabling-->steps.kill_pod.enabling.resolved
steps.kill_pod.enabling-->steps.kill_pod.starting
steps.kill_pod.outputs-->steps.kill_pod.outputs.success
steps.kill_pod.outputs.success-->outputs.success
steps.kill_pod.running-->steps.kill_pod.closed
steps.kill_pod.running-->steps.kill_pod.outputs
steps.kill_pod.starting-->steps.kill_pod.closed
steps.kill_pod.starting-->steps.kill_pod.running
steps.kill_pod.starting-->steps.kill_pod.starting.started
%% Error path
steps.kill_pod.cancelled-->steps.kill_pod.crashed
steps.kill_pod.cancelled-->steps.kill_pod.deploy_failed
steps.kill_pod.crashed-->steps.kill_pod.crashed.error
steps.kill_pod.deploy-->steps.kill_pod.deploy_failed
steps.kill_pod.deploy_failed-->steps.kill_pod.deploy_failed.error
steps.kill_pod.enabling-->steps.kill_pod.crashed
steps.kill_pod.outputs-->steps.kill_pod.outputs.error
steps.kill_pod.running-->steps.kill_pod.crashed
steps.kill_pod.starting-->steps.kill_pod.crashed
%% Mermaid end
```
### StressNG (CPU Hog) Workflow
```mermaid
%% Mermaid markdown workflow
flowchart LR
%% Success path
input-->steps.kubeconfig.starting
input-->steps.stressng.deploy
input-->steps.stressng.starting
steps.kubeconfig.cancelled-->steps.kubeconfig.closed
steps.kubeconfig.cancelled-->steps.kubeconfig.outputs
steps.kubeconfig.closed-->steps.kubeconfig.closed.result
steps.kubeconfig.deploy-->steps.kubeconfig.closed
steps.kubeconfig.deploy-->steps.kubeconfig.starting
steps.kubeconfig.disabled-->steps.kubeconfig.disabled.output
steps.kubeconfig.enabling-->steps.kubeconfig.closed
steps.kubeconfig.enabling-->steps.kubeconfig.disabled
steps.kubeconfig.enabling-->steps.kubeconfig.enabling.resolved
steps.kubeconfig.enabling-->steps.kubeconfig.starting
steps.kubeconfig.outputs-->steps.kubeconfig.outputs.success
steps.kubeconfig.outputs.success-->steps.stressng.deploy
steps.kubeconfig.running-->steps.kubeconfig.closed
steps.kubeconfig.running-->steps.kubeconfig.outputs
steps.kubeconfig.starting-->steps.kubeconfig.closed
steps.kubeconfig.starting-->steps.kubeconfig.running
steps.kubeconfig.starting-->steps.kubeconfig.starting.started
steps.stressng.cancelled-->steps.stressng.closed
steps.stressng.cancelled-->steps.stressng.outputs
steps.stressng.closed-->steps.stressng.closed.result
steps.stressng.deploy-->steps.stressng.closed
steps.stressng.deploy-->steps.stressng.starting
steps.stressng.disabled-->steps.stressng.disabled.output
steps.stressng.enabling-->steps.stressng.closed
steps.stressng.enabling-->steps.stressng.disabled
steps.stressng.enabling-->steps.stressng.enabling.resolved
steps.stressng.enabling-->steps.stressng.starting
steps.stressng.outputs-->steps.stressng.outputs.success
steps.stressng.outputs.success-->outputs.success
steps.stressng.running-->steps.stressng.closed
steps.stressng.running-->steps.stressng.outputs
steps.stressng.starting-->steps.stressng.closed
steps.stressng.starting-->steps.stressng.running
steps.stressng.starting-->steps.stressng.starting.started
%% Error path
steps.kubeconfig.cancelled-->steps.kubeconfig.crashed
steps.kubeconfig.cancelled-->steps.kubeconfig.deploy_failed
steps.kubeconfig.crashed-->steps.kubeconfig.crashed.error
steps.kubeconfig.deploy-->steps.kubeconfig.deploy_failed
steps.kubeconfig.deploy_failed-->steps.kubeconfig.deploy_failed.error
steps.kubeconfig.enabling-->steps.kubeconfig.crashed
steps.kubeconfig.outputs-->steps.kubeconfig.outputs.error
steps.kubeconfig.running-->steps.kubeconfig.crashed
steps.kubeconfig.starting-->steps.kubeconfig.crashed
steps.stressng.cancelled-->steps.stressng.crashed
steps.stressng.cancelled-->steps.stressng.deploy_failed
steps.stressng.crashed-->steps.stressng.crashed.error
steps.stressng.deploy-->steps.stressng.deploy_failed
steps.stressng.deploy_failed-->steps.stressng.deploy_failed.error
steps.stressng.enabling-->steps.stressng.crashed
steps.stressng.outputs-->steps.stressng.outputs.error
steps.stressng.running-->steps.stressng.crashed
steps.stressng.starting-->steps.stressng.crashed
%% Mermaid end
```
### Kube-Burner Workflow
```mermaid
%% Mermaid markdown workflow
flowchart LR
%% Success path
input-->steps.kubeburner.starting
steps.kubeburner.cancelled-->steps.kubeburner.closed
steps.kubeburner.cancelled-->steps.kubeburner.outputs
steps.kubeburner.closed-->steps.kubeburner.closed.result
steps.kubeburner.deploy-->steps.kubeburner.closed
steps.kubeburner.deploy-->steps.kubeburner.starting
steps.kubeburner.disabled-->steps.kubeburner.disabled.output
steps.kubeburner.enabling-->steps.kubeburner.closed
steps.kubeburner.enabling-->steps.kubeburner.disabled
steps.kubeburner.enabling-->steps.kubeburner.enabling.resolved
steps.kubeburner.enabling-->steps.kubeburner.starting
steps.kubeburner.outputs-->steps.kubeburner.outputs.success
steps.kubeburner.outputs.success-->outputs.success
steps.kubeburner.running-->steps.kubeburner.closed
steps.kubeburner.running-->steps.kubeburner.outputs
steps.kubeburner.starting-->steps.kubeburner.closed
steps.kubeburner.starting-->steps.kubeburner.running
steps.kubeburner.starting-->steps.kubeburner.starting.started
steps.uuidgen.cancelled-->steps.uuidgen.closed
steps.uuidgen.cancelled-->steps.uuidgen.outputs
steps.uuidgen.closed-->steps.uuidgen.closed.result
steps.uuidgen.deploy-->steps.uuidgen.closed
steps.uuidgen.deploy-->steps.uuidgen.starting
steps.uuidgen.disabled-->steps.uuidgen.disabled.output
steps.uuidgen.enabling-->steps.uuidgen.closed
steps.uuidgen.enabling-->steps.uuidgen.disabled
steps.uuidgen.enabling-->steps.uuidgen.enabling.resolved
steps.uuidgen.enabling-->steps.uuidgen.starting
steps.uuidgen.outputs-->steps.uuidgen.outputs.success
steps.uuidgen.outputs.success-->steps.kubeburner.starting
steps.uuidgen.running-->steps.uuidgen.closed
steps.uuidgen.running-->steps.uuidgen.outputs
steps.uuidgen.starting-->steps.uuidgen.closed
steps.uuidgen.starting-->steps.uuidgen.running
steps.uuidgen.starting-->steps.uuidgen.starting.started
%% Error path
steps.kubeburner.cancelled-->steps.kubeburner.crashed
steps.kubeburner.cancelled-->steps.kubeburner.deploy_failed
steps.kubeburner.crashed-->steps.kubeburner.crashed.error
steps.kubeburner.deploy-->steps.kubeburner.deploy_failed
steps.kubeburner.deploy_failed-->steps.kubeburner.deploy_failed.error
steps.kubeburner.enabling-->steps.kubeburner.crashed
steps.kubeburner.outputs-->steps.kubeburner.outputs.error
steps.kubeburner.running-->steps.kubeburner.crashed
steps.kubeburner.starting-->steps.kubeburner.crashed
steps.uuidgen.cancelled-->steps.uuidgen.crashed
steps.uuidgen.cancelled-->steps.uuidgen.deploy_failed
steps.uuidgen.crashed-->steps.uuidgen.crashed.error
steps.uuidgen.deploy-->steps.uuidgen.deploy_failed
steps.uuidgen.deploy_failed-->steps.uuidgen.deploy_failed.error
steps.uuidgen.enabling-->steps.uuidgen.crashed
steps.uuidgen.outputs-->steps.uuidgen.outputs.error
steps.uuidgen.running-->steps.uuidgen.crashed
steps.uuidgen.starting-->steps.uuidgen.crashed
%% Mermaid end
```

View File

@@ -0,0 +1,18 @@
---
deployers:
image:
deployer_name: podman
deployment:
imagePullPolicy: IfNotPresent
python:
deployer_name: python
modulePullPolicy: Always
pythonPath: /usr/bin/python
workdir: /tmp
log:
level: debug
logged_outputs:
error:
level: debug
success:
level: debug

View File

@@ -0,0 +1,41 @@
kubernetes_target:
kubeconfig_path:
cpu_hog_enabled: true
pod_chaos_enabled: true
kubeburner_enabled: true
kubeburner_list:
- kubeburner:
kubeconfig: 'given later in workflow by kubeconfig plugin'
workload: 'cluster-density'
qps: 20
burst: 20
log_level: 'info'
timeout: '1m'
iterations: 1
churn: 'true'
churn_duration: 1s
churn_delay: 1s
churn_percent: 10
alerting: 'true'
gc: 'true'
pod_chaos_list:
- namespace_pattern: ^openshift-etcd$
label_selector: k8s-app=etcd
kill: 1
krkn_pod_recovery_time: 1
cpu_hog_list:
- namespace: default
# set the node selector as a key-value pair eg.
# node_selector:
# kubernetes.io/hostname: kind-worker2
node_selector: {}
stressng_params:
timeout: 1
stressors:
- stressor: cpu
workers: 1
cpu-load: 20
cpu-method: all

View File

@@ -0,0 +1,75 @@
version: v0.2.0
input:
root: CpuHog__KubernetesTarget
objects:
CpuHog__KubernetesTarget:
id: CpuHog__KubernetesTarget
properties:
constant:
type:
type_id: ref
id: KubernetesTarget
item:
type:
type_id: ref
id: CpuHog
KubernetesTarget:
id: KubernetesTarget
properties:
kubeconfig_path:
type:
type_id: string
CpuHog:
id: CpuHog
properties:
namespace:
display:
description: The namespace where the container will be deployed
name: Namespace
type:
type_id: string
required: true
node_selector:
display:
description: kubernetes node name where the plugin must be deployed
type:
type_id: map
values:
type_id: string
keys:
type_id: string
required: true
stressng_params:
type:
type_id: ref
id: StressNGParams
namespace: $.steps.stressng.starting.inputs.input
steps:
kubeconfig:
plugin:
src: quay.io/arcalot/arcaflow-plugin-kubeconfig:0.3.1
deployment_type: image
input:
kubeconfig: !expr 'readFile($.input.constant.kubeconfig_path)'
stressng:
plugin:
src: quay.io/arcalot/arcaflow-plugin-stressng:0.8.0
deployment_type: image
step: workload
input: !expr $.input.item.stressng_params
deploy:
deployer_name: kubernetes
connection: !expr $.steps.kubeconfig.outputs.success.connection
pod:
metadata:
namespace: !expr $.input.item.namespace
labels:
arcaflow: stressng
spec:
nodeSelector: !expr $.input.item.node_selector
pluginContainer:
imagePullPolicy: Always
outputs:
success: !expr $.steps.stressng.outputs.success

View File

@@ -0,0 +1,54 @@
version: v0.2.0
input:
root: KubeBurner__KubernetesTarget
objects:
KubeBurner__KubernetesTarget:
id: KubeBurner__KubernetesTarget
properties:
constant:
type:
type_id: ref
id: KubernetesTarget
item:
type:
type_id: ref
id: KubeBurner
KubernetesTarget:
id: KubernetesTarget
properties:
kubeconfig_path:
type:
type_id: string
KubeBurner:
id: KubeBurner
properties:
kubeburner:
type:
type_id: ref
id: KubeBurnerInputParams
namespace: $.steps.kubeburner.starting.inputs.input
steps:
uuidgen:
plugin:
deployment_type: image
src: quay.io/arcalot/arcaflow-plugin-utilities:0.6.0
step: uuid
input: {}
kubeburner:
plugin:
deployment_type: image
src: quay.io/redhat-performance/arcaflow-plugin-kube-burner:latest
step: kube-burner
input:
kubeconfig: !expr 'readFile($.input.constant.kubeconfig_path)'
uuid: !expr $.steps.uuidgen.outputs.success.uuid
workload: !expr $.input.item.kubeburner.workload
iterations: !expr $.input.item.kubeburner.iterations
churn: !expr $.input.item.kubeburner.churn
churn_duration: !expr $.input.item.kubeburner.churn_duration
churn_delay: !expr $.input.item.kubeburner.churn_delay
outputs:
success:
burner: !expr $.steps.kubeburner.outputs.success

View File

@@ -0,0 +1,108 @@
version: v0.2.0
input:
root: KillPodConfig__KubernetesTarget
objects:
KillPodConfig__KubernetesTarget:
id: KillPodConfig__KubernetesTarget
properties:
constant:
type:
type_id: ref
id: KubernetesTarget
item:
type:
type_id: ref
id: KillPodConfig
KubernetesTarget:
id: KubernetesTarget
properties:
kubeconfig_path:
type:
type_id: string
KillPodConfig:
id: KillPodConfig
properties:
backoff:
default: '1'
display:
description: How many seconds to wait between checks for the target
pod status.
name: Backoff
required: false
type:
type_id: integer
kill:
default: '1'
display:
description: How many pods should we attempt to kill?
name: Number of pods to kill
required: false
type:
min: 1
type_id: integer
krkn_pod_recovery_time:
default: '60'
display:
description: The expected recovery time of the pod (used by Krkn to
monitor the pod lifecycle)
name: Recovery Time
required: false
type:
type_id: integer
label_selector:
display:
description: 'Kubernetes label selector for the target pods. Required
if name_pattern is not set.
See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
for details.'
name: Label selector
required: false
required_if_not:
- name_pattern
type:
type_id: string
name_pattern:
display:
description: Regular expression for target pods. Required if label_selector
is not set.
name: Name pattern
required: false
required_if_not:
- label_selector
type:
type_id: pattern
namespace_pattern:
display:
description: Regular expression for target pod namespaces.
name: Namespace pattern
required: true
type:
type_id: pattern
timeout:
default: '180'
display:
description: Timeout to wait for the target pod(s) to be removed in
seconds.
name: Timeout
required: false
type:
type_id: integer
steps:
kill_pod:
step: kill-pods
plugin:
deployment_type: python
src: arcaflow-plugin-kill-pod@git+https://github.com/krkn-chaos/arcaflow-plugin-kill-pod.git@a9f87f88d8e7763d111613bd8b2c7862fc49624f
input:
namespace_pattern: !expr $.input.item.namespace_pattern
label_selector: !expr $.input.item.label_selector
kubeconfig_path: !expr $.input.constant.kubeconfig_path
deploy:
deployer_name: python
modulePullPolicy: Always
pythonPath: /usr/bin/python
workdir: /tmp
outputs:
success: !expr $.steps.kill_pod.outputs.success

View File

@@ -0,0 +1,73 @@
version: v0.2.0
input:
root: RootObject
objects:
KubernetesTarget:
id: KubernetesTarget
properties:
kubeconfig_path:
type:
type_id: string
RootObject:
id: RootObject
properties:
cpu_hog_enabled:
type:
type_id: bool
pod_chaos_enabled:
type:
type_id: bool
kubeburner_enabled:
type:
type_id: bool
kubernetes_target:
type:
type_id: ref
id: KubernetesTarget
kubeburner_list:
type:
type_id: list
items:
type_id: ref
id: KubeBurner
namespace: $.steps.kubeburner_wf.execute.inputs.items
pod_chaos_list:
type:
type_id: list
items:
type_id: ref
id: KillPodConfig
namespace: $.steps.pod_chaos_wf.execute.inputs.items
cpu_hog_list:
type:
type_id: list
items:
type_id: ref
id: CpuHog
namespace: $.steps.cpu_hog_wf.execute.inputs.items
steps:
kubeburner_wf:
kind: foreach
items: !expr 'bindConstants($.input.kubeburner_list, $.input.kubernetes_target)'
workflow: subworkflows/kubeburner.yaml
parallelism: 1
enabled: !expr $.input.kubeburner_enabled
pod_chaos_wf:
kind: foreach
items: !expr 'bindConstants($.input.pod_chaos_list, $.input.kubernetes_target)'
workflow: subworkflows/pod-chaos.yaml
parallelism: 1
enabled: !expr $.input.pod_chaos_enabled
cpu_hog_wf:
kind: foreach
items: !expr 'bindConstants($.input.cpu_hog_list, $.input.kubernetes_target)'
workflow: subworkflows/cpu-hog.yaml
parallelism: 1
enabled: !expr $.input.cpu_hog_enabled
outputs:
workflow_success:
kubeburner: !ordisabled $.steps.kubeburner_wf.outputs.success
pod_chaos: !ordisabled $.steps.pod_chaos_wf.outputs.success
cpu_hog: !ordisabled $.steps.cpu_hog_wf.outputs.success