Compare commits
11 Commits
| SHA1 |
|---|
| 10ba53574e |
| 0ecba41082 |
| 491f59d152 |
| 2549c9a146 |
| 949f1f09e0 |
| 959766254d |
| 0e68dedb12 |
| 34a676a795 |
| e5c5b35db3 |
| 93d2e60386 |
| 462c9ac67e |
ROADMAP.md
@@ -6,10 +6,11 @@ Following are a list of enhancements that we are planning to work on adding supp
- [x] [Centralized storage for chaos experiments artifacts](https://github.com/krkn-chaos/krkn/issues/423)
- [ ] [Support for causing DNS outages](https://github.com/krkn-chaos/krkn/issues/394)
- [x] [Chaos recommender](https://github.com/krkn-chaos/krkn/tree/main/utils/chaos-recommender) to suggest scenarios having probability of impacting the service under test using profiling results
- [ ] Chaos AI integration to improve and automate test coverage
- [] Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time
- [x] [Support for pod level network traffic shaping](https://github.com/krkn-chaos/krkn/issues/393)
- [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/krkn-chaos/krkn/issues/124)
- [ ] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [ ] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- [ ] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [ ] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [x] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- [x] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [x] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)

@@ -38,11 +38,11 @@ A couple of [alert profiles](https://github.com/redhat-chaos/krkn/tree/main/conf
severity: critical
```

Kube-burner supports setting the severity for the alerts with each one having different effects:
Krkn supports setting the severity for the alerts with each one having different effects:

```
info: Prints an info message with the alarm description to stdout. By default all expressions have this severity.
warning: Prints a warning message with the alarm description to stdout.
error: Prints an error message with the alarm description to stdout and makes kube-burner rc = 1
error: Prints an error message with the alarm description to stdout and sets Krkn rc = 1
critical: Prints a fatal message with the alarm description to stdout and exits execution immediately with rc != 0
```

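For illustration, a minimal sketch (not Krkn's actual implementation) of how the severities listed above could map to behavior; the function name and signature are hypothetical:

```
import logging
import sys

def handle_alert(severity: str, description: str) -> int:
    """Map an alert severity to the behavior described above (illustrative only)."""
    if severity == "info":
        logging.info(description)       # message only, the default severity
        return 0
    if severity == "warning":
        logging.warning(description)    # message only
        return 0
    if severity == "error":
        logging.error(description)      # message and rc = 1, execution continues
        return 1
    if severity == "critical":
        logging.critical(description)   # fatal message, abort immediately with rc != 0
        sys.exit(1)
    return 0
```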
@@ -8,6 +8,7 @@ Current accepted cloud types:
* [GCP](cloud_setup.md#gcp)
* [AWS](cloud_setup.md#aws)
* [Openstack](cloud_setup.md#openstack)
* [IBMCloud](cloud_setup.md#ibmcloud)

```

@@ -11,7 +11,7 @@
* [Scenarios](#scenarios)
* [Test Environment Recommendations - how and where to run chaos tests](#test-environment-recommendations---how-and-where-to-run-chaos-tests)
* [Chaos testing in Practice](#chaos-testing-in-practice)
* [OpenShift oraganization](#openshift-organization)
* [OpenShift organization](#openshift-organization)
* [startx-lab](#startx-lab)

@@ -57,6 +57,8 @@ kind was primarily designed for testing Kubernetes itself, but may be used for l
#### GCP
Cloud setup instructions can be found [here](cloud_setup.md#gcp). Sample scenario config can be found [here](https://github.com/krkn-chaos/krkn/blob/main/scenarios/openshift/gcp_node_scenarios.yml).

NOTE: The parallel option is not available for GCP; the API does not perform operations concurrently.

#### Openstack

@@ -13,10 +13,12 @@ zone_outage: # Scenario to create an out
duration: 600 # Duration in seconds after which the zone will be back online.
vpc_id: # Cluster virtual private network to target.
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic.
default_acl_id: acl-xxxxxxxx # (Optional) ID of an existing network ACL to use instead of creating a new one. If provided, this ACL will not be deleted after the scenario.
```

**NOTE**: vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).
**NOTE**: Multiple zones will experience downtime in case of targeting multiple subnets which might have an impact on the cluster health especially if the zones have control plane components deployed.
**NOTE**: default_acl_id can be obtained from the AWS VPC Console by selecting "Network ACLs" from the left sidebar ( the ID will be in the format 'acl-xxxxxxxx' ). Make sure the selected ACL has the desired ingress/egress rules for your outage scenario ( i.e., deny all ).

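As an alternative to the console lookups described in the notes above, a hedged sketch of fetching the same IDs programmatically, assuming boto3 with valid AWS credentials; the region and VPC ID are placeholders:

```
import boto3

# Illustrative only: list subnets and network ACLs for a VPC so that
# vpc_id, subnet_id and default_acl_id can be filled into the scenario config.
ec2 = boto3.client("ec2", region_name="us-west-2")   # region is an assumption

vpc_id = "vpc-xxxxxxxx"                               # placeholder
subnets = ec2.describe_subnets(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
for subnet in subnets["Subnets"]:
    print(subnet["SubnetId"], subnet["AvailabilityZone"])

acls = ec2.describe_network_acls(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
for acl in acls["NetworkAcls"]:
    print(acl["NetworkAclId"], "default:", acl["IsDefault"])
```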
##### Debugging steps in case of failures
In case of failures during the steps which revert back the network acl to allow traffic and bring back the cluster nodes in the zone, the nodes in the particular zone will be in `NotReady` condition. Here is how to fix it:

@@ -34,7 +34,16 @@ class IbmCloud:
self.service.set_service_url(service_url)
except Exception as e:
logging.error("error authenticating" + str(e))
sys.exit(1)


# Get the instance ID of the node
def get_instance_id(self, node_name):
node_list = self.list_instances()
for node in node_list:
if node_name == node["vpc_name"]:
return node["vpc_id"]
logging.error("Couldn't find node with name " + str(node_name) + ", you could try another region")
sys.exit(1)

def delete_instance(self, instance_id):
"""

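A brief hedged usage sketch of the lookup added above; the node name is a placeholder, and the exit-on-miss behavior follows the sys.exit(1) shown in the diff:

```
# Illustrative only: assumes an authenticated IbmCloud() wrapper as in the diff.
ibm = IbmCloud()
instance_id = ibm.get_instance_id("my-worker-node")   # exits with rc 1 if no VPC instance matches
ibm.delete_instance(instance_id)                      # terminate the matched instance
```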
@@ -8,19 +8,28 @@ from krkn_lib.k8s import KrknKubernetes
node_general = False


def get_node_by_name(node_name_list, kubecli: KrknKubernetes):
killable_nodes = kubecli.list_killable_nodes()
for node_name in node_name_list:
if node_name not in killable_nodes:
logging.info(
f"Node with provided ${node_name} does not exist or the node might "
"be in NotReady state."
)
return
return node_name_list


# Pick a random node with specified label selector
def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubernetes):
if node_name in kubecli.list_killable_nodes():
return [node_name]
elif node_name:
logging.info(
"Node with provided node_name does not exist or the node might "
"be in NotReady state."
)
nodes = kubecli.list_killable_nodes(label_selector)
def get_node(label_selector, instance_kill_count, kubecli: KrknKubernetes):

label_selector_list = label_selector.split(",")
nodes = []
for label_selector in label_selector_list:
nodes.extend(kubecli.list_killable_nodes(label_selector))
if not nodes:
raise Exception("Ready nodes with the provided label selector do not exist")
logging.info("Ready nodes with the label selector %s: %s" % (label_selector, nodes))
logging.info("Ready nodes with the label selector %s: %s" % (label_selector_list, nodes))
number_of_nodes = len(nodes)
if instance_kill_count == number_of_nodes:
return nodes

@@ -35,22 +44,19 @@ def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubern
# krkn_lib
# Wait until the node status becomes Ready
def wait_for_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "True", timeout, resource_version)
kubecli.watch_node_status(node, "True", timeout)


# krkn_lib
# Wait until the node status becomes Not Ready
def wait_for_not_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "False", timeout, resource_version)
kubecli.watch_node_status(node, "False", timeout)


# krkn_lib
# Wait until the node status becomes Unknown
def wait_for_unknown_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "Unknown", timeout, resource_version)
kubecli.watch_node_status(node, "Unknown", timeout)


# Get the ip of the cluster node

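As a hedged usage sketch of the new get_node() signature shown above: a comma-separated label selector now fans out into one list_killable_nodes() call per label and the results are merged. The node names, labels and kubecli construction below are assumptions for illustration, not Krkn's own wiring:

```
from krkn_lib.k8s import KrknKubernetes

# Illustrative only: kubeconfig path and labels are placeholders.
kubecli = KrknKubernetes(kubeconfig_path="~/.kube/config")

# Multiple labels in one selector, merged into a single candidate pool.
nodes = get_node(
    label_selector="node-role.kubernetes.io/worker,node-role.kubernetes.io/infra",
    instance_kill_count=2,
    kubecli=kubecli,
)

# Explicit node names are resolved (and checked for Ready state) separately.
named = get_node_by_name(["worker-0", "worker-1"], kubecli)
```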
@@ -1,5 +1,7 @@
import logging
import time
from multiprocessing.pool import ThreadPool
from itertools import repeat

import yaml
from krkn_lib.k8s import KrknKubernetes

@@ -64,23 +66,23 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
global node_general
node_general = True
return general_node_scenarios(kubecli)
if node_scenario["cloud_type"] == "aws":
if node_scenario["cloud_type"].lower() == "aws":
return aws_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "gcp":
elif node_scenario["cloud_type"].lower() == "gcp":
return gcp_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "openstack":
elif node_scenario["cloud_type"].lower() == "openstack":
from krkn.scenario_plugins.node_actions.openstack_node_scenarios import (
openstack_node_scenarios,
)

return openstack_node_scenarios(kubecli)
elif (
node_scenario["cloud_type"] == "azure"
node_scenario["cloud_type"].lower() == "azure"
or node_scenario["cloud_type"] == "az"
):
return azure_node_scenarios(kubecli)
elif (
node_scenario["cloud_type"] == "alibaba"
node_scenario["cloud_type"].lower() == "alibaba"
or node_scenario["cloud_type"] == "alicloud"
):
from krkn.scenario_plugins.node_actions.alibaba_node_scenarios import (
@@ -88,7 +90,7 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
)

return alibaba_node_scenarios(kubecli)
elif node_scenario["cloud_type"] == "bm":
elif node_scenario["cloud_type"].lower() == "bm":
from krkn.scenario_plugins.node_actions.bm_node_scenarios import (
bm_node_scenarios,
)
@@ -99,7 +101,7 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
node_scenario.get("bmc_password", None),
kubecli,
)
elif node_scenario["cloud_type"] == "docker":
elif node_scenario["cloud_type"].lower() == "docker":
return docker_node_scenarios(kubecli)
else:
logging.error(
@@ -120,100 +122,128 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):
def inject_node_scenario(
self, action, node_scenario, node_scenario_object, kubecli: KrknKubernetes
):
generic_cloud_scenarios = ("stop_kubelet_scenario", "node_crash_scenario")
# Get the node scenario configurations
run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)

# Get the node scenario configurations for setting nodes

instance_kill_count = get_yaml_item_value(node_scenario, "instance_count", 1)
node_name = get_yaml_item_value(node_scenario, "node_name", "")
label_selector = get_yaml_item_value(node_scenario, "label_selector", "")
parallel_nodes = get_yaml_item_value(node_scenario, "parallel", False)

# Get the node to apply the scenario
if node_name:
node_name_list = node_name.split(",")
nodes = common_node_functions.get_node_by_name(node_name_list, kubecli)
else:
nodes = common_node_functions.get_node(
label_selector, instance_kill_count, kubecli
)

# GCP api doesn't support multiprocessing calls, will only actually run 1
if parallel_nodes and node_scenario['cloud_type'].lower() != "gcp":
self.multiprocess_nodes(nodes, node_scenario_object, action, node_scenario)
else:
for single_node in nodes:
self.run_node(single_node, node_scenario_object, action, node_scenario)

def multiprocess_nodes(self, nodes, node_scenario_object, action, node_scenario):
try:
logging.info("parallely call to nodes")
# pool object with number of element
pool = ThreadPool(processes=len(nodes))

pool.starmap(self.run_node,zip(nodes, repeat(node_scenario_object), repeat(action), repeat(node_scenario)))

pool.close()
except Exception as e:
logging.info("Error on pool multiprocessing: " + str(e))


def run_node(self, single_node, node_scenario_object, action, node_scenario):
logging.info("action" + str(action))
# Get the scenario specifics for running action nodes
run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)
if action == "node_stop_start_scenario":
duration = get_yaml_item_value(node_scenario, "duration", 120)

timeout = get_yaml_item_value(node_scenario, "timeout", 120)
service = get_yaml_item_value(node_scenario, "service", "")
ssh_private_key = get_yaml_item_value(
node_scenario, "ssh_private_key", "~/.ssh/id_rsa"
)
# Get the node to apply the scenario
if node_name:
node_name_list = node_name.split(",")
else:
node_name_list = [node_name]
for single_node_name in node_name_list:
nodes = common_node_functions.get_node(
single_node_name, label_selector, instance_kill_count, kubecli
generic_cloud_scenarios = ("stop_kubelet_scenario", "node_crash_scenario")

if node_general and action not in generic_cloud_scenarios:
logging.info(
"Scenario: "
+ action
+ " is not set up for generic cloud type, skipping action"
)
for single_node in nodes:
if node_general and action not in generic_cloud_scenarios:
logging.info(
"Scenario: "
+ action
+ " is not set up for generic cloud type, skipping action"
else:
if action == "node_start_scenario":
node_scenario_object.node_start_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_scenario":
node_scenario_object.node_stop_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_start_scenario":
node_scenario_object.node_stop_start_scenario(
run_kill_count, single_node, timeout, duration
)
elif action == "node_termination_scenario":
node_scenario_object.node_termination_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_reboot_scenario":
node_scenario_object.node_reboot_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_kubelet_scenario":
node_scenario_object.stop_start_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "restart_kubelet_scenario":
node_scenario_object.restart_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_kubelet_scenario":
node_scenario_object.stop_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_crash_scenario":
node_scenario_object.node_crash_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_helper_node_scenario":
if node_scenario["cloud_type"] != "openstack":
logging.error(
"Scenario: " + action + " is not supported for "
"cloud type "
+ node_scenario["cloud_type"]
+ ", skipping action"
)
else:
if action == "node_start_scenario":
node_scenario_object.node_start_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_scenario":
node_scenario_object.node_stop_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_stop_start_scenario":
node_scenario_object.node_stop_start_scenario(
run_kill_count, single_node, timeout, duration
)
elif action == "node_termination_scenario":
node_scenario_object.node_termination_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_reboot_scenario":
node_scenario_object.node_reboot_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_kubelet_scenario":
node_scenario_object.stop_start_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "restart_kubelet_scenario":
node_scenario_object.restart_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_kubelet_scenario":
node_scenario_object.stop_kubelet_scenario(
run_kill_count, single_node, timeout
)
elif action == "node_crash_scenario":
node_scenario_object.node_crash_scenario(
run_kill_count, single_node, timeout
)
elif action == "stop_start_helper_node_scenario":
if node_scenario["cloud_type"] != "openstack":
logging.error(
"Scenario: " + action + " is not supported for "
"cloud type "
+ node_scenario["cloud_type"]
+ ", skipping action"
)
else:
if not node_scenario["helper_node_ip"]:
logging.error("Helper node IP address is not provided")
raise Exception(
"Helper node IP address is not provided"
)
node_scenario_object.helper_node_stop_start_scenario(
run_kill_count, node_scenario["helper_node_ip"], timeout
)
node_scenario_object.helper_node_service_status(
node_scenario["helper_node_ip"],
service,
ssh_private_key,
timeout,
)
else:
logging.info(
"There is no node action that matches %s, skipping scenario"
% action
if not node_scenario["helper_node_ip"]:
logging.error("Helper node IP address is not provided")
raise Exception(
"Helper node IP address is not provided"
)
node_scenario_object.helper_node_stop_start_scenario(
run_kill_count, node_scenario["helper_node_ip"], timeout
)
node_scenario_object.helper_node_service_status(
node_scenario["helper_node_ip"],
service,
ssh_private_key,
timeout,
)
else:
logging.info(
"There is no node action that matches %s, skipping scenario"
% action
)

def get_scenario_types(self) -> list[str]:
return ["node_scenarios"]

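The parallel path added above fans each selected node out to a worker thread. A self-contained sketch of the same ThreadPool.starmap/zip/repeat pattern, with a stand-in action function and placeholder node names, assuming nothing about Krkn's own classes:

```
from itertools import repeat
from multiprocessing.pool import ThreadPool

def run_node(node, scenario_object, action, scenario_config):
    # Stand-in for NodeActionsScenarioPlugin.run_node(); illustrative only.
    print(f"running {action} on {node} with {scenario_config}")

nodes = ["worker-0", "worker-1", "worker-2"]                     # placeholder node names
scenario_object = object()                                       # placeholder cloud scenario object
action, config = "node_reboot_scenario", {"timeout": 360}

# One worker thread per node; repeat() duplicates the shared arguments
# so starmap() can pair them with each node.
pool = ThreadPool(processes=len(nodes))
pool.starmap(run_node, zip(nodes, repeat(scenario_object), repeat(action), repeat(config)))
pool.close()
pool.join()
```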
@@ -29,6 +29,9 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
pvc_name = get_yaml_item_value(scenario_config, "pvc_name", "")
pod_name = get_yaml_item_value(scenario_config, "pod_name", "")
namespace = get_yaml_item_value(scenario_config, "namespace", "")
block_size = get_yaml_item_value(
scenario_config, "block_size", "102400"
)
target_fill_percentage = get_yaml_item_value(
scenario_config, "fill_percentage", "50"
)
@@ -197,10 +200,12 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
str(full_path),
)
elif dd is not None:
block_size = int(block_size)
blocks = int(file_size_kb / int(block_size / 1024))
logging.warning(
"fallocate not found, using dd, it may take longer based on the amount of data, please wait..."
)
command = f"dd if=/dev/urandom of={str(full_path)} bs=1024 count={str(file_size_kb)} oflag=direct"
command = f"dd if=/dev/urandom of={str(full_path)} bs={str(block_size)} count={str(blocks)} oflag=direct"
else:
logging.error(
"failed to locate required binaries fallocate or dd to execute the scenario"
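To make the block arithmetic above concrete, a hedged worked example with made-up numbers (a 500 MB file and the default 102400-byte block size); the output path is a placeholder:

```
file_size_kb = 512000                                   # 500 MB to write, in KB (example value)
block_size = 102400                                     # dd block size in bytes (scenario default)
blocks = int(file_size_kb / int(block_size / 1024))     # 512000 / 100 = 5120 blocks
command = f"dd if=/dev/urandom of=/tmp/kraken.tmp bs={block_size} count={blocks} oflag=direct"
print(command)   # dd if=/dev/urandom of=/tmp/kraken.tmp bs=102400 count=5120 oflag=direct
```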
@@ -241,45 +246,6 @@ class PvcScenarioPlugin(AbstractScenarioPlugin):
)
return 1

# Calculate file size
file_size_kb = int(
(float(target_fill_percentage / 100) * float(pvc_capacity_kb))
- float(pvc_used_kb)
)
logging.debug("File size: %s KB" % file_size_kb)

file_name = "kraken.tmp"
logging.info(
"Creating %s file, %s KB size, in pod %s at %s (ns %s)"
% (
str(file_name),
str(file_size_kb),
str(pod_name),
str(mount_path),
str(namespace),
)
)

start_time = int(time.time())
# Create temp file in the PVC
full_path = "%s/%s" % (str(mount_path), str(file_name))
command = "fallocate -l $((%s*1024)) %s" % (
str(file_size_kb),
str(full_path),
)
logging.debug("Create temp file in the PVC command:\n %s" % command)
lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
[command], pod_name, namespace, container_name
)

# Check if file is created
command = "ls -lh %s" % (str(mount_path))
logging.debug("Check file is created command:\n %s" % command)
response = lib_telemetry.get_lib_kubernetes().exec_cmd_in_pod(
[command], pod_name, namespace, container_name
)
logging.info("\n" + str(response))
if str(file_name).lower() in str(response).lower():
logging.info(
"Waiting for the specified duration in the config: %ss" % duration
)

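For intuition, a hedged worked example of the file-size calculation above, using made-up PVC numbers (10 GiB capacity, 2 GiB already used, 50% target fill):

```
target_fill_percentage = 50
pvc_capacity_kb = 10 * 1024 * 1024      # 10 GiB PVC, in KB (example value)
pvc_used_kb = 2 * 1024 * 1024           # 2 GiB already used (example value)

file_size_kb = int(
    (float(target_fill_percentage / 100) * float(pvc_capacity_kb))
    - float(pvc_used_kb)
)
print(file_size_kb)   # 3145728 KB, i.e. a 3 GiB temp file brings usage to the 50% target
```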
@@ -13,6 +13,7 @@ from krkn.scenario_plugins.node_actions.aws_node_scenarios import AWS
from krkn.scenario_plugins.node_actions.az_node_scenarios import Azure
from krkn.scenario_plugins.node_actions.gcp_node_scenarios import GCP
from krkn.scenario_plugins.node_actions.openstack_node_scenarios import OPENSTACKCLOUD
from krkn.scenario_plugins.native.node_scenarios.ibmcloud_plugin import IbmCloud


class ShutDownScenarioPlugin(AbstractScenarioPlugin):
@@ -86,6 +87,8 @@ class ShutDownScenarioPlugin(AbstractScenarioPlugin):
cloud_object = OPENSTACKCLOUD()
elif cloud_type.lower() in ["azure", "az"]:
cloud_object = Azure()
elif cloud_type.lower() in ["ibm", "ibmcloud"]:
cloud_object = IbmCloud()
else:
logging.error(
"Cloud type %s is not currently supported for cluster shut down"

@@ -29,6 +29,8 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
subnet_ids = scenario_config["subnet_id"]
duration = scenario_config["duration"]
cloud_type = scenario_config["cloud_type"]
# Add support for user-provided default network ACL
default_acl_id = scenario_config.get("default_acl_id")
ids = {}
acl_ids_created = []

@@ -58,7 +60,20 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
"Network association ids associated with "
"the subnet %s: %s" % (subnet_id, network_association_ids)
)
acl_id = cloud_object.create_default_network_acl(vpc_id)

# Use provided default ACL if available, otherwise create a new one
if default_acl_id:
acl_id = default_acl_id
logging.info(
"Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
default_acl_id
)
# Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
else:
acl_id = cloud_object.create_default_network_acl(vpc_id)
logging.info("Created new default ACL %s", acl_id)
acl_ids_created.append(acl_id)

new_association_id = cloud_object.replace_network_acl_association(
network_association_ids[0], acl_id
)
@@ -66,7 +81,6 @@ class ZoneOutageScenarioPlugin(AbstractScenarioPlugin):
# capture the orginal_acl_id, created_acl_id and
# new association_id to use during the recovery
ids[new_association_id] = original_acl_id
acl_ids_created.append(acl_id)

# wait for the specified duration
logging.info(

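For context, a hedged sketch of what the swap-and-restore flow above amounts to at the AWS API level, assuming boto3; Krkn drives this through its own cloud_object wrapper, so the calls and IDs below are illustrative placeholders rather than the plugin's actual code:

```
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # region is a placeholder

# Association between the target subnet and its current (original) ACL.
association_id = "aclassoc-xxxxxxxx"                  # placeholder
deny_all_acl_id = "acl-xxxxxxxx"                      # user-provided or newly created deny-all ACL
original_acl_id = "acl-yyyyyyyy"                      # captured before the swap

# Point the subnet at the deny-all ACL; AWS returns a new association id.
resp = ec2.replace_network_acl_association(
    AssociationId=association_id, NetworkAclId=deny_all_acl_id
)
new_association_id = resp["NewAssociationId"]

time.sleep(600)   # outage duration from the scenario config

# Recovery: point the subnet back at the original ACL.
ec2.replace_network_acl_association(
    AssociationId=new_association_id, NetworkAclId=original_acl_id
)
```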
@@ -15,7 +15,7 @@ google-api-python-client==2.116.0
ibm_cloud_sdk_core==3.18.0
ibm_vpc==0.20.0
jinja2==3.1.4
krkn-lib==4.0.3
krkn-lib==4.0.4
lxml==5.1.0
kubernetes==28.1.0
numpy==1.26.4
@@ -32,7 +32,7 @@ requests==2.32.2
service_identity==24.1.0
PyYAML==6.0.1
setuptools==70.0.0
werkzeug==3.0.3
werkzeug==3.0.6
wheel==0.42.0
zope.interface==5.4.0

@@ -627,7 +627,7 @@ if __name__ == "__main__":
junit_testcase_xml = get_junit_test_case(
success=True if retval == 0 else False,
time=int(junit_endtime - junit_start_time),
test_suite_name="krkn-test-suite",
test_suite_name="chaos-krkn",
test_case_description=options.junit_testcase,
test_stdout=tee_handler.get_output(),
test_version=options.junit_testcase_version,

@@ -1,13 +1,14 @@
node_scenarios:
- actions: # node chaos scenarios to be injected
- actions: # node chaos scenarios to be injected
- node_stop_start_scenario
node_name: # node on which scenario has to be injected; can set multiple names separated by comma
label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
runs: 1 # number of times to inject each scenario under actions (will perform on same node each time)
timeout: 360 # duration to wait for completion of node scenario injection
duration: 120 # duration to stop the node before running the start action
cloud_type: aws # cloud type on which Kubernetes/OpenShift runs
node_name: # node on which scenario has to be injected; can set multiple names separated by comma
label_selector: node-role.kubernetes.io/worker # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection; can specify multiple by a comma separated list
instance_count: 2 # Number of nodes to perform action/select that match the label selector
runs: 1 # number of times to inject each scenario under actions (will perform on same node each time)
timeout: 360 # duration to wait for completion of node scenario injection
duration: 20 # duration to stop the node before running the start action
cloud_type: aws # cloud type on which Kubernetes/OpenShift runs
parallel: true # Run action on label or node name in parallel or sequential, defaults to sequential
- actions:
- node_reboot_scenario
node_name:

@@ -4,3 +4,4 @@ pvc_scenario:
namespace: <namespace_name> # Namespace where the PVC is
fill_percentage: 50 # Target percentage to fill up the cluster, value must be higher than current percentage, valid values are between 0 and 99
duration: 60 # Duration in seconds for the fault
block_size: 102400 # used only by dd if fallocate not present in the container

@@ -3,3 +3,4 @@ zone_outage: # Scenario to create an out
duration: 600 # duration in seconds after which the zone will be back online
vpc_id: # cluster virtual private network to target
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic
default_acl_id: acl-xxxxxxxx # (Optional) ID of an existing network ACL to use instead of creating a new one. If provided, this ACL will not be deleted after the scenario.