OCM/ACM integration (#370)

* OCM support for ManagedClusters

* Updated docs and general adjustments

* Improved docs

* Improved docs2

* Removed io packet import

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Removed time from imports

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Removed duplicate logging import

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Removed sys import

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

* Update run.py

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>
This commit is contained in:
José Castillo Lema
2023-01-10 14:58:17 +01:00
committed by GitHub
parent bed40b0c6a
commit d76ab31155
10 changed files with 446 additions and 0 deletions

View File

@@ -69,6 +69,7 @@ Scenario type | Kubernetes | OpenShift
[Application_outages](docs/application_outages.md) | :heavy_check_mark: | :heavy_check_mark: |
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: | :heavy_check_mark: |
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: | :heavy_check_mark: |
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: | :question: |
### Kraken scenario pass/fail criteria and report
@@ -96,6 +97,9 @@ Kraken supports capturing metrics for the duration of the scenarios defined in t
### Alerts
In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with the Prometheus expressions to validate and alerts, exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or alert on abnormalities observed in the cluster based on the metrics. Information on enabling and leveraging this feature can be found [here](docs/alerts.md).
### OCM / ACM integration
Kraken supports injecting faults into [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management) managed clusters through [ManagedCluster Scenarios](docs/managedcluster_scenarios.md).
### Blogs and other useful resources
- Blog post on introduction to Kraken: https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes

View File

@@ -0,0 +1,36 @@
### ManagedCluster Scenarios
[ManagedCluster](https://open-cluster-management.io/concepts/managedcluster/) scenarios provide a way to integrate kraken with [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management).
ManagedCluster scenarios leverage [ManifestWorks](https://open-cluster-management.io/concepts/manifestwork/) to inject faults into the ManagedClusters.
The following ManagedCluster chaos scenarios are supported:
1. **managedcluster_start_scenario**: Scenario to start the ManagedCluster instance.
2. **managedcluster_stop_scenario**: Scenario to stop the ManagedCluster instance.
3. **managedcluster_stop_start_scenario**: Scenario to stop and then start the ManagedCluster instance.
4. **start_klusterlet_scenario**: Scenario to start the klusterlet of the ManagedCluster instance.
5. **stop_klusterlet_scenario**: Scenario to stop the klusterlet of the ManagedCluster instance.
6. **stop_start_klusterlet_scenario**: Scenario to stop and start the klusterlet of the ManagedCluster instance.
ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under `managedcluster_scenarios` option in the Kraken config. Refer to [managedcluster_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/kube/managedcluster_scenarios_example.yml) config file.
```
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60
```

View File

@@ -175,6 +175,26 @@ def list_killable_nodes(label_selector=None):
return nodes
# List managedclusters attached to the hub that can be killed
def list_killable_managedclusters(label_selector=None):
managedclusters = []
try:
ret = custom_object_client.list_cluster_custom_object(
group="cluster.open-cluster-management.io",
version="v1",
plural="managedclusters",
label_selector=label_selector
)
except ApiException as e:
logging.error("Exception when calling CustomObjectsApi->list_cluster_custom_object: %s\n" % e)
raise e
for managedcluster in ret['items']:
conditions = managedcluster['status']['conditions']
available = list(filter(lambda condition: condition['reason'] == 'ManagedClusterAvailable', conditions))
if available and available[0]['status'] == 'True':
managedclusters.append(managedcluster['metadata']['name'])
return managedclusters
# List pods in the given namespace
def list_pods(namespace, label_selector=None):
pods = []
@@ -362,6 +382,33 @@ def create_job(body, namespace="default"):
raise
def create_manifestwork(body, namespace):
try:
api_response = custom_object_client.create_namespaced_custom_object(
group="work.open-cluster-management.io",
version="v1",
plural="manifestworks",
body=body,
namespace=namespace
)
return api_response
except ApiException as e:
print("Exception when calling CustomObjectsApi->create_namespaced_custom_object: %s\n" % e)
def delete_manifestwork(namespace):
try:
api_response = custom_object_client.delete_namespaced_custom_object(
group="work.open-cluster-management.io",
version="v1",
plural="manifestworks",
name="managedcluster-scenarios-template",
namespace=namespace
)
return api_response
except ApiException as e:
print("Exception when calling CustomObjectsApi->delete_namespaced_custom_object: %s\n" % e)
def get_job_status(name, namespace="default"):
try:
return batch_cli.read_namespaced_job_status(
@@ -814,6 +861,30 @@ def watch_node_status(node, status, timeout, resource_version):
watch_resource.stop()
# Watch for a specific managedcluster status
# TODO: Implement this with a watcher instead of polling
def watch_managedcluster_status(managedcluster, status, timeout):
elapsed_time = 0
while True:
conditions = custom_object_client.get_cluster_custom_object_status(
"cluster.open-cluster-management.io", "v1", "managedclusters", managedcluster
)['status']['conditions']
available = list(filter(lambda condition: condition['reason'] == 'ManagedClusterAvailable', conditions))
if status == "True":
if available and available[0]['status'] == "True":
logging.info("Status of managedcluster " + managedcluster + ": Available")
return True
else:
if not available:
logging.info("Status of managedcluster " + managedcluster + ": Unavailable")
return True
time.sleep(2)
elapsed_time += 2
if elapsed_time >= timeout:
logging.info("Timeout waiting for managedcluster " + managedcluster + " to become: " + status)
return False
# Get the resource version for the specified node
def get_node_resource_version(node):
return cli.read_node(name=node).metadata.resource_version

View File

@@ -0,0 +1,34 @@
import random
import logging
import kraken.kubernetes.client as kubecli
# Pick a random managedcluster with specified label selector
def get_managedcluster(managedcluster_name, label_selector, instance_kill_count):
if managedcluster_name in kubecli.list_killable_managedclusters():
return [managedcluster_name]
elif managedcluster_name:
logging.info("managedcluster with provided managedcluster_name does not exist or the managedcluster might " "be in unavailable state.")
managedclusters = kubecli.list_killable_managedclusters(label_selector)
if not managedclusters:
raise Exception("Available managedclusters with the provided label selector do not exist")
logging.info("Available managedclusters with the label selector %s: %s" % (label_selector, managedclusters))
number_of_managedclusters = len(managedclusters)
if instance_kill_count == number_of_managedclusters:
return managedclusters
managedclusters_to_return = []
for i in range(instance_kill_count):
managedcluster_to_add = managedclusters[random.randint(0, len(managedclusters) - 1)]
managedclusters_to_return.append(managedcluster_to_add)
managedclusters.remove(managedcluster_to_add)
return managedclusters_to_return
# Wait until the managedcluster status becomes Available
def wait_for_available_status(managedcluster, timeout):
kubecli.watch_managedcluster_status(managedcluster, "True", timeout)
# Wait until the managedcluster status becomes Not Available
def wait_for_unavailable_status(managedcluster, timeout):
kubecli.watch_managedcluster_status(managedcluster, "Unknown", timeout)

View File

@@ -0,0 +1,140 @@
from jinja2 import Environment, FileSystemLoader
import os
import time
import logging
import sys
import yaml
import html
import kraken.kubernetes.client as kubecli
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions
class GENERAL:
def __init__(self):
pass
class managedcluster_scenarios():
def __init__(self):
self.general = GENERAL()
# managedcluster scenario to start the managedcluster
def managedcluster_start_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_start_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3 &
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 1 -n open-cluster-management-agent""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_start_scenario has been successfully injected!")
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_available_status(managedcluster, timeout)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop the managedcluster
def managedcluster_stop_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_stop_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)),encoding='utf-8')
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0 &&
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 0 -n open-cluster-management-agent""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_stop_scenario has been successfully injected!")
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_unavailable_status(managedcluster, timeout)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop and then start the managedcluster
def managedcluster_stop_start_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("Starting managedcluster_stop_start_scenario injection")
self.managedcluster_stop_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.managedcluster_start_scenario(instance_kill_count, managedcluster, timeout)
logging.info("managedcluster_stop_start_scenario has been successfully injected!")
# managedcluster scenario to terminate the managedcluster
def managedcluster_termination_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster termination is not implemented, " "no action is going to be taken")
# managedcluster scenario to reboot the managedcluster
def managedcluster_reboot_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster reboot is not implemented," " no action is going to be taken")
# managedcluster scenario to start the klusterlet
def start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting start_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("start_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop the klusterlet
def stop_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting stop_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("stop_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)
# managedcluster scenario to stop and start the klusterlet
def stop_start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("Starting stop_start_klusterlet_scenario injection")
self.stop_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.start_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
logging.info("stop_start_klusterlet_scenario has been successfully injected!")
# managedcluster scenario to crash the managedcluster
def managedcluster_crash_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster crash scenario is not implemented, " "no action is going to be taken")

View File

@@ -0,0 +1,68 @@
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
namespace: {{managedcluster_name}}
name: managedcluster-scenarios-template
spec:
workload:
manifests:
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: scale-deploy
namespace: open-cluster-management
rules:
- apiGroups: ["apps"]
resources: ["deployments/scale"]
verbs: ["patch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get"]
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management-agent
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: v1
kind: ServiceAccount
metadata:
name: internal-kubectl
namespace: open-cluster-management
- apiVersion: batch/v1
kind: Job
metadata:
name: managedcluster-scenarios-template
namespace: open-cluster-management
spec:
template:
spec:
serviceAccountName: internal-kubectl
containers:
- name: kubectl
image: quay.io/sighup/kubectl-kustomize:1.21.6_3.9.1
command: ["/bin/sh", "-c"]
args:
- {{args}}
restartPolicy: Never
backoffLimit: 0

View File

@@ -0,0 +1,66 @@
import yaml
import logging
import time
from kraken.managedcluster_scenarios.managedcluster_scenarios import managedcluster_scenarios
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions
import kraken.cerberus.setup as cerberus
# Get the managedcluster scenarios object of specfied cloud type
def get_managedcluster_scenario_object(managedcluster_scenario):
return managedcluster_scenarios()
# Run defined scenarios
def run(scenarios_list, config, wait_duration):
for managedcluster_scenario_config in scenarios_list:
with open(managedcluster_scenario_config, "r") as f:
managedcluster_scenario_config = yaml.full_load(f)
for managedcluster_scenario in managedcluster_scenario_config["managedcluster_scenarios"]:
managedcluster_scenario_object = get_managedcluster_scenario_object(managedcluster_scenario)
if managedcluster_scenario["actions"]:
for action in managedcluster_scenario["actions"]:
start_time = int(time.time())
inject_managedcluster_scenario(action, managedcluster_scenario, managedcluster_scenario_object)
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
end_time = int(time.time())
cerberus.get_status(config, start_time, end_time)
logging.info("")
# Inject the specified managedcluster scenario
def inject_managedcluster_scenario(action, managedcluster_scenario, managedcluster_scenario_object):
# Get the managedcluster scenario configurations
run_kill_count = managedcluster_scenario.get("runs", 1)
instance_kill_count = managedcluster_scenario.get("instance_count", 1)
managedcluster_name = managedcluster_scenario.get("managedcluster_name", "")
label_selector = managedcluster_scenario.get("label_selector", "")
timeout = managedcluster_scenario.get("timeout", 120)
# Get the managedcluster to apply the scenario
if managedcluster_name:
managedcluster_name_list = managedcluster_name.split(",")
else:
managedcluster_name_list = [managedcluster_name]
for single_managedcluster_name in managedcluster_name_list:
managedclusters = common_managedcluster_functions.get_managedcluster(single_managedcluster_name, label_selector, instance_kill_count)
for single_managedcluster in managedclusters:
if action == "managedcluster_start_scenario":
managedcluster_scenario_object.managedcluster_start_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_stop_scenario":
managedcluster_scenario_object.managedcluster_stop_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_stop_start_scenario":
managedcluster_scenario_object.managedcluster_stop_start_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_termination_scenario":
managedcluster_scenario_object.managedcluster_termination_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_reboot_scenario":
managedcluster_scenario_object.managedcluster_reboot_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "stop_start_klusterlet_scenario":
managedcluster_scenario_object.stop_start_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "start_klusterlet_scenario":
managedcluster_scenario_object.stop_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "stop_klusterlet_scenario":
managedcluster_scenario_object.stop_klusterlet_scenario(run_kill_count, single_managedcluster, timeout)
elif action == "managedcluster_crash_scenario":
managedcluster_scenario_object.managedcluster_crash_scenario(run_kill_count, single_managedcluster, timeout)
else:
logging.info("There is no managedcluster action that matches %s, skipping scenario" % action)

View File

@@ -16,6 +16,7 @@ import kraken.pod_scenarios.setup as pod_scenarios
import kraken.namespace_actions.common_namespace_functions as namespace_actions
import kraken.shut_down.common_shut_down_func as shut_down
import kraken.node_actions.run as nodeaction
import kraken.managedcluster_scenarios.run as managedcluster_scenarios
import kraken.kube_burner.client as kube_burner
import kraken.zone_outage.actions as zone_outages
import kraken.application_outage.actions as application_outage
@@ -240,6 +241,15 @@ def main(cfg):
wait_duration
)
# Inject managedcluster chaos scenarios specified in the config
elif scenario_type == "managedcluster_scenarios":
logging.info("Running managedcluster scenarios")
managedcluster_scenarios.run(
scenarios_list,
config,
wait_duration
)
# Inject time skew chaos scenarios specified
# in the config
elif scenario_type == "time_scenarios":

View File

@@ -0,0 +1,17 @@
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60