mirror of https://github.com/krkn-chaos/krkn.git
synced 2026-02-18 12:00:19 +00:00

Compare commits (3): 667798d588, 0c30d89a1b, 2ba20fa483
@@ -13,13 +13,26 @@ Supported Cloud Providers:

**NOTE**: For clusters with AWS make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) is installed and properly [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html) using an AWS account.
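Krkn's AWS node actions use boto3, which resolves the same credential chain as the AWS CLI. A minimal sketch to confirm the configured account is visible before running scenarios (assumes boto3 is installed; not part of the krkn docs):

```python
import boto3

# boto3 reads the same credentials as the AWS CLI:
# ~/.aws/credentials, environment variables, or an instance profile.
session = boto3.session.Session()
identity = session.client("sts").get_caller_identity()
print("Authenticated as:", identity["Arn"])
print("Default region:", session.region_name)
```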
## GCP

**NOTE**: For clusters with GCP make sure [GCP CLI](https://cloud.google.com/sdk/docs/install#linux) is installed.

A Google service account is required to give proper authentication to GCP for node actions. See [here](https://cloud.google.com/docs/authentication/getting-started) for how to create a service account.

To set up Application Default Credentials (ADC) for use by the Cloud Client Libraries, you can provide either service account credentials or the credentials associated with your user account:

**NOTE**: A user with the 'resourcemanager.projects.setIamPolicy' permission is required to grant project-level permissions to the service account.

- Using service account credentials:

  After creating the service account, enable it with the following: ```export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"```

- Using the credentials associated with your user account:

  1. Make sure that the [GCP CLI](https://cloud.google.com/sdk/docs/install#linux) is installed and [initialized](https://cloud.google.com/sdk/docs/initializing) by running:

     ```gcloud init```

  2. Create local authentication credentials for your user account:

     ```gcloud auth application-default login```
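Either path ends with Application Default Credentials that the Cloud Client Libraries discover automatically. A minimal sketch to verify ADC resolves before running node actions (assumes google-auth is installed, as in krkn's requirements):

```python
import google.auth

# google.auth.default() walks the ADC search order:
# GOOGLE_APPLICATION_CREDENTIALS first, then gcloud user credentials.
credentials, project_id = google.auth.default()
print("ADC resolved for project:", project_id)
```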
## Openstack

@@ -32,6 +45,7 @@ After creating the service account you will need to enable the account using the
To properly run, the service principal requires the “Azure Active Directory Graph/Application.ReadWrite.OwnedBy” API permission and the “User Access Administrator” role.

Before running you will need to set the following:

1. ```export AZURE_SUBSCRIPTION_ID=<subscription_id>```

2. ```export AZURE_TENANT_ID=<tenant_id>```
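The scenario code presumably reads these values from the environment at run time. A minimal sketch of that lookup, failing fast when a variable is missing (only the two variables shown above are assumed here):

```python
import os

# Fail fast if the required Azure settings were not exported.
for var in ("AZURE_SUBSCRIPTION_ID", "AZURE_TENANT_ID"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set; export it before running krkn")
    print(f"{var} is set")
```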
@@ -66,9 +80,10 @@ Set the following environment variables

These are the credentials that you would normally use to access the vSphere client.

## IBMCloud

If no API key is set up with proper VPC resource permissions, use the following to create it (a client-setup sketch follows the list):

* Access group
* Service ID with the following access
  * With policy **VPC Infrastructure Services**
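Once the service ID's API key exists, the VPC client can authenticate with it. A minimal sketch using the ibm_vpc and ibm_cloud_sdk_core packages from krkn's requirements (the IBMC_APIKEY variable name is illustrative, not mandated by krkn):

```python
import os

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_vpc import VpcV1

# The environment variable name here is an assumption for illustration.
authenticator = IAMAuthenticator(os.environ["IBMC_APIKEY"])
service = VpcV1(authenticator=authenticator)

# List instances to confirm the key has VPC Infrastructure Services access.
instances = service.list_instances().get_result()["instances"]
print("Visible VPC instances:", [i["name"] for i in instances])
```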
@@ -18,7 +18,7 @@ network_chaos: # Scenario to create an outage
```

##### Sample scenario config for ingress traffic shaping (using a plugin)

```
- id: network_chaos
  config:
    node_interface_name:  # Dictionary with key as node name(s) and value as a list of its interfaces to test
@@ -35,7 +35,7 @@ network_chaos: # Scenario to create an outage
      bandwidth: 10mbit
    wait_duration: 120
    test_duration: 60
```
Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
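
For context on what the plugin automates: ingress shaping on Linux is done by redirecting a real interface's inbound traffic to an IFB device and shaping it there as egress. A hedged sketch of the underlying tc plumbing, driven from Python (interface names are illustrative; this is not the plugin's exact implementation):

```python
import subprocess

def shape_ingress(interface="ens5", ifb="ifb0", bandwidth="10mbit"):
    commands = [
        # Create the IFB device (the scenario deletes it when done).
        f"ip link add {ifb} type ifb",
        f"ip link set dev {ifb} up",
        # Redirect all ingress traffic from the real interface to the IFB.
        f"tc qdisc add dev {interface} handle ffff: ingress",
        f"tc filter add dev {interface} parent ffff: protocol ip u32 "
        f"match u32 0 0 action mirred egress redirect dev {ifb}",
        # Egress shaping on the IFB is effectively ingress shaping.
        f"tc qdisc add dev {ifb} root tbf rate {bandwidth} burst 32kbit latency 400ms",
    ]
    for cmd in commands:
        subprocess.run(cmd.split(), check=True)
```
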
@@ -4,7 +4,7 @@ The following node chaos scenarios are supported:

1. **node_start_scenario**: Scenario to start the node instance.
2. **node_stop_scenario**: Scenario to stop the node instance.
3. **node_stop_start_scenario**: Scenario to stop the node instance for a specified duration and then start it. Not supported on VMware.
4. **node_termination_scenario**: Scenario to terminate the node instance.
5. **node_reboot_scenario**: Scenario to reboot the node instance.
6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
@@ -12,6 +12,7 @@ The following node chaos scenarios are supported:
8. **restart_kubelet_scenario**: Scenario to restart the kubelet of the node instance.
9. **node_crash_scenario**: Scenario to crash the node instance.
10. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
11. **node_disk_detach_attach_scenario**: Scenario to detach a node disk for a specified duration and then attach it back.

**NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
@@ -20,6 +21,8 @@ The following node chaos scenarios are supported:
, node_reboot_scenario and stop_start_kubelet_scenario are supported on AWS, Azure, OpenStack, BareMetal, GCP, VMware and Alibaba.

**NOTE**: node_disk_detach_attach_scenario is supported only on AWS and cannot detach the root disk.
#### AWS
@@ -36,6 +36,20 @@ class abstract_node_scenarios:

        self.helper_node_start_scenario(instance_kill_count, node, timeout)
        logging.info("helper_node_stop_start_scenario has been successfully injected!")

    # Node scenario to detach and attach the disk
    def node_disk_detach_attach_scenario(self, instance_kill_count, node, timeout, duration):
        logging.info("Starting disk_detach_attach_scenario injection")
        disk_attachment_details = self.get_disk_attachment_info(instance_kill_count, node)
        if disk_attachment_details:
            self.disk_detach_scenario(instance_kill_count, node, timeout)
            logging.info("Waiting for %s seconds before attaching the disk" % (duration))
            time.sleep(duration)
            self.disk_attach_scenario(instance_kill_count, disk_attachment_details, timeout)
            logging.info("node_disk_detach_attach_scenario has been successfully injected!")
        else:
            logging.error("Node %s has only root disk attached" % (node))
            logging.error("node_disk_detach_attach_scenario failed!")

    # Node scenario to terminate the node
    def node_termination_scenario(self, instance_kill_count, node, timeout):
        pass
@@ -12,7 +12,8 @@ from krkn_lib.k8s import KrknKubernetes

class AWS:
    def __init__(self):
        self.boto_client = boto3.client("ec2")
        self.boto_resource = boto3.resource("ec2")
        self.boto_instance = self.boto_resource.Instance("id")

    # Get the instance ID of the node
    def get_instance_id(self, node):
@@ -179,6 +180,72 @@ class AWS:

        raise RuntimeError()

    # Detach volumes
    def detach_volumes(self, volumes_ids: list):
        for volume in volumes_ids:
            try:
                self.boto_client.detach_volume(VolumeId=volume, Force=True)
            except Exception as e:
                logging.error(
                    "Detaching volume %s failed with exception: %s"
                    % (volume, e)
                )

    # Attach volume
    def attach_volume(self, attachment: dict):
        try:
            if self.get_volume_state(attachment["VolumeId"]) == "in-use":
                logging.info(
                    "Volume %s is already in use." % attachment["VolumeId"]
                )
                return
            logging.info(
                "Attaching the %s volume to instance %s."
                % (attachment["VolumeId"], attachment["InstanceId"])
            )
            self.boto_client.attach_volume(
                InstanceId=attachment["InstanceId"],
                Device=attachment["Device"],
                VolumeId=attachment["VolumeId"]
            )
        except Exception as e:
            logging.error(
                "Failed attaching disk %s to the %s instance. "
                "Encountered following exception: %s"
                % (attachment['VolumeId'], attachment['InstanceId'], e)
            )
            raise RuntimeError()

    # Get IDs of the node's non-root volumes
    def get_volumes_ids(self, instance_id: list):
        response = self.boto_client.describe_instances(InstanceIds=instance_id)
        instance_attachment_details = response["Reservations"][0]["Instances"][0]["BlockDeviceMappings"]
        root_volume_device_name = self.get_root_volume_id(instance_id)
        volume_ids = []
        for device in instance_attachment_details:
            if device["DeviceName"] != root_volume_device_name:
                volume_id = device["Ebs"]["VolumeId"]
                volume_ids.append(volume_id)
        return volume_ids

    # Get volume attachment details
    def get_volume_attachment_details(self, volume_ids: list):
        response = self.boto_client.describe_volumes(VolumeIds=volume_ids)
        volumes_details = response["Volumes"]
        return volumes_details

    # Get the root volume device name
    def get_root_volume_id(self, instance_id):
        instance_id = instance_id[0]
        instance = self.boto_resource.Instance(instance_id)
        root_volume_id = instance.root_device_name
        return root_volume_id

    # Get volume state
    def get_volume_state(self, volume_id: str):
        volume = self.boto_resource.Volume(volume_id)
        state = volume.state
        return state
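
# A hedged usage sketch (not upstream code): the detach/attach scenario below
# chains the helpers above roughly like this, where "<node>" is a placeholder
# Kubernetes node name:
#
#   aws = AWS()
#   instance_id = aws.get_instance_id("<node>").split()
#   volume_ids = aws.get_volumes_ids(instance_id)           # non-root volumes only
#   details = aws.get_volume_attachment_details(volume_ids)
#   aws.detach_volumes(volume_ids)
#   for volume in details:
#       aws.attach_volume(volume["Attachments"][0])         # reattach from saved info
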
# krkn_lib
class aws_node_scenarios(abstract_node_scenarios):
@@ -290,3 +357,49 @@ class aws_node_scenarios(abstract_node_scenarios):

                logging.error("node_reboot_scenario injection failed!")

                raise RuntimeError()

    # Get volume attachment info
    def get_disk_attachment_info(self, instance_kill_count, node):
        for _ in range(instance_kill_count):
            try:
                logging.info("Obtaining disk attachment information")
                instance_id = (self.aws.get_instance_id(node)).split()
                volumes_ids = self.aws.get_volumes_ids(instance_id)
                if volumes_ids:
                    vol_attachment_details = self.aws.get_volume_attachment_details(
                        volumes_ids
                    )
                    return vol_attachment_details
                return
            except Exception as e:
                logging.error(
                    "Failed to obtain disk attachment information of %s node. "
                    "Encountered following exception: %s." % (node, e)
                )
                raise RuntimeError()

    # Node scenario to detach the volumes
    def disk_detach_scenario(self, instance_kill_count, node, timeout):
        for _ in range(instance_kill_count):
            try:
                logging.info("Starting disk_detach_scenario injection")
                instance_id = (self.aws.get_instance_id(node)).split()
                volumes_ids = self.aws.get_volumes_ids(instance_id)
                logging.info(
                    "Detaching the %s volumes from instance %s "
                    % (volumes_ids, node)
                )
                self.aws.detach_volumes(volumes_ids)
            except Exception as e:
                logging.error(
                    "Failed to detach disk from %s node. Encountered following "
                    "exception: %s." % (node, e)
                )
                logging.debug("")
                raise RuntimeError()

    # Node scenario to attach the volumes back
    def disk_attach_scenario(self, instance_kill_count, attachment_details, timeout):
        for _ in range(instance_kill_count):
            for attachment in attachment_details:
                self.aws.attach_volume(attachment["Attachments"][0])
@@ -1,66 +1,78 @@

import os
import sys
import time
import logging
import json
import google.auth
import krkn.scenario_plugins.node_actions.common_node_functions as nodeaction
from krkn.scenario_plugins.node_actions.abstract_node_scenarios import (
    abstract_node_scenarios,
)
from google.cloud import compute_v1
from krkn_lib.k8s import KrknKubernetes


class GCP:
    def __init__(self):
        try:
            _, self.project_id = google.auth.default()
            self.instance_client = compute_v1.InstancesClient()
        except Exception as e:
            logging.error("Error on setting up GCP connection: " + str(e))

            raise e

    # Get the instance of the node
    def get_node_instance(self, node):
        try:
            request = compute_v1.AggregatedListInstancesRequest(
                project=self.project_id
            )
            agg_list = self.instance_client.aggregated_list(request=request)
            for _, response in agg_list:
                if response.instances:
                    for instance in response.instances:
                        if instance.name in node:
                            return instance
            logging.info("no instances ")
        except Exception as e:
            logging.error("Error getting the instance of the node: " + str(e))

            raise e

    # Get the instance name
    def get_instance_name(self, instance):
        if instance.name:
            return instance.name

    # Get the instance zone
    def get_instance_zone(self, instance):
        if instance.zone:
            return instance.zone.split("/")[-1]

    # Get the instance zone of the node
    def get_node_instance_zone(self, node):
        instance = self.get_node_instance(node)
        if instance:
            return self.get_instance_zone(instance)

    # Get the instance name of the node
    def get_node_instance_name(self, node):
        instance = self.get_node_instance(node)
        if instance:
            return self.get_instance_name(instance)

    # Get the instance ID of the node
    def get_instance_id(self, node):
        return self.get_node_instance_name(node)

    # Start the node instance
    def start_instances(self, instance_id):
        try:
            request = compute_v1.StartInstanceRequest(
                instance=instance_id,
                project=self.project_id,
                zone=self.get_node_instance_zone(instance_id),
            )
            self.instance_client.start(request=request)
            logging.info("Instance: " + str(instance_id) + " started")
        except Exception as e:
            logging.error(
                "Failed to start node instance %s. Encountered following "
@@ -70,12 +82,15 @@ class GCP:

            raise RuntimeError()

    # Stop the node instance
    def stop_instances(self, instance_id):
        try:
            request = compute_v1.StopInstanceRequest(
                instance=instance_id,
                project=self.project_id,
                zone=self.get_node_instance_zone(instance_id),
            )
            self.instance_client.stop(request=request)
            logging.info("Instance: " + str(instance_id) + " stopped")
        except Exception as e:
            logging.error(
                "Failed to stop node instance %s. Encountered following "
@@ -84,13 +99,16 @@ class GCP:

            raise RuntimeError()

    # Suspend the node instance
    def suspend_instances(self, instance_id):
        try:
            request = compute_v1.SuspendInstanceRequest(
                instance=instance_id,
                project=self.project_id,
                zone=self.get_node_instance_zone(instance_id),
            )
            self.instance_client.suspend(request=request)
            logging.info("Instance: " + str(instance_id) + " suspended")
        except Exception as e:
            logging.error(
                "Failed to suspend node instance %s. Encountered following "
@@ -100,49 +118,65 @@ class GCP:

            raise RuntimeError()

    # Terminate the node instance
    def terminate_instances(self, instance_id):
        try:
            request = compute_v1.DeleteInstanceRequest(
                instance=instance_id,
                project=self.project_id,
                zone=self.get_node_instance_zone(instance_id),
            )
            self.instance_client.delete(request=request)
            logging.info("Instance: " + str(instance_id) + " terminated")
        except Exception as e:
            logging.error(
                "Failed to terminate node instance %s. Encountered following "
                "exception: %s." % (instance_id, e)
            )

            raise RuntimeError()

    # Reboot the node instance
    def reboot_instances(self, instance_id):
        try:
            request = compute_v1.ResetInstanceRequest(
                instance=instance_id,
                project=self.project_id,
                zone=self.get_node_instance_zone(instance_id),
            )
            self.instance_client.reset(request=request)
            logging.info("Instance: " + str(instance_id) + " rebooted")
        except Exception as e:
            logging.error(
                "Failed to reboot node instance %s. Encountered following "
                "exception: %s." % (instance_id, e)
            )

            raise RuntimeError()

    # Get instance status
    def get_instance_status(self, instance_id, expected_status, timeout):
        # states: PROVISIONING, STAGING, RUNNING, STOPPING, SUSPENDING, SUSPENDED, REPAIRING,
        # and TERMINATED.
        i = 0
        sleeper = 5
        while i <= timeout:
            try:
                request = compute_v1.GetInstanceRequest(
                    instance=instance_id,
                    project=self.project_id,
                    zone=self.get_node_instance_zone(instance_id),
                )
                instance_status = self.instance_client.get(request=request).status
                logging.info("Status of instance " + str(instance_id) + ": " + instance_status)
            except Exception as e:
                logging.error(
                    "Failed to get status of instance %s. Encountered following "
                    "exception: %s." % (instance_id, e)
                )

                raise RuntimeError()

            if instance_status == expected_status:
                return True
            time.sleep(sleeper)
            i += sleeper
@@ -153,33 +187,21 @@ class GCP:

        return False

    # Wait until the node instance is suspended
    def wait_until_suspended(self, instance_id, timeout):
        return self.get_instance_status(instance_id, "SUSPENDED", timeout)

    # Wait until the node instance is running
    def wait_until_running(self, instance_id, timeout):
        return self.get_instance_status(instance_id, "RUNNING", timeout)

    # Wait until the node instance is stopped
    def wait_until_stopped(self, instance_id, timeout):
        # In GCP, the next state after STOPPING is TERMINATED
        return self.get_instance_status(instance_id, "TERMINATED", timeout)

    # Wait until the node instance is terminated
    def wait_until_terminated(self, instance_id, timeout):
        return self.get_instance_status(instance_id, "TERMINATED", timeout)

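# A hedged usage sketch (not upstream code): with the zone argument gone, a
# stop/start cycle against a node named "<node>" chains the helpers above as:
#
#   gcp = GCP()
#   instance_id = gcp.get_instance_id("<node>")
#   gcp.stop_instances(instance_id)
#   gcp.wait_until_stopped(instance_id, timeout=300)
#   gcp.start_instances(instance_id)
#   gcp.wait_until_running(instance_id, timeout=300)
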
# krkn_lib
@@ -193,12 +215,13 @@ class gcp_node_scenarios(abstract_node_scenarios):

        for _ in range(instance_kill_count):
            try:
                logging.info("Starting node_start_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
                logging.info(
                    "Starting the node %s with instance ID: %s " % (node, instance_id)
                )
                self.gcp.start_instances(instance_id)
                self.gcp.wait_until_running(instance_id, timeout)
                nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
                logging.info(
                    "Node with instance ID: %s is in running state" % instance_id
@@ -215,16 +238,16 @@ class gcp_node_scenarios(abstract_node_scenarios):

    # Node scenario to stop the node
    def node_stop_scenario(self, instance_kill_count, node, timeout):
        for _ in range(instance_kill_count):
            try:
                logging.info("Starting node_stop_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
                logging.info(
                    "Stopping the node %s with instance ID: %s " % (node, instance_id)
                )
                self.gcp.stop_instances(instance_id)
                self.gcp.wait_until_stopped(instance_id, timeout)
                logging.info(
                    "Node with instance ID: %s is in stopped state" % instance_id
                )
@@ -243,13 +266,14 @@ class gcp_node_scenarios(abstract_node_scenarios):

        for _ in range(instance_kill_count):
            try:
                logging.info("Starting node_termination_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
                logging.info(
                    "Terminating the node %s with instance ID: %s "
                    % (node, instance_id)
                )
                self.gcp.terminate_instances(instance_id)
                self.gcp.wait_until_terminated(instance_id, timeout)
                for _ in range(timeout):
                    if node not in self.kubecli.list_nodes():
                        break
@@ -267,19 +291,20 @@ class gcp_node_scenarios(abstract_node_scenarios):

                )
                logging.error("node_termination_scenario injection failed!")

                raise RuntimeError()

    # Node scenario to reboot the node
    def node_reboot_scenario(self, instance_kill_count, node, timeout):
        for _ in range(instance_kill_count):
            try:
                logging.info("Starting node_reboot_scenario injection")
                instance = self.gcp.get_node_instance(node)
                instance_id = self.gcp.get_instance_name(instance)
                logging.info(
                    "Rebooting the node %s with instance ID: %s " % (node, instance_id)
                )
                self.gcp.reboot_instances(instance_id)
                self.gcp.wait_until_running(instance_id, timeout)
                nodeaction.wait_for_ready_status(node, timeout, self.kubecli)
                logging.info(
                    "Node with instance ID: %s has been rebooted" % instance_id
@@ -163,7 +163,7 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):

            logging.info("action" + str(action))
            # Get the scenario specifics for running action nodes
            run_kill_count = get_yaml_item_value(node_scenario, "runs", 1)
            if action in ("node_stop_start_scenario", "node_disk_detach_attach_scenario"):
                duration = get_yaml_item_value(node_scenario, "duration", 120)

            timeout = get_yaml_item_value(node_scenario, "timeout", 120)
@@ -200,6 +200,9 @@ class NodeActionsScenarioPlugin(AbstractScenarioPlugin):

                node_scenario_object.node_reboot_scenario(
                    run_kill_count, single_node, timeout
                )
            elif action == "node_disk_detach_attach_scenario":
                node_scenario_object.node_disk_detach_attach_scenario(
                    run_kill_count, single_node, timeout, duration
                )
            elif action == "stop_start_kubelet_scenario":
                node_scenario_object.stop_start_kubelet_scenario(
                    run_kill_count, single_node, timeout
@@ -11,7 +11,8 @@ coverage==7.4.1

datetime==5.4
docker==7.0.0
gitpython==3.1.41
google-api-python-client==2.116.0
google-auth==2.37.0
google-cloud-compute==1.22.0
ibm_cloud_sdk_core==3.18.0
ibm_vpc==0.20.0
jinja2==3.1.4
@@ -19,7 +20,6 @@ krkn-lib==4.0.4

lxml==5.1.0
kubernetes==28.1.0
numpy==1.26.4
pandas==2.2.0
openshift-client==1.0.21
paramiko==3.4.0
@@ -16,3 +16,10 @@ node_scenarios:

    instance_count: 1
    timeout: 120
    cloud_type: aws
  - actions:
      - node_disk_detach_attach_scenario
    node_name:
    label_selector:
    instance_count: 1
    timeout: 120
    cloud_type: aws
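For reference, this is roughly how the plugin resolves the new scenario's knobs from such a config; a minimal sketch assuming PyYAML, a local stand-in for the get_yaml_item_value helper with the dict/key/default signature used above, and an illustrative file name:

```python
import yaml

def get_yaml_item_value(cont, item, default):
    # Mirrors the helper used above: fall back to the default when absent.
    value = cont.get(item)
    return default if value is None else value

with open("node_scenario.yaml") as f:  # illustrative path
    scenarios = yaml.safe_load(f)["node_scenarios"]

for node_scenario in scenarios:
    if "node_disk_detach_attach_scenario" in node_scenario["actions"]:
        timeout = get_yaml_item_value(node_scenario, "timeout", 120)
        duration = get_yaml_item_value(node_scenario, "duration", 120)
        print("detach/attach with timeout=%s duration=%s" % (timeout, duration))
```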