73 Commits

Author SHA1 Message Date
Naga Ravi Chaitanya Elluri
487a9f464c Deprecate long term metrics collection
This will be added back soon via native prometheus integration.

Signed-off-by: Naga Ravi Chaitanya Elluri <nelluri@redhat.com>
2024-01-10 15:08:58 -05:00
Tullio Sebastiani
f2d7f88cb8 Krkn lib prometheus client + kube_burner references removed
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
2024-01-09 10:43:32 -05:00
Paige Rubendall
b03511850b taking out more litmus references 2023-12-03 13:10:52 +05:30
Tullio Sebastiani
7a966a71d0 krkn integration of telemetry events collection (#523)
* function package refactoring in krkn-lib

* cluster events collection flag

* krkn-lib version bump

requirements

* dockerfile bump
2023-10-31 14:31:33 -04:00
Naga Ravi Chaitanya Elluri
43d891afd3 Bump telemetry archive default size to 500MB
This commit also removes litmus configs as they are not maintained.
2023-10-30 12:50:04 -04:00
Tullio Sebastiani
724068a978 Chaos recommender refactoring (#516)
* basic structure working

* config and options refactoring

nits and changes

* removed unused function with typo + fixed duration

* removed unused arguments

* minor fixes
2023-10-30 15:51:09 +01:00
Paige Rubendall
f7f1b2dfb0 Service disruption (#494)
* adding service disruption

* fixing kil services

* service log changes

* remvoing extra logging

* adding daemon set

* adding service disruption name changes

* cerberus config back

* bad string
2023-10-06 12:51:10 -04:00
Tullio Sebastiani
61356fd70b Added log telemetry piece to Krkn (#500)
* config

* log collection and upload

dictionary key fix

* escape regex in config.yaml

* bump krkn-lib version

* updated funtest github cli command

* update krkn-lib version to 1.3.2

* fixed requirements.txt
2023-10-06 10:08:46 -04:00
Tullio Sebastiani
f6f686e8fe fixed io-hog scenario 2023-09-13 09:57:00 -04:00
Sahil Shah
585d519687 Adding Prometheus Disruption Scenario (#484) 2023-09-11 11:18:29 -04:00
yogananth-subramanian
e40fedcd44 Update etcd metrics 2023-09-08 11:11:42 -04:00
Tullio Sebastiani
cee5259fd3 arcaflow scenarios removed from config.yaml 2023-08-23 08:50:19 -04:00
pratyusha
d2d80be241 Updated config.yaml file with more scenarios (#468) 2023-08-21 11:26:33 -04:00
Tullio Sebastiani
39c0152b7b Krkn telemetry integration (#435)
* adapted config.yaml to the new feature

* temporarly pointing requirement.txt to the lib feature branch

* run_kraken.py + arcaflow scenarios refactoring


typo

* plugin scenario

* node scenarios


return failed scenarios

* container scenarios


fix

* time scenarios

* cluster shutdown  scenarios

* namespace scenarios

* zone outage scenarios

* app outage scenarios

* pvc scenarios

* network chaos scenarios

* run_kraken.py adaptation to telemetry

* prometheus telemetry upload + config.yaml


some fixes


typos and logs


max retries in config


telemetry id with run_uuid


safe_logger

* catch send_telemetry exception

* scenario collection bug fixes

* telemetry enabled check

* telemetry run tag

* requirements pointing to main + archive_size

* requirements.txt and config.yaml update

* added telemetry config to common config

* fixed scenario array elements for telemetry
2023-08-10 14:42:53 -04:00
jtydlack
491dc17267 Slo via http (#459)
* Fix typo

* Enable loading SLO profile via URL (#438)
2023-08-10 11:02:33 -04:00
yogananth-subramanian
b2b5002f45 Pod egress network shapping Chaos scenario
The scenario introduces network latency, packet loss, and bandwidth restriction in the Pod's network interface.
The purpose of this scenario is to observe faults caused by random variations in the network.

Below example config applies egress traffic shaping to openshift console.
````
- id: pod_egress_shaping
  config:
    namespace: openshift-console   # Required - Namespace of the pod to which filter need to be applied.
    label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
    network_params:
        latency: 500ms             # Add 500ms latency to egress traffic from the pod.
````
2023-08-08 11:45:03 -04:00
Naga Ravi Chaitanya Elluri
de0567b067 Tweak the etcd alert severity 2023-06-16 09:19:17 -04:00
Naga Ravi Chaitanya Elluri
ce409ea6fb Update kube-burner dependency version to 1.7.0 2023-06-15 11:55:17 -04:00
Naga Ravi Chaitanya Elluri
0eb8d38596 Expand SLOs profile to cover monitoring for more alerts
This commit:
- Also sets appropriate severity to avoid false failures for the
  test cases especially given that theses are monitored during the chaos
  vs post chaos. Critical alerts are all monitored post chaos with few
  monitored during the chaos that represent overall health and performance
  of the service.
- Renames Alerts to SLOs validation

Metrics reference: f09a492b13/cmd/kube-burner/ocp-config/alerts.yml
2023-06-14 16:58:36 -04:00
Tullio Sebastiani
72b46f8393 temporarly removed io-hog scenario (#433)
* temporarly removed io-hog scenario

* removed litmus documentation & config
2023-06-05 11:03:44 -04:00
Naga Ravi Chaitanya Elluri
9858f96c78 Change the severity of the etcd leader election check to warning
This is the first step towards the goal to only have metrics tracking
the overall health and performance of the component/cluster. For instance,
for etcd disruption scenarios, leader elections are expected, we should instead
track etcd leader availability and fsync latency under critical catergory vs leader
elections.
2023-05-31 11:50:20 -04:00
yogananth-subramanian
8806781a4f Pod network outage Chaos scenario
Pod network outage chaos scenario blocks traffic at pod level irrespective of the network policy used.
With the current network policies, it is not possible to explicitly block ports which are enabled
by allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules
to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.

Below example config blocks access to openshift console.
````
- id: pod_network_outage
  config:
    namespace: openshift-console
    direction:
        - ingress
    ingress_ports:
        - 8443
    label_selector: 'component=ui'
````
2023-05-15 10:43:58 -04:00
Tullio Sebastiani
83b811bee4 Arcaflow stress-ng hogs with parallelism support (#418)
* kubeconfig management for arcaflow + hogs scenario refactoring  

  * kubeconfig authentication parsing refactored to support arcaflow kubernetes deployer  
  * reimplemented all the hog scenarios to allow multiple parallel containers of the same scenarios 
  (eg. to stress two or more nodes in the same run simultaneously) 
  * updated documentation 
* removed sysbench scenarios


* recovered cpu hogs


* updated requirements.txt


* updated config.yaml

* added gitleaks file for test fixtures

* imported sys and logging

* removed config_arcaflow.yaml

* updated readme

* refactored arcaflow documentation entrypoint
2023-05-15 09:45:16 -04:00
Paige Rubendall
16ea18c718 Ibm plugin node scenario (#417)
* Node scenarios for ibmcloud

* adding openshift check info
2023-05-09 12:07:38 -04:00
Naga Ravi Chaitanya Elluri
bc863fa01f Add support to check for critical alerts
This commit enables users to opt in to check for critical alerts firing
in the cluster post chaos at the end of each scenario. Chaos scenario is
considered as failed if the cluster is unhealthy in which case user can
start debugging to fix and harden respective areas.

Fixes https://github.com/redhat-chaos/krkn/issues/410
2023-05-03 16:14:13 -04:00
Tullio Sebastiani
3627b5ba88 cpu hog scenario + basic arcaflow documentation (#391)
typo


typo


updated documentation


fixed workflow map issue
2023-03-15 16:52:20 +01:00
Tullio Sebastiani
fee4f7d2bf arcaflow integration (#384)
arcaflow library version

Co-authored-by: Tullio Sebastiani <tsebasti@redhat.com>
2023-03-08 12:01:03 +01:00
José Castillo Lema
493a8a245f Docker provider for node actions (#369)
* Docker provider for node actions

* Adjusted dependencies and imports

* Update config_kind.yaml

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>

Signed-off-by: José Castillo Lema <josecastillolema@gmail.com>
2023-01-10 14:36:18 -05:00
Naga Ravi Chaitanya Elluri
6b17dbdbb3 Allow users to set the listening address
This commit provides an option for the user to set the listening address
for the signal. This also fixes a security vulnerability.

Fixes https://github.com/redhat-chaos/krkn/issues/307
2022-11-08 15:59:57 -05:00
Sandro Bonazzola
0c36903fff config: really default to ~ instead of /root
Documentation says we default to ~ for looking up the kubernetes config
but then we set everywhere /root. Fixed the config to really look for ~.

Should solve #327.

Signed-off-by: Sandro Bonazzola <sbonazzo@redhat.com>
2022-09-13 12:01:16 +02:00
Shreyas Anantha Ramaprasad
9421a0c2c2 Added support for ingress traffic shaping (#299)
* Added plugin for ingress network traffic shaping

* Documentation changes

* Minor changes

* Documentation and formatting fixes

* Added trap to sleep infinity command running in containers

* Removed shell injection threat for modprobe commands

* Added docstrings to cerberus functions

* Added checks to prevent shell injection

* Bug fix
2022-09-02 07:54:11 +02:00
Naga Ravi Chaitanya Elluri
6c75d3dddb Add option to skip litmus installation
This commit adds an option for the user to pick whether to install
litmus or not depending on their use case. One use case is disconnected
environments where litmus is pre-installed insted of reaching out to the
internet.
2022-08-23 14:09:10 -04:00
Shreyas Anantha Ramaprasad
08deae63dd Added VMware Node Scenarios (#285)
* Added VMware node scenarios

* Made vmware plugin independent of Krkn

* Revert changes made to node status watch

* Fixed minor documentation changes
2022-08-15 23:35:16 +02:00
Janos Bonic
ccd902565e Fixes #265: Replace Powerfulseal and introduce Wolkenwalze SDK for plugin system 2022-08-02 16:25:03 +01:00
Naga Ravi Chaitanya Elluri
9208f39e06 Add support to run on Kubernetes
This commit:
- Leverages distribution flag in the config set by the user to skip
  things not supported on OpenShift to be able to run scenarios on
  Kubernetes.
- Adds sample config and scenario files that work on Kubernetes.
2022-06-01 07:27:06 -05:00
Adolfo Aguirrezabal
3adf5847b2 Add option to avoid litmus uninstall before chaos run (#242)
* Adds option to avoid litmus uninstall before chaos run

* Add new option to the config files
2022-05-05 09:02:25 -04:00
Steven Barre
3691bba5af Update Litmus version in config_performance.yaml
v1.10 uses a different case for `Experimentstatus` than v1.13 and thus will always fail.
2022-03-21 09:06:39 -04:00
yogananth-subramanian
50dd9873c1 Node egress traffic shaping
Patch adds a scenario to create variations in egress traffic of a Node's interface using the tc and Netem.
2021-12-16 12:54:53 -05:00
Alejandro Gullón
baa812b7f0 Added new scenario to fill up a given volumen (#182)
* Added new scenario to fill up a given volumen

* fixing small issues and style

* adding PVC as input param instead of pod name

* small fix

* get container name and volumen name
replace oc with kubectl commands

* adding yaml file to create a pv, pvc and pod to run pvc_scenario

* adding support to match both string for describe command when looking for pod_name

* added support to find the pvc from a given pod

* small fix

* small fix
2021-11-24 12:18:49 -05:00
Naga Ravi Chaitanya Elluri
674eb74a75 Expose setting the signal in the config
This commit enables users to start Kraken to act as listener by setting
the signal to PAUSE in the config to get the cluster to a desired test or
run any setup before injecting chaos by setting the signal to RUN. This
helps in cases where we have test cases that need to coordinate the chaos
at a desired time depending on the state of the cluster/test run.
2021-10-26 09:05:25 -04:00
Paige Rubendall
6b865fc573 Adding server set up for kraken 2021-10-25 08:58:46 -04:00
Naga Ravi Chaitanya Elluri
970cd061f4 Set the location of cerberus config to match entrypoint
Entrypoint for reference - https://github.com/cloud-bulldozer/cerberus/blob/master/containers/Dockerfile#L23.
2021-10-08 09:25:14 -04:00
Naga Ravi Chaitanya Elluri
cdf3bc03d2 Add support to block traffic to an application
This commit enables users to simulate a downtime of an application
by blocking the traffic for the specified duration to see how
it/other components communicating with it behave in case of downtime.
2021-10-01 10:13:40 -04:00
Paige Rubendall
22df024312 adding validation that namespace becomes active 2021-09-28 09:58:55 -04:00
Naga Ravi Chaitanya Elluri
036e51a6b1 Delete litmus crd's during the cleanup
This commit will ensure that the litmus resources installed on the
cluster get cleaned up and also creates the chaosengine in the
specified namespace.
2021-09-16 16:30:21 -04:00
Paige Rubendall
a9056ddf43 adding litmus logging 2021-09-08 17:11:49 -04:00
Naga Ravi Chaitanya Elluri
5da0b259c5 Run all the litmus resources in a single namespace
- This eases the usage and debuggability by running the fault injection pods in
  the same namespace as other resources of litmus. This will also ease the
  deletion process and ensure that there are no leftover objects on the cluster.

- This commit also enables users to use the same rbac template for all the litmus
  scenarios without having to pull in a specic one for each of the scenarios.
2021-09-08 16:37:07 -04:00
Naga Ravi Chaitanya Elluri
68a32666cd Update litmus docs with supported scenarios 2021-09-01 16:41:22 -04:00
prubenda
9b0bcdbf0e Adding node memory hog scenario 2021-08-20 14:02:00 -04:00
Naga Ravi Chaitanya Elluri
6456eec76a Add zone outage scenarios
This commit adds support to create zone outage in AWS by denying both
ingress and egress traffic to the instances belonging to a particular
subnet belonging to the zone by tweaking the network acl. This creates
an outage of all the nodes in the zone - both master and workers.
2021-08-17 11:43:13 -04:00