kured

mirror of https://github.com/kubereboot/kured.git synced 2026-05-13 03:47:10 +00:00

Author	SHA1	Message	Date
Jean-Philippe Evrard	455b3df0dc	improve tests (#1021) * Add e2e test concurrency w/ signal This will help make sure the big refactoring does not break the main features. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Add podblocker test Extends test coverage to ensure nothing breaks Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Rename "version" with "variant" in tests For tests not running in different kubernetes versions, but have different tests subcases/variants, rephrase the wording "versions" as it is confusing. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Fix Staticcheck's SA1024 (subset with dupe chars) This will replace trim, taking a cutset, with Replace. This clarifies the intent to remove a substring. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Fix Staticcheck's ST1005 According to staticcheck, Error strings should not be capitalized (ST1005). This changes the cases for our errors. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Fix incorrect string prints A few strings have evolved to eventually remove all the templating part of their strings, yet kept the formatting features. This is incorrect, and will not pass staticcheck SA1006 and S1039. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Add staticcheck in make tests Without this, people like myself will forget to run staticcheck. This fixes it by making it part of make tests, which will run with all the fast tests in CI. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> --------- Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>	2025-01-09 14:42:28 -08:00
Jean-Philippe Evrard	608abc6e89	Increase CI coverage and provide new dev tool (#982) * Move to stable kind cluster filenames Without this, we have to rename files at every version. This is really unnecessary, we should only change the files and be done with it. This is a problem, as if we move to programmatic test running, the tests would need to be mutated at every k8s version. With this model, we know that only the kind-cluster files need to be modified for the tests to be automatically adapted. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Create e2e from go tests interface Without this, e2e tests need tons of manual work to test locally, and the results are not easily exposed. People are less likely to use the e2e tests if they are tough to use outside the CI. This commit makes it easier to run tests locally, and ensures the CI is closer to the Makefile. At the same time, this removes debt in the github workflows: By switching to newer versions of kind, we can remove the very old workaround for the failed to attach pid 1. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Add node stays as cordoned test Without this, impossible to prove that the node stays as cordoned after a reboot by kured. This refactor also adds the test in the CI, and makes sure the CI is a bit simpler, by using matrix more extensively. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> * Use hack dir instead of .tmp This is more idiomatic. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> --------- Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>	2024-10-15 13:16:45 -07:00
Jean-Philippe Evrard	a02ae67559	Accelerate CI jobs Without this, some CI jobs are flaky or slow due to the following issues: - Triggering a reboot causes an unrecoverable boot loop. This fixes it by restarting the containers that are incorrectly exited. - API server is down while operations happen. This fixes it by ensuring at least one API server is up. In this case, we don't add a reboot marker on the unique api server. - The number of nodes in a test environment is larger than necessary. This fixes it by ensuring two nodes are required to reboot. This is enough for concurrency, and for the e2e testing. - The wait time between operations is high, and can cause a heartbeat to be missed in the check script. This fixes it by checking more often, at the expense of more logging. This is compensated by increasing the number of tries. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>	2024-10-02 23:30:41 +02:00
Thomas Stringer	3b9b190422	Add multiple concurrent node reboot feature (#660) * Add ability to have multiple nodes get a lock Currently in kured a single node can get a lock with Acquire. There could be situations where multiple nodes might want a lock in the event that a cluster can handle multiple nodes being rebooted. This adds the side-by-side implementation for a multiple node lock situation. Signed-off-by: Thomas Stringer <thomas@trstringer.com> * Refactor to use the same code path for a single lock and a multilock Signed-off-by: Thomas Stringer <thomas@trstringer.com> * test: force rebuild Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> * build: log pod-logs Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> * fix: change condition Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> * build: fix test-script Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> * build: add concurrent test Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> * fix: final changes Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> --------- Signed-off-by: Thomas Stringer <thomas@trstringer.com> Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de> Co-authored-by: Christian Kotzbauer <git@ckotzbauer.de>	2023-08-14 18:33:18 +02:00
Jean-Philippe Evrard	240a669727	Add prometheus export metrics functional testing Without this, we can't know if the exposed prometheus metrics behave properly. This is a problem, as the only way we can evaluate the success (right now), is a compilation success or failure from kured. While this is a good start, it doesn't translate to what we claim to offer: A boolean showing if a reboot is required. This fixes it by creating a new github action workflow testing if the float64 gauge is properly showing 0 for no reboot, 1 for reboot. This is done by exposing the metrics endpoint through a node port. A helm chart change was required to have the ability to expose the service on a node port. We connect to the kind node through docker in the `tests/test-metrics.sh`, where we curl the nodeport, extract the only relevant metric, and compare it to the expected result.	2021-04-13 16:17:42 +02:00
Daniel Holbach	de4e9a9bd9	Merge pull request #249 from evrardjp/produce-more-logs-for-stopped-containers Add more logs into gates	2020-11-27 13:49:17 +01:00
Jean-Philippe Evrard	81ee206a87	Add more logs into gates This will be necessary to find out why some docker containers fail to come back up in github actions.	2020-11-27 13:31:20 +01:00
Jean-Philippe Evrard	1165cfe6f4	Fix shellcheck issue Without this, shellcheck will complain about double quotes missing.	2020-11-27 12:12:39 +01:00
Jean-Philippe Evrard	67ea5922f4	Improve coordinated reboot output When a failure is happening and the cluster doesn't manage to be back up on time, we exit 1, and don't show docker logs. This is a problem, as we would benefit from a detailed docker output on those cases, when debugging. This fixes it by ensuring the logging is always done at the exit of the script.	2020-11-27 10:59:14 +01:00
Jean-Philippe Evrard	3d75f1b37a	Add smoke/basic functional test Without this patch, we don't test on release whether kured actually works and behaves well. This is a problem, as a functional issue could have been hidden by a recent change, as our testing is minimalist (only test the usability, not the functionality). Instead of testing manually, we should ensure this in CI. This fixes it by adding a github action which tests the previously built artifacts before publishing a release. The job consumes the helm chart in our code tree (note: this relies on the last released image), and runs a functional test triggering a coordinated restart of a whole 5 node cluster deployed with kind, through github actions. Note: The github action needs to reset docker configuration, else the reboot of the node (a docker container in kind) will fail. It will be correctly triggered, but the node will not come back up, with its systemd log mentioning: "Failed to attach 1 to compat systemd cgroup".	2020-08-28 09:25:44 +02:00

10 Commits
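
Commit 455b3df0dc replaces a cutset-based trim with Replace to satisfy Staticcheck's SA1024. The sketch below illustrates why that matters. It is not kured's code: the function names and the "systemd-resolved" input are hypothetical, chosen only because the cutset "systemd-" contains a duplicated character, which is the pattern SA1024 flags.

```go
package main

import (
	"fmt"
	"strings"
)

// trimCutset is the pattern SA1024 warns about (hypothetical example).
// strings.Trim treats its second argument as a SET of characters and
// strips any of them from both ends, not the literal substring.
func trimCutset(s string) string {
	return strings.Trim(s, "systemd-")
}

// removeSubstring states the intent directly: remove the first
// occurrence of the literal substring "systemd-".
func removeSubstring(s string) string {
	return strings.Replace(s, "systemd-", "", 1)
}

func main() {
	in := "systemd-resolved"
	fmt.Println(trimCutset(in))      // prints "resolv": trailing 'e' and 'd' are in the cutset
	fmt.Println(removeSubstring(in)) // prints "resolved"
}
```

When the substring is known to sit at the start of the string, `strings.TrimPrefix` is another standard-library option that expresses the same intent.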