Commit Graph

10 Commits

Author SHA1 Message Date
Jean-Philippe Evrard
455b3df0dc improve tests (#1021)
* Add e2e test concurrency w/ signal

This will help make sure the big refactoring does not break
the main features.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Add podblocker test

Extends test coverage to ensure nothing breaks

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Rename "version" to "variant" in tests

Some tests do not run against different Kubernetes versions,
but do have different subcases/variants. Replace the wording
"versions" there, as it is confusing.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Fix Staticcheck's SA1024 (cutset with duplicate characters)

This replaces Trim, which takes a cutset, with Replace.

This clarifies the intent to remove a substring.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
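
To illustrate why a cutset-based trim is the wrong tool for removing a substring, here is a minimal Go sketch. The "id://" prefix and both helper names are hypothetical and are not taken from the kured code base.

```go
package main

import (
	"fmt"
	"strings"
)

// stripWithTrim shows the pitfall: the second argument of strings.TrimLeft
// is a set of characters, not a prefix. The "id://" prefix is hypothetical.
func stripWithTrim(s string) string {
	return strings.TrimLeft(s, "id://")
}

// stripWithReplace removes the first occurrence of the literal substring.
func stripWithReplace(s string) string {
	return strings.Replace(s, "id://", "", 1)
}

func main() {
	fmt.Println(stripWithTrim("id://dead"))    // prints "ead": the leading 'd' is in the cutset
	fmt.Println(stripWithReplace("id://dead")) // prints "dead"
}
```

When the substring is known to be a prefix, strings.TrimPrefix would state the intent even more directly.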

* Fix Staticcheck's ST1005

According to staticcheck, error strings should not be capitalized (ST1005).

This changes the case of our error strings.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
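
The usual motivation for ST1005 is that error strings get embedded in longer messages when errors are wrapped. A minimal Go sketch follows; the error text and the acquire helper are hypothetical and are not kured's actual errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errLockHeld starts with a lowercase letter and has no trailing punctuation,
// so it reads naturally when embedded in a longer message.
var errLockHeld = errors.New("lock already held")

// acquire is a hypothetical helper that always fails, to show the wrapping.
func acquire(node string) error {
	return fmt.Errorf("acquiring lock for node %s: %w", node, errLockHeld)
}

func main() {
	fmt.Println(acquire("worker-1"))
	// prints "acquiring lock for node worker-1: lock already held"
}
```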

* Fix incorrect string prints

A few strings have lost all of their templating over time,
yet are still passed through the formatting functions.

This is incorrect, and will not pass staticcheck SA1006 and S1039.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Add staticcheck in make tests

Without this, people like myself will forget to run staticcheck.

This fixes it by making it part of make tests, which will run
with all the fast tests in CI.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

---------

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
2025-01-09 14:42:28 -08:00
Jean-Philippe Evrard
608abc6e89 Increase CI coverage and provide new dev tool (#982)
* Move to stable kind cluster filenames

Without this, we have to rename files at every version.
This is unnecessary: we should only change the content of
the files and be done with it.

This is a problem, because if we move to programmatic test running,
the tests would need to be mutated at every k8s version.

With this model, we know that only the kind-cluster files
need to be modified for the tests to be automatically
adapted.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Create e2e from go tests interface

Without this, e2e tests need tons of manual work to
test locally, and the results are not easily exposed.

People are less likely to use the e2e tests if they
are tough to use outside the CI.

This commit makes it easier to run tests locally,
and ensures the CI is closer to the Makefile.

At the same time, this removes debt in the github
workflows: by switching to newer versions of kind,
we can remove the very old workaround for the
"failed to attach pid 1" error.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Add node stays as cordoned test

Without this, it is impossible to prove that the node stays cordoned
after a reboot by kured.

This refactor also adds the test in the CI, and makes sure the
CI is a bit simpler, by using matrix more extensively.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

* Use hack dir instead of .tmp

This is more idiomatic.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

---------

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
2024-10-15 13:16:45 -07:00
Jean-Philippe Evrard
a02ae67559 Accelerate CI jobs
Without this, some CI jobs are flaky or slow due to the following
issues:
- Triggering a reboot causes an unrecoverable boot loop.
  This fixes it by restarting the containers that have incorrectly
  exited.
- The API server is down while operations happen.
  This fixes it by ensuring at least one API server is up. In this
  case, we don't add a reboot marker on the only API server.
- The number of nodes in a test environment is larger than
  necessary.
  This fixes it by ensuring two nodes are required to reboot.
  This is enough for concurrency, and for the e2e testing.
- The wait time between operations is high, and can cause
  a heartbeat to be missed in the check script.
  This fixes it by checking more often, at the expense of
  more logging. This is compensated by increasing the number
  of tries.
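
The last point, a shorter interval compensated by more tries, can be sketched in Go as below. This is an illustration only: the check script in the repository is a shell script, and waitFor, its parameters, and the values in main are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls check up to tries times, sleeping interval between attempts.
// It returns the 1-based attempt on which check succeeded, or 0 if it never did.
// The overall time budget is roughly interval * (tries - 1).
func waitFor(check func() bool, interval time.Duration, tries int) int {
	for attempt := 1; attempt <= tries; attempt++ {
		if check() {
			return attempt
		}
		if attempt < tries {
			time.Sleep(interval)
		}
	}
	return 0
}

func main() {
	calls := 0
	// Hypothetical readiness check that succeeds on the third call.
	ready := func() bool { calls++; return calls >= 3 }
	fmt.Println(waitFor(ready, 10*time.Millisecond, 60)) // prints 3
}
```

Halving the interval while doubling the tries keeps the total budget roughly constant but makes a short-lived state much harder to miss.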

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
2024-10-02 23:30:41 +02:00
Thomas Stringer
3b9b190422 Add multiple concurrent node reboot feature (#660)
* Add ability to have multiple nodes get a lock

Currently in kured a single node can get a lock with Acquire. There
could be situations where multiple nodes might want a lock in the event
that a cluster can handle multiple nodes being rebooted. This adds the
side-by-side implementation for a multiple node lock situation.

Signed-off-by: Thomas Stringer <thomas@trstringer.com>
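
As a rough illustration of the idea, here is an in-memory Go sketch of a lock that several nodes may hold at once. The multiLock type and its methods are hypothetical, the sketch is not safe for concurrent use, and it is not the implementation from this commit.

```go
package main

import "fmt"

// multiLock is a hypothetical lock that up to maxOwners nodes may hold
// at the same time.
type multiLock struct {
	maxOwners int
	owners    map[string]struct{}
}

func newMultiLock(maxOwners int) *multiLock {
	return &multiLock{maxOwners: maxOwners, owners: make(map[string]struct{})}
}

// Acquire reports whether node holds the lock after the call.
func (l *multiLock) Acquire(node string) bool {
	if _, held := l.owners[node]; held {
		return true
	}
	if len(l.owners) >= l.maxOwners {
		return false
	}
	l.owners[node] = struct{}{}
	return true
}

// Release frees the slot held by node, if any.
func (l *multiLock) Release(node string) {
	delete(l.owners, node)
}

func main() {
	l := newMultiLock(2)
	fmt.Println(l.Acquire("node-a"), l.Acquire("node-b"), l.Acquire("node-c"))
	// prints "true true false"
}
```

With maxOwners set to 1, the same type behaves like a single-owner lock, which is one way a single code path could serve both cases.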

* Refactor to use the same code path for a single lock and a multilock

Signed-off-by: Thomas Stringer <thomas@trstringer.com>

* test: force rebuild

Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>

* build: log pod-logs

Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>

* fix: change condition

Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>

* build: fix test-script

Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>

* build: add concurrent test

Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>

* fix: final changes

Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>

---------

Signed-off-by: Thomas Stringer <thomas@trstringer.com>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Co-authored-by: Christian Kotzbauer <git@ckotzbauer.de>
2023-08-14 18:33:18 +02:00
Jean-Philippe Evrard
240a669727 Add prometheus export metrics functional testing
Without this, we can't know if the exposed prometheus metrics
behave properly.

This is a problem, as the only way we can currently evaluate
success is whether kured compiles or not.
While this is a good start, it doesn't translate to what we
claim to offer: a boolean showing if a reboot is required.

This fixes it by creating a new github action workflow testing
if the float64 gauge is properly showing 0 for no reboot, 1 for reboot.
This is done by exposing the metrics endpoint through a node port.
A helm chart change was required to have the ability to expose
the service on a node port. We connect to the kind node through
docker in the `tests/test-metrics.sh`, where we curl the nodeport,
extract the only relevant metric, and compare it to the expected result.
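
The extract-and-compare step can be sketched in Go as follows. The test in the repository is the shell script named above; this is only an illustration. The metric name kured_reboot_required, the node label, and the gaugeValue helper are assumptions, and the parser assumes sample lines carry no timestamp and no spaces inside label values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue returns the value of the first sample of the metric called name
// in a Prometheus text exposition body. It assumes there is no timestamp
// after the value and no space inside label values.
func gaugeValue(body, name string) (float64, bool) {
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// Skip metrics that merely share the prefix, such as name + "_total".
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		fields := strings.Fields(rest)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Hypothetical scrape output; the metric name is an assumption.
	body := "# TYPE kured_reboot_required gauge\n" +
		"kured_reboot_required{node=\"kind-worker\"} 1\n"
	v, ok := gaugeValue(body, "kured_reboot_required")
	fmt.Println(v, ok) // prints "1 true"
}
```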
2021-04-13 16:17:42 +02:00
Daniel Holbach
de4e9a9bd9 Merge pull request #249 from evrardjp/produce-more-logs-for-stopped-containers
Add more logs into gates
2020-11-27 13:49:17 +01:00
Jean-Philippe Evrard
81ee206a87 Add more logs into gates
This will be necessary to find out why some docker containers fail
to come back up in github actions.
2020-11-27 13:31:20 +01:00
Jean-Philippe Evrard
1165cfe6f4 Fix shellcheck issue
Without this, shellcheck will complain about missing
double quotes.
2020-11-27 12:12:39 +01:00
Jean-Philippe Evrard
67ea5922f4 Improve coordinated reboot output
When a failure happens and the cluster doesn't manage to
be back up in time, we exit 1, and don't show docker logs.

This is a problem, as we would benefit from a detailed docker
output in those cases, when debugging.

This fixes it by ensuring the logging is always done at the
exit of the script.
2020-11-27 10:59:14 +01:00
Jean-Philippe Evrard
3d75f1b37a Add smoke/basic functional test
Without this patch, we don't test on release whether kured actually
works and behaves well.

This is a problem, as a functional issue could have been hidden by
a recent change, as our testing is minimalist (only test the
usability, not the functionality).
Instead of testing manually, we should ensure this in CI.

This fixes it by adding a github action which tests the previously
built artifacts before publishing a release. The job consumes the helm
chart in our code tree (note: this relies on the last released image),
and runs a functional test triggering a coordinated restart of a
whole 5-node cluster deployed with kind, through github actions.

Note: The github action needs to reset docker configuration, else
the reboot of the node (a docker container in kind) will fail.
It will be correctly triggered, but the node will not come back up,
with its systemd log mentioning: "Failed to attach 1 to compat systemd cgroup".
2020-08-28 09:25:44 +02:00