* Add e2e test for concurrency w/ signal
This will help make sure the big refactoring does not break
the main features.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Add podblocker test
Extends test coverage to ensure nothing breaks
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Rename "version" to "variant" in tests
Some tests do not run against different kubernetes versions, but
have different subcases/variants. Calling those "versions" is
confusing, so name them "variants" instead.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Fix Staticcheck's SA1024 (cutset with dupe chars)
This replaces Trim, which takes a cutset, with Replace.
This clarifies the intent to remove a substring.
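To illustrate the difference, here is a minimal sketch. The `docker://`
prefix, the input string and the function names are illustrative
assumptions, not taken from the kured code.

```go
package main

import (
	"fmt"
	"strings"
)

// trimCutset mimics the flagged pattern. strings.Trim treats its second
// argument as a set of characters, so the repeated '/' is redundant and
// every leading or trailing d, o, c, k, e, r, ':' or '/' is stripped.
func trimCutset(s string) string {
	return strings.Trim(s, "docker://")
}

// removeSubstring removes the first occurrence of the literal substring,
// which is what the caller actually meant.
func removeSubstring(s string) string {
	return strings.Replace(s, "docker://", "", 1)
}

func main() {
	id := "docker://decode" // hypothetical input

	fmt.Printf("%q\n", trimCutset(id))      // prints ""
	fmt.Printf("%q\n", removeSubstring(id)) // prints "decode"
}
```

With Trim, the whole value disappears because every character of
"decode" is also part of the cutset.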
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Fix Staticcheck's ST1005
According to staticcheck, error strings should not be capitalized (ST1005).
This lowercases the first letter of our error strings.
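A short sketch of why the convention helps: error strings are usually
printed after other context, so a capital letter would land mid-sentence.
The error text and function names below are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errSentinelMissing is a hypothetical error. It starts with a lower-case
// letter and has no trailing punctuation, as ST1005 asks.
var errSentinelMissing = errors.New("sentinel file not found")

// checkReboot wraps the inner error with more context, so the inner
// message ends up in the middle of the final string.
func checkReboot() error {
	return fmt.Errorf("cannot determine reboot state: %w", errSentinelMissing)
}

func main() {
	fmt.Println(checkReboot())
	// prints: cannot determine reboot state: sentinel file not found
}
```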
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Fix incorrect string prints
A few strings gradually lost all of their templating, yet are still
passed through formatting functions.
This is incorrect, and will not pass staticcheck SA1006 and S1039.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Add staticcheck in make tests
Without this, people like myself will forget to run staticcheck.
This fixes it by making it part of make tests, which will run
with all the fast tests in CI.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
---------
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Move to stable kind cluster filenames
Without this, we have to rename files at every version.
This is unnecessary: we should only change the files
and be done with it.
This is a problem because, if we move to programmatic test running,
the tests would need to be mutated at every k8s version.
With this model, we know that only the kind-cluster files
need to be modified for the tests to be automatically
adapted.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Create e2e from go tests interface
Without this, e2e tests need tons of manual work to
test locally, and the results are not easily exposed.
People are less likely to use the e2e tests if they
are tough to use outside the CI.
This commit makes it easier to run tests locally,
and ensures the CI is closer to the Makefile.
At the same time, this removes debt in the github
workflows: by switching to newer versions of kind,
we can remove the very old workaround for the
"failed to attach pid 1" issue.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Add node stays as cordoned test
Without this, it is impossible to prove that the node stays cordoned
after a reboot by kured.
This refactor also adds the test in the CI, and makes the
CI a bit simpler by using the matrix more extensively.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Use hack dir instead of .tmp
This is more idiomatic.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
---------
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
Without this, some CI jobs are flaky or slow due to the following
issues:
- Triggering a reboot causes an unrecoverable boot loop.
This fixes it by restarting the containers that are incorrectly
exited.
- The API server is down while operations happen.
This fixes it by ensuring at least one API server is up. In this
case, we don't add a reboot marker on the only API server.
- The number of nodes in a test environment is larger than
necessary.
This fixes it by requiring only two nodes to reboot.
This is enough for concurrency, and for the e2e testing.
- The wait time between operations is high, and can cause
a heartbeat to be missed in the check script.
This fixes it by checking more often, at the expense of
more logging. This is compensated by increasing the number
of tries.
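The last point can be sketched as a polling helper. This is only an
illustration of the interval/tries trade-off; the check script itself is
not shown in this log, and `poll`, the interval and the try count are
hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// poll calls check up to tries times, sleeping interval between attempts.
// It returns nil as soon as check reports true.
func poll(check func() bool, interval time.Duration, tries int) error {
	for attempt := 1; attempt <= tries; attempt++ {
		if check() {
			return nil
		}
		if attempt < tries {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("condition not met after %d tries", tries)
}

func main() {
	calls := 0
	// Stand-in for a real readiness check: succeeds on the third call.
	ready := func() bool {
		calls++
		return calls >= 3
	}

	// A shorter interval samples more often; raising tries keeps the
	// overall waiting budget roughly the same.
	err := poll(ready, 10*time.Millisecond, 30)
	fmt.Println(err, calls) // prints: <nil> 3
}
```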
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
* Add ability to have multiple nodes get a lock
Currently in kured a single node can get a lock with Acquire. There
could be situations where multiple nodes might want a lock in the event
that a cluster can handle multiple nodes being rebooted. This adds the
side-by-side implementation for a multiple node lock situation.
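A minimal in-memory sketch of the idea, assuming a lock with a fixed
number of owner slots. The type and method names are hypothetical, the
sketch is not safe for concurrent use, and it does not reflect how kured
stores or acquires its lock. A single-node lock is the case where
`maxOwners` is 1.

```go
package main

import "fmt"

// multiLock is a hypothetical lock that up to maxOwners nodes can hold
// at the same time.
type multiLock struct {
	maxOwners int
	owners    map[string]bool
}

func newMultiLock(maxOwners int) *multiLock {
	return &multiLock{maxOwners: maxOwners, owners: make(map[string]bool)}
}

// acquire reports whether nodeID holds the lock after the call.
func (l *multiLock) acquire(nodeID string) bool {
	if l.owners[nodeID] {
		return true // already an owner
	}
	if len(l.owners) >= l.maxOwners {
		return false // no free slot
	}
	l.owners[nodeID] = true
	return true
}

// release frees the slot held by nodeID, if any.
func (l *multiLock) release(nodeID string) {
	delete(l.owners, nodeID)
}

func main() {
	lock := newMultiLock(2)
	fmt.Println(lock.acquire("node-a")) // true
	fmt.Println(lock.acquire("node-b")) // true
	fmt.Println(lock.acquire("node-c")) // false: both slots are taken
	lock.release("node-a")
	fmt.Println(lock.acquire("node-c")) // true
}
```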
Signed-off-by: Thomas Stringer <thomas@trstringer.com>
* Refactor to use the same code path for a single lock and a multilock
Signed-off-by: Thomas Stringer <thomas@trstringer.com>
* test: force rebuild
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: log pod-logs
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: change condition
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: fix test-script
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: add concurrent test
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: final changes
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
---------
Signed-off-by: Thomas Stringer <thomas@trstringer.com>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Co-authored-by: Christian Kotzbauer <git@ckotzbauer.de>
Without this, we can't know if the exposed prometheus metrics
behave properly.
This is a problem, as the only way we can evaluate success
(right now) is whether kured compiles.
While this is a good start, it doesn't cover what we
claim to offer: a boolean showing if a reboot is required.
This fixes it by creating a new github action workflow testing
if the float64 gauge is properly showing 0 for no reboot, 1 for reboot.
This is done by exposing the metrics endpoint through a node port.
A helm chart change was required to have the ability to expose
the service on a node port. We connect to the kind node through
docker in the `tests/test-metrics.sh`, where we curl the nodeport,
extract the only relevant metric, and compare it to the expected result.
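The extract-and-compare step can be sketched in Go as follows. This is
not the content of `tests/test-metrics.sh`. The metric name
`kured_reboot_required`, the `node` label and the `gaugeValue` helper are
assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue scans Prometheus text exposition output and returns the value
// of the first sample whose metric name equals name.
func gaugeValue(body, name string) (float64, error) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		metric, rest := line, ""
		if i := strings.IndexByte(line, '{'); i >= 0 {
			// Labels present: the value follows the closing brace.
			j := strings.LastIndexByte(line, '}')
			if j < i {
				continue
			}
			metric, rest = line[:i], line[j+1:]
		} else if i := strings.IndexAny(line, " \t"); i >= 0 {
			metric, rest = line[:i], line[i:]
		}
		if metric != name {
			continue
		}
		fields := strings.Fields(rest) // value, then an optional timestamp
		if len(fields) == 0 {
			continue
		}
		return strconv.ParseFloat(fields[0], 64)
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("metric %q not found", name)
}

func main() {
	// Hypothetical scrape output, as curl might return it.
	body := `# TYPE kured_reboot_required gauge
kured_reboot_required{node="kind-worker"} 1
`
	v, err := gaugeValue(body, "kured_reboot_required")
	fmt.Println(v, err) // prints: 1 <nil>
}
```

The test would then compare the returned value with 0 or 1, depending on
whether a reboot is expected.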
When a failure happens and the cluster doesn't manage to
come back up in time, we exit 1 and don't show docker logs.
This is a problem, as we would benefit from detailed docker
output in those cases when debugging.
This fixes it by ensuring the logging is always done at the
exit of the script.
Without this patch, we don't test on release whether kured actually
works and behaves well.
This is a problem, as a functional issue could have been hidden by
a recent change, as our testing is minimalist (only test the
usability, not the functionality).
Instead of testing manually, we should ensure this in CI.
This fixes it by adding a github action which tests the previously
built artifacts before publishing a release. The job consumes the helm
chart in our code tree (note: this relies on the last released image),
and runs a functional test triggering a coordinated restart of a
whole 5-node cluster deployed with kind, through github actions.
Note: The github action needs to reset docker configuration, else
the reboot of the node (a docker container in kind) will fail.
It will be correctly triggered, but the node will not come back up,
with its systemd log mentioning: "Failed to attach 1 to compat systemd cgroup".