94 Commits

Author SHA1 Message Date
Daniel Kvist
b108aa4d2d Support json logformatter
This commit introduces a new flag '--log-format' that allows a user
to configure json logging on the pods. If the log-format
is not specified, the formatter will default to the existing
text formatter.
2021-10-25 14:38:53 +02:00
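The flag handling described in this commit can be sketched roughly as follows. This is a minimal illustration using only the Go standard library; the `formatter` type, `newFormatter` and the two hand-rolled output formats are hypothetical stand-ins, not the formatters kured actually uses. Rejecting unknown values is also an assumption, since the commit only states that an unspecified format falls back to text.

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
)

// formatter turns a level and a message into one log line.
// Hypothetical type for illustration only.
type formatter func(level, msg string) string

// newFormatter maps the --log-format value to a formatter.
// An empty value or "text" selects the text formatter (the default).
func newFormatter(name string) (formatter, error) {
	switch name {
	case "", "text":
		return func(level, msg string) string {
			return fmt.Sprintf("level=%s msg=%q", level, msg)
		}, nil
	case "json":
		return func(level, msg string) string {
			// Marshalling a map of strings cannot fail, so the error is dropped.
			b, _ := json.Marshal(map[string]string{"level": level, "msg": msg})
			return string(b)
		}, nil
	default:
		// Assumption: unknown formats are rejected rather than ignored.
		return nil, fmt.Errorf("unsupported log format %q", name)
	}
}

func main() {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	logFormat := fs.String("log-format", "text", "log output format: text or json")
	if err := fs.Parse([]string{"--log-format=json"}); err != nil {
		panic(err)
	}
	f, err := newFormatter(*logFormat)
	if err != nil {
		panic(err)
	}
	fmt.Println(f("info", "reboot required"))
}
```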
Jack
3c2508050d fix: don't use nil context in drain helper 2021-09-27 12:43:20 -07:00
Cameron McAvoy
cee15cfc32 Add force-reboot and drain timeouts to chart config and ds 2021-09-15 10:42:50 -05:00
Daniel Holbach
0955403470 Merge pull request #429 from weaveworks/alpine-3.14
build: updated to alpine@3.14
2021-08-30 10:54:35 +02:00
Christian Kotzbauer
9473f831be build: updated to alpine@3.14
Signed-off-by: Christian Kotzbauer <christian.kotzbauer@gmail.com>
2021-08-25 20:19:03 +02:00
Andres Morey
3c5eb968d3 Add reboot-delay command line argument
Currently, kured issues the system reboot command immediately after
kubectl drain finishes.

This is a problem for processes that need extra time to finish but aren't
running on pods and therefore aren't controlled by kubectl drain (e.g.
de-registering nodes from external load balancers).

This patch solves the problem by introducing a `reboot-delay` command
line argument that can be used to add a delay after kubectl drain
finishes but before the reboot command is issued.
2021-08-03 16:48:25 +03:00
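The ordering this commit describes (drain, then wait, then reboot) can be sketched as below. The sketch is illustrative only: `rebootNode`, `trace` and the injected hooks are hypothetical names, and aborting when the drain fails is an assumption made for the example, not something the commit message states.

```go
package main

import (
	"fmt"
	"time"
)

// rebootNode runs drain, waits for delay, then runs reboot.
// The three hooks are injected so the order can be observed
// without touching a real node.
func rebootNode(drain func() error, delay time.Duration, sleep func(time.Duration), reboot func() error) error {
	if err := drain(); err != nil {
		// Assumption for this sketch: a failed drain aborts the reboot.
		return fmt.Errorf("drain failed, not rebooting: %w", err)
	}
	if delay > 0 {
		// Gives processes outside of pods time to finish their work.
		sleep(delay)
	}
	return reboot()
}

// trace runs rebootNode with fake hooks and returns the steps in order.
func trace(delay time.Duration) []string {
	var steps []string
	_ = rebootNode(
		func() error { steps = append(steps, "drain"); return nil },
		delay,
		func(d time.Duration) { steps = append(steps, "wait "+d.String()) },
		func() error { steps = append(steps, "reboot"); return nil },
	)
	return steps
}

func main() {
	fmt.Println(trace(90 * time.Second)) // delay configured
	fmt.Println(trace(0))                // no delay: reboot follows drain directly
}
```

Passing the sleep function as a parameter keeps the sketch testable: a fake can record the requested duration instead of blocking.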
Matt Jeanes
6af3f1abc1 Add --alert-firing-only parameter to only consider firing alerts 2021-07-27 11:23:10 +01:00
SimeonPoot
c7d5810503 Restructuring Prometheus client, added unit-tests to regex-queries active alerts (#386)
* prometheus labels incl tests

* enable label in main, add log, docs

* revert the option to query by label

* revert the option to query by label

* PromClient instantiate by func,white space removal

* revert whitespace fix for readability.

* revert removal of newlines for readability

* rename New to NewPromClient to improve readability

Co-authored-by: simp <simp@saxobank.com>
2021-07-27 07:09:46 +02:00
Danny Kulchinsky
c826d73695 fix slack deprecation notice 2021-05-28 13:52:01 -04:00
Jean-Philippe Evrard
79f22cee67 Merge branch 'main' into release-lock-delay 2021-04-14 09:48:28 +02:00
Steffen Pingel
f7b3de36a6 Add parameter for delaying release of lock
This supports throttling of reboots across the cluster
and allows rebooted nodes to reschedule pods, e.g.
to synchronize replicated state before rebooting the next node.
2021-04-13 10:14:14 +02:00
Cameron McAvoy
25dcf3cb12 Expose SkipWaitForDeleteTimeoutSeconds and explicitly return when cordoning fails 2021-04-08 09:52:15 -05:00
Cameron McAvoy
5a86ef40e8 Update the default drain timeout to be infinite 2021-04-07 17:17:33 -05:00
Cameron McAvoy
2400f34cc0 Don't panic if the cordon fails and force-reboot is true 2021-04-07 14:58:21 -05:00
Cameron McAvoy
8db5650510 Refactor force-drain to be a drain-timeout in general 2021-04-07 12:57:01 -05:00
Cameron McAvoy
65292983f2 Add force-reboot after force-timeout duration has been exceeded 2021-04-07 09:39:01 -05:00
Jean-Philippe Evrard
4d45fa8bdb Fix invoke reboot for custom commands
Without this patch, the rebootCommand passed to invokeReboot is
ignored, and the command used for reboot is always systemctl reboot.

This is a problem, as we are aiming for flexible commands for this
release.

This fixes it by restoring the previous behaviour before commit
[1] happened.

[1]: 694957d56e
2021-04-02 09:15:59 +02:00
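The bug described here is a function that ignores the command it was handed. A minimal sketch of the corrected shape, assuming the command arrives as a slice of strings, is shown below; `buildRebootCmd` is a hypothetical helper rather than kured's `invokeReboot`, and the example command line is made up. The command is only constructed, never executed.

```go
package main

import (
	"fmt"
	"os/exec"
)

// buildRebootCmd builds the process from the arguments it is given,
// instead of falling back to a hard-coded reboot command.
func buildRebootCmd(rebootCommand []string) (*exec.Cmd, error) {
	if len(rebootCommand) == 0 {
		return nil, fmt.Errorf("empty reboot command")
	}
	return exec.Command(rebootCommand[0], rebootCommand[1:]...), nil
}

func main() {
	// Example command line chosen for illustration only.
	cmd, err := buildRebootCmd([]string{"/sbin/shutdown", "-r", "now"})
	if err != nil {
		panic(err)
	}
	// Inspect the command without running it.
	fmt.Println(cmd.Args)
}
```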
atighineanu
694957d56e Implement universal notification mechanism
This patch gives the possibility to send notifications
across different technologies. Also, this patch makes
slack-hook-url, slack-username and slack-channel
deprecated (informed by a warning).
Also, updated the documentation (Readme).
2021-03-29 11:26:18 +02:00
Jean-Philippe Evrard
5930d733f8 Fix the Fatal calls using formatting
Without this, go test will rightfully fail.

This is a problem, as we don't have go test enabled, but we want
to have this in the future.

This should fix it.
2021-03-29 09:50:56 +02:00
Jean-Philippe Evrard
fd63e9a74b Add flexible commands parameters
Without this patch, you cannot configure the reboot
command to use, or use another command to trigger
a reboot.

This is a problem, as multiple users have asked for
it in the past, and we are lacking flexibility.

This fixes it by introducing two new parameters,
- one to provide a custom reboot command.
  This should help people running kured on
  non systemd OS
- one to provide a custom sentinel command.
  This should help people running non Ubuntu OS,
  as they can directly use their command instead of
  generating a file (useful for CentOS/SUSE)

For this, several refactors had to be done, to
remove global state in some functions. Making those
functions closer to "pure functions" helps us
increase our test coverage here and later.

As commandReboot was very close to rebootCommand,
the function to reboot the node has been renamed
to invokeReboot.
2021-03-29 09:50:56 +02:00
Jean-Philippe Evrard
837bd4eb2a Refactor reboot blocks
Without this patch, we rely on global state in many functions for
which we check the reboot blockers.

This is a problem, as it's harder to test.

This patch fixes it by refactoring the reboot blockers. This also
includes a first series of unit tests for our main.
2021-03-29 09:50:56 +02:00
Jean-Philippe Evrard
15c57927c8 Update the deprecated DeleteLocalData
DeleteLocalData was deprecated for users of kubectl in 0.20 [1].
At the same time as the deprecation, the relevant code was also
removed [2] without warning: the DeleteLocalData field of the helper
structure was simply renamed to DeleteEmptyDirData, without shims
on the exposed pkg.

This is a problem, as it completely breaks kured.

This should fix it, by using the new field name.

[1]: 56ea9621b7
[2]: 56ea9621b7 (diff-041bdcdedca650a38a8d82cf15ab6f3665b7b84a0fb44a8bb5dcdc5cd944c63d)
2021-03-22 14:28:17 +01:00
Daniel Holbach
f6ada05c5d Merge pull request #320 from dholbach/alpine-3.13
update to alpine 3.13
2021-03-10 08:50:42 +01:00
Daniel Holbach
355813de30 update to alpine 3.13
Signed-off-by: Daniel Holbach <daniel@weave.works>
2021-03-10 08:10:36 +01:00
Daniel Holbach
250b9bad05 Merge pull request #296 from jackfrancis/node-annotations
add node annotations to identify kured reboot operations
2021-03-09 10:14:46 +01:00
Jack Francis
baf83408b8 add node annotations
adds a new --annotate-nodes daemonset runtime argument, which does the following when enabled:

- adds a new node annotation "weave.works/kured-most-recent-reboot-needed" with a value of the current RFC3339 timestamp as soon as kured identifies that a node needs to be rebooted
- adds a new node annotation "weave.works/kured-reboot-in-progress" with a value of the current RFC3339 timestamp as soon as kured identifies that a node needs to be rebooted
- removes the annotation "weave.works/kured-reboot-in-progress" when kured has successfully rebooted the node
2021-03-08 17:22:47 -08:00
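The annotation lifecycle listed in this commit can be sketched with a plain map standing in for a node's annotations. The two annotation keys come from the commit message; the helper names (`stamp`, `markRebootNeeded`, `markRebootDone`) and the choice to format the timestamp in UTC are assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

const (
	annotationRebootNeeded     = "weave.works/kured-most-recent-reboot-needed"
	annotationRebootInProgress = "weave.works/kured-reboot-in-progress"
)

// stamp formats a Unix time as an RFC3339 timestamp (UTC is an assumption).
func stamp(unixSeconds int64) string {
	return time.Unix(unixSeconds, 0).UTC().Format(time.RFC3339)
}

// markRebootNeeded sets both annotations to the given timestamp.
func markRebootNeeded(annotations map[string]string, ts string) {
	annotations[annotationRebootNeeded] = ts
	annotations[annotationRebootInProgress] = ts
}

// markRebootDone removes only the in-progress annotation; the
// "most recent reboot needed" annotation stays as a record.
func markRebootDone(annotations map[string]string) {
	delete(annotations, annotationRebootInProgress)
}

func main() {
	node := map[string]string{}
	markRebootNeeded(node, stamp(time.Now().Unix()))
	fmt.Println(node)
	markRebootDone(node)
	fmt.Println(node)
}
```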
Jack Francis
93c8242b89 always drain before reboot
This changes the pre-reboot drain functionality so that it always runs, regardless of the value of the Unschedulable node property.

Because kubectl drain is idempotent, we shouldn't have to worry about whether the node has already been set to Unschedulable (perhaps due to a prior, unsuccessful loop of the kured reboot cycle): we can run it over and over again. And because this drain func actually does a cordon + drain (and it only performs the drain if a cordon is successful), we can be sure that we aren't going to be thrashing this node w/ respect to scheduled pods.

This also fixes an edge case: if the node has been marked Unschedulable out-of-band, but workloads remain Running on this node, kured will no longer reboot the node's underlying VM/machine while it is actively running pods.
2021-03-08 17:20:31 -08:00
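The reasoning in this commit (drain is idempotent, and eviction only happens after a successful cordon) can be illustrated with a toy model. Everything here is hypothetical: `node`, `cordon`, `failingCordon` and `drain` are simplified stand-ins and do not call the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy model of a node's scheduling state.
type node struct {
	unschedulable bool
	pods          []string
}

// cordon marks the node unschedulable; repeating it changes nothing.
func cordon(n *node) error {
	n.unschedulable = true
	return nil
}

// failingCordon simulates an API error during cordon.
func failingCordon(n *node) error {
	return errors.New("api unavailable")
}

// drain always cordons first and evicts pods only if that succeeded,
// regardless of whether the node was already unschedulable.
func drain(n *node, cordonFn func(*node) error) error {
	if err := cordonFn(n); err != nil {
		return fmt.Errorf("cordon failed, skipping eviction: %w", err)
	}
	n.pods = nil
	return nil
}

func main() {
	// Edge case from the commit: cordoned out-of-band, pods still running.
	n := &node{unschedulable: true, pods: []string{"web-0"}}
	fmt.Println(drain(n, cordon), len(n.pods))
	fmt.Println(drain(n, cordon), len(n.pods)) // second run is a no-op

	m := &node{pods: []string{"db-0"}}
	fmt.Println(drain(m, failingCordon), len(m.pods)) // pods untouched
}
```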
Daniel Holbach
fade706cbf Merge pull request #250 from damoon/19-PreferNoSchedule
implement issue-19 add prefer no schedule taint to avoid double draining of pods
2021-01-12 14:28:23 +01:00
David Sauer
5a4e197d27 change taint config to be disabled by default 2021-01-11 18:24:17 +01:00
David Sauer
3a35d6a46c remove taint in case the reboot is not needed anymore 2021-01-06 22:21:41 +01:00
David Sauer
34446f949e Allow to disable tainting during pending node reboot by setting the taint name to an empty string. 2021-01-06 21:39:32 +01:00
David Sauer
e4c684c3af taint node with PreferNoSchedule to prevent receiving (and double draining) additional pods from other rebooting nodes 2021-01-06 21:23:40 +01:00
David Sauer
204a06ca38 fixed call of log.Fatal instead of log.Fatalf 2021-01-06 21:23:40 +01:00
David Sauer
48897eb0ab avoid indentations to ease readability 2021-01-06 21:23:40 +01:00
Jean-Philippe Evrard
897834a9db Temporarily workaround alpine issue
Until a new alpine image is created, we should ensure the latest
packages are used, and therefore we should upgrade default
installed packages.

Without this patch, we'll have outdated and vulnerable packages
until a new 3.12 image is released.

This is a problem, as we'll publish broken images.

This should temporarily workaround it, at the expense of larger
images (contains package cache)
2020-12-14 11:20:27 +01:00
Daniel Jimenez Garcia
51cab0dedc rename message template parameters so they are not related to slack 2020-11-25 16:20:54 +00:00
Daniel Jimenez Garcia
f059cec794 GH-125, add additional parameters to override the drain/reboot slack messages 2020-11-25 16:19:31 +00:00
Bryan Boreham
1ba3acab98 Drain: allow pods grace period to terminate
The default of 0 is taken as "delete immediately", which is
not appropriate.
2020-11-23 18:07:56 +00:00
Daniel Holbach
aa49cfd8c4 Merge pull request #215 from evrardjp/make-lint-happier
Make go lint on cmd folder happier
2020-11-09 11:49:51 +01:00
Bryan Boreham
4c31184422 Merge pull request #213 from mvisonneau/lock_ttl
Replaced --annotationTTL with --lockTTL and fixed bug
2020-11-06 11:31:19 +00:00
Jean-Philippe Evrard
7091debe23 Make lint happier
Without this, golint is complaining about a few cosmetic changes.
This solves it, and is necessary if we want to add a lint test
in CI.
2020-11-05 10:14:39 +01:00
Jean-Philippe Evrard
ce6075c800 Remove prom-active-alerts
Prom-active-alerts command is not used, not tested, and
currently broken. Let's remove it.
2020-11-05 10:13:50 +01:00
Maxime VISONNEAU
9648d1d759 Replaced --annotationTTL with --lockTTL and made it work correctly 2020-10-30 10:39:18 +00:00
Jean-Philippe Evrard
e5a2d4acc7 Refactor drain/uncordon
Moving the drainer object close to its usage is more readable.
2020-10-29 11:45:20 +01:00
Jean-Philippe Evrard
72c4112e20 Use kubectl as library instead of calling from cli 2020-10-15 13:02:35 +02:00
Jean-Philippe Evrard
b0bd603931 fix: Follow DKL-DI-0004 guideline
Without this patch, we need to build a cache and then remove it.
Since apk can work with no-cache and won't leave artifacts,
we should use it.

This will make the dockle best practices scanner happier.
2020-09-11 16:53:59 +02:00
Daniel Holbach
3ebc224958 update alpine to 3.12, k8s 1.18.8 2020-08-28 10:27:39 +02:00
Daniel Holbach
16109017ce Prepare for k8s release 1.19 (Aug 25)
This is #152, #139, #127 in disguise.

Maybe this time let it simmer a bit longer until the k8s
release is there?
2020-08-19 17:30:00 +02:00
Daniel Holbach
8fafad18bb Revert #139
This is a follow-up to #150, so we can get a 1.4.x release
out that will be geared towards k8s 1.1[6-8].

Update to latest 1.17 kubectl: 1.17.7.
2020-06-26 17:30:01 +02:00
Bryan Boreham
ec75533394 Merge pull request #119 from michalschott/annotationTTL
Adding --annotation-ttl for automatic unlock
2020-05-20 11:30:44 +01:00