https://helm.sh/docs/chart_best_practices/labels/#standard-labels
Upgrade Notes
* bump Helm chart version to v3.0.0
* shorten gitops directions
* shorten the amount of text to get to why
Users will want to know why we have decided to commit this breaking
change straightaway
* better sentence flow
* even slimmer, only support uninstall/reinstall
* better language
* fixup: it isn't kube-prometheus-stack's Smon
it's our ServiceMonitor, which has to line up with
kube-prometheus-stack's ServiceMonitor Selector labels
* remove the "updateStrategy"
Signed-off-by: Kingdon Barrett <kingdon@weave.works>
* added notification when uncordoning
when reboot & uncordoning is successful -> notification will be sent
* added uncordon message tmpl
added message template for
announcing successful uncordoning and reboot.
* added proper documentation about new flag
added readme note about new flag
* Added support for multi-arch image build
* Requested changes to multi-arch build
* Further optimizations of multi build
* multi needs QEMU for some pieces
* change main push for all platforms
* Update Dockerfile to call Makefile
* Remove manual workflow
* feat: update kubernetes dependencies
closes #525
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: update kind
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: missed kind-update
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: another kind update
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: use new toleration
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: use both tolerations
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: some debugging
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* revert [skip ci]
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Terminated pods should be excluded from the blocking a reboot as per https://github.com/weaveworks/kured/issues/227
This adds status filters to the fieldSelector in order to do that. I've not updated tests here but have successfully tested the exact same filter using kubectl
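A minimal sketch of such a filter, for illustration only. The helper name `podFieldSelector`, the node name `node-1`, and the choice of `Succeeded` and `Failed` as the excluded phases are assumptions, not taken from the patch.

```go
package main

import (
	"fmt"
	"strings"
)

// podFieldSelector builds a field selector that matches pods scheduled on
// nodeName while excluding pods in the given (terminal) phases.
// Hypothetical helper, not kured's actual function.
func podFieldSelector(nodeName string, excludedPhases ...string) string {
	parts := []string{"spec.nodeName=" + nodeName}
	for _, phase := range excludedPhases {
		parts = append(parts, "status.phase!="+phase)
	}
	return strings.Join(parts, ",")
}

func main() {
	// "node-1" is a placeholder node name.
	fmt.Println(podFieldSelector("node-1", "Succeeded", "Failed"))
}
```

The resulting string can be tried by hand with `kubectl get pods --all-namespaces --field-selector <selector>`.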
This commit introduces a new flag '--log-format' that allows a user
to configure json logging on the pods. If the log-format
is not specified, the formatter will default to the existing
text formatter.
In this PR the slack-hook-url is translated
into shoutrrr syntax. Therefore, the slack
package as well as checks for slack-hook-url in
drain and reboot functions are removed.
Also added a unit test for flagCheck(); this
function also checks the (slack) URL syntax.
Currently, kured issues the system reboot command immediately after
kubectl drain finishes.
This is a problem for processes that need extra time to finish but aren't
running on pods and therefore aren't controlled by kubectl drain (e.g.
de-registering nodes from external load balancers).
This patch solves the problem by introducing a `reboot-delay` command
line argument that can be used to add a delay after kubectl drain
finishes but before the reboot command is issued.
* prometheus labels incl tests
* enable label in main, add log, docs
* revert the option to query by label
* revert the option to query by label
* PromClient instantiate by func,white space removal
* revert whitespace fix for readability.
* revert removal of newlines for readability
* rename New to NewPromClient to improve readability
Co-authored-by: simp <simp@saxobank.com>
shoutrrr now has versioned docs, which allow linking directly to the version that matches the one you use
changes should always be backwards compatible, but not the other way around
Without this, we get multiple questions about our testing.
This should help clarify the tests and our coverage by:
- Simplifying our coverage
- Documenting better the purpose of each workflow file
- Documenting our testing and development activities better.
We are relying on master, which might break anytime (or in this
case, moved to another branch).
Instead we should rely on a stable version, and unfreeze if
necessary. Dependabot helps us maintain those releases anyway.
Without this patch, it's not clear that we added command line
arguments recently. This should expose our latest changes in the
future released manifest.
Without this change, the "Test helm chart (install) action" will
rightfully succeed when our helm chart gets installed and has
no syntax issues. However, it doesn't test if kured is properly
installed. For example, the helm chart can try to install a
yet unpublished image, and our test will succeed, as the syntax
is still valid.
This is a problem, as everything looks green, but it's not
effectively working. Our other jobs are focusing on code changes,
so they rightfully override the image tag, which is not what
we want in this "Test helm chart" action.
This fixes it by adding an extra job in the workflow, depending
on the chart testing.
Without this, we can't know if the exposed prometheus metrics
behave properly.
This is a problem, as the only way we can evaluate the success
(right now), is a compilation success or failure from kured.
While this is a good start, it doesn't translate to what we
claim to offer: A boolean showing if a reboot is required.
This fixes it by creating a new github action workflow testing
if the float64 gauge is properly showing 0 for no reboot, 1 for reboot.
This is done by exposing the metrics endpoint through a node port.
A helm chart change was required to have the ability to expose
the service on a node port. We connect to the kind node through
docker in the `tests/test-metrics.sh`, where we curl the nodeport,
extract the only relevant metric, and compare it to the expected result.
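The extraction step can be sketched in Go instead of shell. This is an illustration, not the content of `tests/test-metrics.sh`: the metric name `kured_reboot_required`, the `node` label, and the helper `gaugeValue` are assumptions, and the parser expects label values without spaces.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue scans Prometheus text exposition output and returns the value
// of the first sample whose metric name equals name.
// Assumes label values contain no spaces (true for node names).
func gaugeValue(body, name string) (float64, bool) {
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		// Strip an optional label set such as {node="node-1"}.
		metric := fields[0]
		if i := strings.Index(metric, "{"); i >= 0 {
			metric = metric[:i]
		}
		if metric != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Sample scrape output; in the real test this would come from the node port.
	body := "# TYPE kured_reboot_required gauge\n" +
		"kured_reboot_required{node=\"node-1\"} 1\n"
	if v, ok := gaugeValue(body, "kured_reboot_required"); ok {
		fmt.Println("reboot required:", v == 1)
	}
}
```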
This supports throttling of reboots across the cluster
and allows rebooted nodes to reschedule pods, e.g.
to synchronize replicated state before rebooting the next node.
Without this patch, chart-testing is using the branch named
"master" by default.
This is a problem, as we just renamed our development branch
"main" instead of "master".
This should fix it by pointing to the right branch.
- Make markdownlint happier in a couple of places.
- Rename '*-master-*' files
- Change default branches of some other projects
we rely on. They moved to 'main' as well.
- Standardise version of actions/checkout.
- Update last release in README to 1.6.1.
- Bump chart version.
Eventually closes: #252
Signed-off-by: Daniel Holbach <daniel@weave.works>
Without this patch, the rebootCommand passed to invokeReboot is
ignored, and the command used for reboot is always systemctl reboot.
This is a problem, as we are aiming for flexible commands for this
release.
This fixes it by restoring the previous behaviour before commit
[1] happened.
[1]: 694957d56e
This patch gives the possibility to send notifications
across different technologies. Also, this patch makes
slack-hook-url, slack-username and slack-channel
deprecated (informed by a warning).
Also, updated the documentation (Readme).
Without this, go test will rightfully fail.
This is a problem, as we don't have go test enabled, but we want
to have this in the future.
This should fix it.
Without this patch, you cannot configure the reboot
command to use, or use another command to trigger
a reboot.
This is a problem, as multiple users have asked for
it in the past, and we are lacking flexibility.
This fixes it by introducing two new parameters,
- one to provide a custom reboot command.
This should help people running kured on
non systemd OS
- one to provide a custom sentinel command.
This should help people running non Ubuntu OS,
as they can directly use their command instead of
generating a file (useful for CentOS/SUSE)
For this, several refactors had to be done, to
remove global state in some functions. Making those
functions closer to "pure functions" helps us
increase our test coverage here and later.
As commandReboot was very close to rebootCommand,
the function to reboot the node has been renamed
to invokeReboot.
Without this patch, we rely on global state in many functions for
which we check the reboot blockers.
This is a problem, as it's harder to test.
This patch fixes it by refactoring the reboot blockers. This also
includes a first series of unit tests for our main.
Without this patch, the version of 1.20 is taken in jobs as 1.2.
This is a problem, as it breaks all jobs, because there is no
file to provision a cluster with kubernetes 1.2 (and we shouldn't
do this!)
This fixes it by ensuring there is no mangling of the version
strings, and therefore the right file is used.
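The commit does not say where the mangling happened. One common way for `1.20` to become `1.2` is a round trip through a numeric type, which the sketch below reproduces. The helper `asNumber` is hypothetical and only illustrates why versions are safer kept as strings.

```go
package main

import (
	"fmt"
	"strconv"
)

// asNumber mimics what happens when a version is treated as a number:
// it is parsed as a float and printed back in its shortest form.
func asNumber(version string) string {
	f, err := strconv.ParseFloat(version, 64)
	if err != nil {
		return version
	}
	return strconv.FormatFloat(f, 'f', -1, 64)
}

func main() {
	fmt.Println(asNumber("1.20")) // the numeric round trip drops the trailing zero
	fmt.Println("1.20")           // kept as a string, the version survives
}
```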
DeleteLocalData was deprecated for users of kubectl in 0.20 [1].
At the same time of the deprecation, the relevant code was also
removed [2] without warning: The DeleteLocalData from the helper
structure was simply renamed DeleteEmptyDirData, without shims
on the exposed pkg.
This is a problem, as it completely breaks kured.
This should fix it, by using the new field name.
[1]:
56ea9621b7
[2]:
56ea9621b7 (diff-041bdcdedca650a38a8d82cf15ab6f3665b7b84a0fb44a8bb5dcdc5cd944c63d)
Without this patch, go.mod will lag behind for the kubernetes
packages, as it's not automatically tested by dependabot.
We should bump versions with each new minor release of kured.
This should fix it.
adds a new --annotate-nodes daemonset runtime argument, which does the following when enabled:
- adds a new node annotation "weave.works/kured-most-recent-reboot-needed" with a value of the current RFC3339 timestamp as soon as kured identifies that a node needs to be rebooted
- adds a new node annotation "weave.works/kured-reboot-in-progress" with a value of the current RFC3339 timestamp as soon as kured identifies that a node needs to be rebooted
- removes the annotation "weave.works/kured-reboot-in-progress" when kured has successfully rebooted the node
This changes the pre-reboot drain functionality so that it always runs, regardless of the value of the Unschedulable node property.
Because kubectl drain is idempotent, we shouldn't have to worry about whether the node has already been set to Unschedulable (perhaps due to a prior, unsuccessful loop of the kured reboot cycle): we can run it over and over again. And because this drain func actually does a cordon + drain (and it only performs the drain if a cordon is successful), we can be sure that we aren't going to be thrashing this node w/ respect to scheduled pods.
This also fixes an edge case: if the node has been marked Unschedulable out-of-band, but workloads remain Running on this node, kured will no longer reboot the node's underlying VM/machine while it is actively running pods.
Without this, it's possible that the helm chart documentation
contains the `image tag` version which might not be equal to
the version in the helm chart, as it's only an example.
This is confusing, so instead we should use make to edit the
application version everywhere.
This fixes it by updating the Makefile to modify text of the
chart's README using a regex looking for something similar to
a version; then I used the updated makefile to edit the README,
which in turns requires a bump of the version of the chart
itself.
Without this patch, the name of the image is not templated, which
causes the action to fail.
This should fix it, by ensuring the image scan action uses a
templated value, instead of incorrectly relying on shell templating,
which doesn't run in the action.
Without this patch, we are using outdated images in kind cluster
setup.
This should fix it, by removing 1.17 cluster (which is not tested
anymore), and updating 1.19 images.
Until a new alpine image is created, we should ensure the latest
packages are used, and therefore we should upgrade default
installed packages.
Without this patch, we'll have outdated and vulnerable packages
until a new 3.12 image is released.
This is a problem, as we'll publish broken images.
This should temporarily workaround it, at the expense of larger
images (contains package cache)
Without this patch, dependabot will still try to bump some k8s
dependencies.
This is a problem, as we need to bump them together, manually.
This should fix it by removing them all from dependabot.
We are now testing the helm charts on each PR. They are now
ensured to be passing our tests and reviewed before merging.
This also means that the merged changes in the master branch
are reliable, and therefore can be consumed immediately.
Currently, we are waiting for a release to publish a helm
chart.
This is a problem as it means that the helm chart will
always lag behind, and we'll miss a few semantic versions,
if for example the helm chart is adapted multiple times
before the next release.
This should fix it by ensuring ALL the merged changes in
our helm chart will result in a new published helm chart.
Without this patch, chart linting will fail: more than two
spaces are needed before a comment in the helm chart values.
This fixes it by adding one more space, and move the whole block
of comments for consistency.
This ensures we bump the code for 1.20.
It updates the testing to ensure kured works on a 1.20 cluster,
removes the testing on 1.17 (as it is now deprecated).
Libraries remain on 1.19, to avoid breaking 1.18 clusters.
Without this patch, the PR jobs are broken and no jobs are running.
This was caused by a typo introduced in the last refactor of the
PR jobs.
This should fix it, and make the PR tests work again.
Without this, golang version used is the golang version decided
by github.
This is a problem, as it might shift over time, without our control.
This fixes it by getting the golang version from the go.mod.
Without this patch, we'll get kubernetes updates.
This is not necessary, and could be even a problem on merge:
those kubernetes updates are done separately, knowingly,
to respect the life cycle of the kubernetes we need
(and stay one version below latest to have a larger coverage
of versions).
We could keep dependabot to update those on a lower frequency,
but that sounds clunky and not great. Instead disable them all,
and rely on the team to do this regular maintenance work.
There is a lot of duplicated code in this workflow.
This fixes it by making a unique job with parameters. The
matrix buys us the parallelisation and the fail-fast.
Without this patch, the lint action incorrectly returns everything
is fine.
This is a problem, as lint effectively is not running, and
therefore we could merge broken charts.
This fixes it by updating to the latest practices you can find
in the official chart-repo-actions.
(See the official example in
i1a9640d998/.github/workflows/lint-test.yaml)
- Made all the file extensions ".yaml"
- Regrouped actions together to make it easy to see when they
are useful: on-pr is useful at every PR, on-tag when we are
ready to tag next image, on-pr-chart when we have a PR to
modify the chart with the published image, on-release when
we have released and need to publish the final helm chart
- Regrouped periodic jobs together, to deal with stale prs/issues
and ensuring that our helm chart always works.
When a failure is happening and the cluster doesn't manage to
be back up on time, we exit 1, and don't show docker logs.
This is a problem, as we would benefit from a detailed docker
output on those cases, when debugging.
This fixes it by ensuring the logging is always done at the
exit of the script.
We don't need to test with kustomize. Manifest testing is good
enough, as we just test that the manifests are correct, not that
they are functional (which would require a change in the poll time).
This extends our test coverages for kured-* manifest changes on PRs,
and any eventual changes in kubernetes/kubectl on periodics.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
In the past, we had lint issues which were merged into the code,
and/or lint changed without us adapting our code.
This should allow us to stay on top of linting issues by
highlighting them in PRs.
Without this patch, we might hold old issues and PRs for a long
time. Instead we should close them. People can reopen if necessary.
This would show that we have a proper triage process, and a proper
way to handle those.
Without this patch, we need to build a cache and then remove it.
Since apk can work with no-cache and won't leave artifacts,
we should use it.
This will make the dockle best practices scanner happier.
Without this patch, there is no way we can see, in the development
process, if the image we are about to publish is insecure.
This is a problem as we might be releasing new versions of kured
with outdated base image which contains vulnerabilities.
This fixes it by creating a job which will show any eventual
vulnerability.
This automates the manifest and helm chart version handling.
Just pass the organisation and version in the make command to
update the manifests/helm charts.
This does not automate the helm chart version and does not
create a manifest used in the release process.
Without this patch, we don't test on release whether kured actually
works and behaves well.
This is a problem, as a functional issue could have been hidden by
a recent change, as our testing is minimalist (only test the
usability, not the functionality).
Instead of testing manually, we should ensure this in CI.
This fixes it by adding a github action which tests the previously
built artifacts before publishing a release. The job consumes the helm
chart in our code tree (note: this relies on the last released image),
and runs a functional test triggering a coordinated restart of a
whole 5 node cluster deployed with kind, through github actions.
Note: The github action needs to reset docker configuration, else
the reboot of the node (a docker container in kind) will fail.
It will be correctly triggered, but the node will not come back up,
with its systemd log mentioning: "Failed to attach 1 to compat systemd cgroup".
The upside is that image building will always use the latest
stable version of the alpine OS, which might include security fixes.
The downside is that it's less reproducible, because the full
version isn't given.
While this commit isn't necessary per se, it's nice to have
an image that will be up to date, when we'll build it.
Use the tools installed in the host to effect reboots, and allow
the execution of commands such as `needs-restart` to determine if
reboots are required.
- Added kured service account
- Added kured clusterrole
- Added kured clusterrolebinding
- Updated README.md documentation to include deploying with RBAC support
Due to
metadata:
name: kured # Must match `--ds-name`
namespace: kube-system # Must match `--ds-namespace`
There should be
- --ds-name=kured
- --ds-namespace=kube-system
As args.
2017-11-03 23:02:03 +03:00
60 changed files with 5469 additions and 612 deletions
# Stale by default waits for 60 days before marking PR/issues as stale, and closes them after 21 days.
# Do not expire the first issues that would allow the community to grow.
- uses: actions/stale@v5
  with:
    repo-token: ${{ secrets.GITHUB_TOKEN }}
    stale-issue-message: 'This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).'
    stale-pr-message: 'This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).'
    stale-issue-label: 'no-issue-activity'
    stale-pr-label: 'no-pr-activity'
    exempt-issue-labels: 'good first issue,keep'
    days-before-close: 21
check-docs-links:
  name: Check docs for incorrect links
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Link Checker
      id: lc
      uses: peter-evans/link-checker@v1
      with:
        args: -r *.md *.yaml */*/*.go -x .cluster.local
    - name: Fail if there were link errors
      run: exit ${{ steps.lc.outputs.exit_code }}
vuln-scan:
  name: Build image and scan it against known vulnerabilities
SUDO=$(shell docker info >/dev/null 2>&1 || echo "sudo -E")
all: image
clean:
go clean
rm -f cmd/kured/kured
rm -rf ./build
godeps=$(shell go get $1 && go list -f '{{join .Deps "\n"}}' $1 | grep -v /vendor/ | xargs go list -f '{{if not .Standard}}{{ $$dep := . }}{{range .GoFiles}}{{$$dep.Dir}}/{{.}} {{end}}{{end}}')
godeps=$(shell go list -f '{{join .Deps "\n"}}' $1 | grep -v /vendor/ | xargs go list -f '{{if not .Standard}}{{ $$dep := . }}{{range .GoFiles}}{{$$dep.Dir}}/{{.}} {{end}}{{end}}')
DEPS=$(call godeps,./cmd/kured)
@@ -19,16 +19,40 @@ cmd/kured/kured: $(DEPS)
cmd/kured/kured: cmd/kured/*.go
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags "-X main.version=$(VERSION)" -o $@ cmd/kured/*.go
kured-multi:
CGO_ENABLED=0 go build -ldflags "-X main.version=$(VERSION)" -o cmd/kured/kured cmd/kured/*.go
If you want to customise the installation, download the manifest and
@@ -58,17 +87,49 @@ edit it in accordance with the following section before application.
The following arguments can be passed to kured via the daemonset pod template:
```
```console
Kubernetes Reboot Daemon
Usage:
kured [flags]
Flags:
--alert-filter-regexp value alert names to ignore when checking for active alerts
--ds-name string namespace containing daemonset on which to place lock (default "kube-system")
--ds-namespace string name of daemonset on which to place lock (default "kured")
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--period duration reboot check period (default 1h0m0s)
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-sentinel string path to file whose existence signals need to reboot (default "/var/run/reboot-required")
--slack-hook-url string slack hook URL for reboot notfications
--slack-username string slack username for reboot notfications (default "kured")
--alert-filter-regexp regexp.Regexp alert names to ignore when checking for active alerts
--alert-firing-only only consider firing alerts when checking for active alerts
--annotate-nodes if set, the annotations 'weave.works/kured-reboot-in-progress' and 'weave.works/kured-most-recent-reboot-needed' will be given to nodes undergoing kured reboots
--drain-grace-period int time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default -1)
--drain-timeout duration timeout after which the drain is aborted (default: 0, infinite time)
--ds-name string name of daemonset on which to place lock (default "kured")
--ds-namespace string namespace containing daemonset on which to place lock (default "kube-system")
--end-time string schedule reboot only before this time of day (default "23:59:59")
--force-reboot force a reboot even if the drain fails or times out
-h, --help help for kured
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--lock-release-delay duration delay lock release for this duration (default: 0, disabled)
--lock-ttl duration expire lock annotation after this duration (default: 0, disabled)
--log-format string use text or json log format (default "text")
--message-template-drain string message template used to notify about a node being drained (default "Draining node %s")
--message-template-reboot string message template used to notify about a node being rebooted (default "Rebooting node %s")
--message-template-uncordon string message template used to notify about a node being successfully uncordoned (default "Node %s rebooted & uncordoned successfully!")
--node-id string node name kured runs on, should be passed down from spec.nodeName via KURED_NODE_ID environment variable
--notify-url string notify URL for reboot notifications (cannot use with --slack-hook-url flags)
--period duration sentinel check period (default 1h0m0s)
--post-reboot-node-labels strings labels to add to nodes after uncordoning
--pre-reboot-node-labels strings labels to add to nodes before cordoning
--prefer-no-schedule-taint string Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to "weave.works/kured-node-reboot" to enable tainting.
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-command string command to run when a reboot is required (default "/bin/systemctl reboot")
--reboot-days strings schedule reboot on these days (default [su,mo,tu,we,th,fr,sa])
--reboot-delay duration delay reboot for this duration (default: 0, disabled)
--reboot-sentinel string path to file whose existence triggers the reboot command (default "/var/run/reboot-required")
--reboot-sentinel-command string command for which a zero return code will trigger a reboot command
--skip-wait-for-delete-timeout int when seconds is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node
--slack-channel string slack channel for reboot notifications
--slack-hook-url string slack hook URL for reboot notifications [deprecated in favor of --notify-url]
--slack-username string slack username for reboot notifications (default "kured")
--start-time string schedule reboot only after this time of day (default "0:00")
--time-zone string use this timezone for schedule inputs (default "UTC")
```
### Reboot Sentinel File & Period
@@ -79,32 +140,115 @@ values with `--reboot-sentinel` and `--period`. Each replica of the
daemon uses a random offset derived from the period on startup so that
nodes don't all contend for the lock simultaneously.
### Reboot Sentinel Command
Alternatively, a reboot sentinel command can be used. If a reboot
sentinel command is used, the reboot sentinel file presence will be
ignored. When the command exits with code `0`, kured will assume
that a reboot is required.
For example, if you're using RHEL or its derivatives, you can
set the sentinel command to `sh -c "! needs-restarting --reboothint"`
(by default the command will return `1` if a reboot is required,
so we wrap it in `sh -c` and add `!` to negate the return value).
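The exit-code convention can be sketched as follows. This is an illustration rather than kured's implementation: `rebootRequired` is a hypothetical helper, and the `sh -c` invocations are stand-ins for a real sentinel command (they need a POSIX `sh` on the PATH).

```go
package main

import (
	"fmt"
	"os/exec"
)

// rebootRequired runs the sentinel command and reports true only when it
// exits with code 0. A non-zero exit or a failure to start counts as false.
func rebootRequired(name string, args ...string) bool {
	return exec.Command(name, args...).Run() == nil
}

func main() {
	// Exit code 0: a reboot is considered required.
	fmt.Println(rebootRequired("sh", "-c", "exit 0"))
	// Exit code 1: no reboot.
	fmt.Println(rebootRequired("sh", "-c", "exit 1"))
	// Negating a command that exits 1 yields 0, as in the RHEL example.
	fmt.Println(rebootRequired("sh", "-c", "! sh -c 'exit 1'"))
}
```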
We recommend setting `--slack-username` to be the name of the
environment, e.g. `dev` or `prod`.
Alternatively you can use the `--message-template-drain`, `--message-template-reboot` and `--message-template-uncordon` to customize the text of the message, e.g.
```cli
--message-template-drain="Draining node %s part of *my-cluster* in region *xyz*"
```
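The `%s` placeholder in these templates stands for the node name. A minimal sketch of the substitution, assuming a plain `fmt.Sprintf` call; `renderMessage` and the node name `node-1` are hypothetical:

```go
package main

import "fmt"

// renderMessage substitutes the node name into a template that contains
// a single %s verb.
func renderMessage(template, nodeID string) string {
	return fmt.Sprintf(template, nodeID)
}

func main() {
	fmt.Println(renderMessage("Draining node %s part of *my-cluster* in region *xyz*", "node-1"))
}
```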
Here is the syntax:
- slack: `slack://tokenA/tokenB/tokenC`
(`slack://<USERNAME>@tokenA/tokenB/tokenC` - in case you want to [respect username](https://github.com/weaveworks/kured/issues/482))
(`--slack-hook-url` is deprecated but possible to use)
> NB the `-` at the end of the command is important - it instructs
> `kubectl` to remove that annotation entirely.
### Automatic Unlock
In exceptional circumstances (especially when used with cluster-autoscaler) a node
that holds the lock might be killed, so the annotation would stay there forever.
Using `--lock-ttl=30m` allows other nodes to take over once the TTL has expired (in this case 30min) and continue the reboot process.
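The expiry rule can be sketched as a small predicate. This is an assumption about the logic, not kured's code: `lockExpired` is a hypothetical helper, and a TTL of zero is treated as "never expires", matching the flag's documented default of `0, disabled`.

```go
package main

import (
	"fmt"
	"time"
)

// lockExpired reports whether a lock of the given age may be taken over.
// A TTL of zero (the default) disables expiry.
func lockExpired(age, ttl time.Duration) bool {
	if ttl <= 0 {
		return false
	}
	return age >= ttl
}

func main() {
	ttl := 30 * time.Minute
	fmt.Println(lockExpired(45*time.Minute, ttl)) // older than the TTL
	fmt.Println(lockExpired(10*time.Minute, ttl)) // still within the TTL
	fmt.Println(lockExpired(45*time.Minute, 0))   // expiry disabled
}
```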
### Delaying Lock Release
Using `--lock-release-delay=30m` will cause nodes to hold the lock for the specified time frame (in this case 30min) before it is released and the reboot process continues. This can be used to throttle reboots across the cluster.
## Building
Kured now uses [Go
Modules](https://github.com/golang/go/wiki/Modules), so build
instructions vary depending on where you have checked out the
repository:
**Building outside $GOPATH:**
```console
make
```
dep ensure && make
**Building inside $GOPATH:**
```console
GO111MODULE=on make
```
You can find the current preferred version of Golang in the [go.mod file](go.mod).
If you are interested in contributing code to kured, please take a look at
our [development][development] docs.
[development]: DEVELOPMENT.md
## Frequently Asked/Anticipated Questions
### Why is there no `latest` tag on Docker Hub?
Use of `latest` for production deployments is bad practice - see
[here](https://kubernetes.io/docs/concepts/configuration/overview) for
details. The manifest on `main` refers to `latest` for local
development testing with minikube only; for production use choose a
versioned manifest from the [release page](https://github.com/weaveworks/kured/releases/).
## Getting Help
If you have any questions about, feedback for or problems with `kured`:
* Invite yourself to the <a href="https://slack.weave.works/" target="_blank">Weave Users Slack</a>.
* Ask a question on the [#kured](https://weave-community.slack.com/messages/kured/) slack channel.
* [File an issue](https://github.com/weaveworks/kured/issues/new).
* Join us in [our monthly meeting](https://docs.google.com/document/d/1bsHTjHhqaaZ7yJnXF6W8c89UB_yn-OoSZEmDnIP34n8/edit#),
every fourth Wednesday of the month at 16:00 UTC.
We follow the [CNCF Code of Conduct](CODE_OF_CONDUCT.md).
The command removes all the Kubernetes components associated with the chart and deletes the release.
## Upgrade Notes
### From 2.x to 3.x
The Helm chart labels have been realigned to conform with the [standard labels](https://helm.sh/docs/chart_best_practices/labels/#standard-labels) in the current Helm Chart Best Practices guide, so this upgrade will fail unless the DaemonSet is deleted and recreated. The only way that Helm supports delete and recreate is by uninstalling, so please uninstall the Kured Helm chart before installing again with `v3.x`.
If you use any GitOps tool, please check and understand how to do a reinstall beforehand.
Suppose you want to enable metrics and use a `ServiceMonitor` with the `kube-prometheus-stack` chart's default `prometheus` instance. Start with a chart that has these values:
```
metrics:
  create: true
  labels:
    release: kube-prometheus-stack
```
A `ServiceMonitor` needs a `release` label to be discovered by the Prometheus Operator under the default configuration of `kube-prometheus-stack`. The prior `v2.x` chart already set a hardcoded `release` label; chart `v3.x` changes this by applying the best-practice labels, so the user can now decide which `release` label value to use.
With this update, it's more readily possible to make use of the Kured chart with `kube-prometheus-stack`'s default `ServiceMonitor` selector configuration.
## Migrate from stable Helm-Chart
### From 1.x to 2.x
The following changes have been made compared to the stable chart:
- **[BREAKING CHANGE]** The `autolock` feature was removed. Use `configuration.startTime` and `configuration.endTime` instead.
- Role inconsistencies have been fixed (allowed verbs for modifying the `DaemonSet`, apiGroup of `PodSecurityPolicy`)
- Added support for affinities.
- Configuration of cli-flags can be made through a `configuration` object.
- Added optional `Service` and `ServiceMonitor` support for metrics endpoint.
- Previously static Slack channel, hook URL and username values are now made dynamic using `tpl` function.
| `configuration.preferNoScheduleTaint` | Taint name applied during pending node reboot | `""` |
| `configuration.preRebootNodeLabels` | Array of key-value-pairs to add to nodes before cordoning for multiple cli-parameters `--pre-reboot-node-labels` | `[]` |
| `configuration.postRebootNodeLabels` | Array of key-value-pairs to add to nodes after uncordoning for multiple cli-parameters `--post-reboot-node-labels` | `[]` |
| `rbac.create` | Create RBAC roles | `true` |
| `serviceAccount.create` | Create a service account | `true` |
| `serviceAccount.name` | Service account name to create (or use if `serviceAccount.create` is false) | (chart fullname) |
| `containerSecurityContext.allowPrivilegeEscalation`| Enables `allowPrivilegeEscalation` in container-specific security context. If not set it won't be configured. | |
| `resources` | Resources requests and limits. | `{}` |
| `metrics.create` | Create a ServiceMonitor for prometheus-operator | `false` |
| `metrics.namespace` | The namespace to create the ServiceMonitor in | `""` |
| `metrics.labels` | Additional labels for the ServiceMonitor | `{}` |
| `metrics.interval` | Interval prometheus should scrape the endpoint | `60s` |
| `metrics.scrapeTimeout` | A custom scrapeTimeout for prometheus | `""` |
| `service.create` | Create a Service for the metrics endpoint | `false` |
| `service.name` | Service name for the metrics endpoint | `""` |
| `service.port` | Port of the service to expose | `8080` |
| `service.annotations` | Annotations to apply to the service (eg to add Prometheus annotations) | `{}` |
| `priorityClassName` | Priority Class to be used by the pods | `""` |
| `tolerations` | Tolerations to apply to the daemonset (eg to allow running on master) | `[{"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}]` for Kubernetes 1.24.0 and greater, otherwise `[{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}]`|
| `affinity` | Affinity for the daemonset (ie, restrict which nodes kured runs on) | `{}` |
| `nodeSelector` | Node Selector for the daemonset (ie, restrict which nodes kured runs on) | `{}` |
| `volumeMounts` | Maps of volume mounts to mount | `{}` |
| `volumes` | Maps of volumes to mount | `{}` |
See https://github.com/weaveworks/kured#configuration for flags that are not covered by the `configuration` object and can be passed through `extraArgs`. Note that
```yaml
extraArgs:
  foo: 1
  bar-baz: 2
```
becomes `/usr/bin/kured ... --foo=1 --bar-baz=2`.
## Prometheus Metrics
Kured exposes a single Prometheus metric indicating whether a reboot is required or not (see the [kured docs](https://github.com/weaveworks/kured#prometheus-metrics) for details).
# endTime: "" # only reboot before this time of day (default "23:59")
# lockAnnotation: "" # annotation in which to record locking node (default "weave.works/kured-node-lock")
period: "1m" # reboot check period (default 1h0m0s)
# forceReboot: false # force a reboot even if the drain fails or times out (default: false)
# drainGracePeriod: "" # time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default: -1)
# drainTimeout: "" # timeout after which the drain is aborted (default: 0, infinite time)
# skipWaitForDeleteTimeout: "" # when time is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node (default: 0)
# prometheusUrl: "" # Prometheus instance to probe for active alerts
# rebootDays: [] # only reboot on these days (default [su,mo,tu,we,th,fr,sa])
# rebootSentinel: "" # path to file whose existence signals need to reboot (default "/var/run/reboot-required")
# rebootSentinelCommand: "" # command for which a successful run signals need to reboot (default ""). If non-empty, sentinel file will be ignored.
# slackChannel: "" # slack channel for reboot notifications
# slackHookUrl: "" # slack hook URL for reboot notifications
  tag: "" # will default to the appVersion in Chart.yaml
  pullPolicy: IfNotPresent
  pullSecrets: []
updateStrategy: RollingUpdate
# requires RollingUpdate updateStrategy
maxUnavailable: 1
podAnnotations: {}
dsAnnotations: {}
extraArgs: {}
extraEnvVars:
#  - name: slackHookUrl
#    valueFrom:
#      secretKeyRef:
#        name: secret_name
#        key: secret_key
#  - name: regularEnvVariable
#    value: 123
configuration:
  lockTtl: 0 # force clean annotation after this amount of time (default 0, disabled)
  alertFilterRegexp: "" # alert names to ignore when checking for active alerts
  alertFiringOnly: false # only consider firing alerts when checking for active alerts
  blockingPodSelector: [] # label selector identifying pods whose presence should prevent reboots
  endTime: "" # only reboot before this time of day (default "23:59")
  lockAnnotation: "" # annotation in which to record locking node (default "weave.works/kured-node-lock")
  period: "" # reboot check period (default 1h0m0s)
  forceReboot: false # force a reboot even if the drain fails or times out (default: false)
  drainGracePeriod: "" # time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default: -1)
  drainTimeout: "" # timeout after which the drain is aborted (default: 0, infinite time)
  skipWaitForDeleteTimeout: "" # when time is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node (default: 0)
  prometheusUrl: "" # Prometheus instance to probe for active alerts
  rebootDays: [] # only reboot on these days (default [su,mo,tu,we,th,fr,sa])
  rebootSentinel: "" # path to file whose existence signals need to reboot (default "/var/run/reboot-required")
  rebootSentinelCommand: "" # command for which a successful run signals need to reboot (default ""). If non-empty, sentinel file will be ignored.
  rebootCommand: "/bin/systemctl reboot" # command to run when a reboot is required by the sentinel
  rebootDelay: "" # add a delay after drain finishes but before the reboot command is issued
  slackChannel: "" # slack channel for reboot notifications
  slackHookUrl: "" # slack hook URL for reboot notifications
  slackUsername: "" # slack username for reboot notifications (default "kured")
  notifyUrl: "" # notification URL with the syntax as follows: https://containrrr.dev/shoutrrr/services/overview/
  messageTemplateDrain: "" # slack message template when notifying about a node being drained (default "Draining node %s")
  messageTemplateReboot: "" # slack message template when notifying about a node being rebooted (default "Rebooted node %s")
  messageTemplateUncordon: "" # slack message template when notifying about a node being uncordoned (default "Node %s rebooted & uncordoned successfully!")
  startTime: "" # only reboot after this time of day (default "0:00")
  timeZone: "" # time-zone to use (valid zones from "time" golang package)
  annotateNodes: false # enable 'weave.works/kured-reboot-in-progress' and 'weave.works/kured-most-recent-reboot-needed' node annotations to signify kured reboot operations
  lockReleaseDelay: 0 # hold lock after reboot by this amount of time (default 0, disabled)
  preferNoScheduleTaint: "" # Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to "weave.works/kured-node-reboot" to enable tainting.
  logFormat: "text" # log format specified as text or json, defaults to text
  preRebootNodeLabels: [] # labels to add to nodes before cordoning (default [])
  postRebootNodeLabels: [] # labels to add to nodes after uncordoning (default [])
rbac:
  create: true
serviceAccount:
  create: true
  name:
podSecurityPolicy:
  create: false
containerSecurityContext:
  privileged: true # Give permission to nsenter /proc/1/ns/mnt
#  allowPrivilegeEscalation: true # Needed when using defaultAllowPrivilegedEscalation: false in psp
"Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to \"weave.works/kured-node-reboot\" to enable tainting.")
"if set, the annotations 'weave.works/kured-reboot-in-progress' and 'weave.works/kured-most-recent-reboot-needed' will be given to nodes undergoing kured reboots")
			// Prefer to not schedule pods onto this node to avoid draining the same pod multiple times.
			preferNoScheduleTaint.Enable()
			continue
		}
		err = drain(client, node)
		if err != nil {
			if !forceReboot {
				log.Errorf("Unable to cordon or drain %s: %v, will release lock and retry cordon and drain before rebooting when lock is next acquired", node.GetName(), err)
				release(lock)
				log.Infof("Performing a best-effort uncordon after failed cordon and drain")
				uncordon(client, node)
				continue
			}
		}
		if rebootDelay > 0 {
			log.Infof("Delaying reboot for %v", rebootDelay)
			time.Sleep(rebootDelay)
		}
		invokeReboot(nodeID, rebootCommand)
		for {
			log.Infof("Waiting for reboot")
			time.Sleep(time.Minute)
		}
	}
}
// buildSentinelCommand creates the shell command line which will need wrapping to escape
// String returns a string representation of this time window.
func (tw *TimeWindow) String() string {
	return fmt.Sprintf("%s between %02d:%02d and %02d:%02d %s", tw.days.String(), tw.startTime.Hour(), tw.startTime.Minute(), tw.endTime.Hour(), tw.endTime.Minute(), tw.location.String())
}
// parseTime tries to parse a time with several formats.