* added notification when uncordoning
when reboot & uncordoning is successful -> notification will be sent
* added uncordon message tmpl
added message template for announcing successful uncordoning and reboot.
* added proper documentation about new flag
added readme note about new flag
* Added support for multi-arch image build
* Requested changes to multi-arch build
* Further optimizations of multi build
* multi needs QEMU for some pieces
* change main push for all platforms
* Update Dockerfile to call Makefile
* Remove manual workflow
* feat: update kubernetes dependencies
closes #525
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: update kind
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: missed kind-update
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: another kind update
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: use new toleration
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* fix: use both tolerations
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* build: some debugging
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
* revert [skip ci]
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Terminated pods should be excluded from blocking a reboot, as per https://github.com/weaveworks/kured/issues/227.
This adds status filters to the fieldSelector in order to do that. I've not updated tests here but have successfully tested the exact same filter using kubectl.
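In client-go terms the change boils down to extending the field selector that kured already passes when listing pods on the node. A minimal sketch of that selector (the function name and parameters here are illustrative, not kured's actual code):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// blockingPodCount lists pods on the given node that should block a reboot,
// excluding pods that have already terminated (Succeeded or Failed) -- the
// same filter that can be tested with `kubectl get pods --field-selector=...`.
func blockingPodCount(client kubernetes.Interface, nodeID, labelSelector string) (int, error) {
	fieldSelector := fmt.Sprintf("spec.nodeName=%s,status.phase!=Succeeded,status.phase!=Failed", nodeID)
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		LabelSelector: labelSelector, // e.g. a --blocking-pod-selector value
		FieldSelector: fieldSelector,
	})
	if err != nil {
		return 0, err
	}
	return len(pods.Items), nil
}
```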
This commit introduces a new flag '--log-format' that allows a user
to configure json logging on the pods. If the log-format
is not specified, the formatter will default to the existing
text formatter.
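A minimal sketch of how such a flag can be wired up with logrus (the logging library kured uses); the helper name here is illustrative:

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

// configureLogFormat switches the logrus formatter according to --log-format.
// Anything other than "json" falls back to the existing text formatter.
func configureLogFormat(logFormat string) {
	switch logFormat {
	case "json":
		log.SetFormatter(&log.JSONFormatter{})
	default:
		log.SetFormatter(&log.TextFormatter{})
	}
}
```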
In this PR the slack-hook-url is translated
into shoutrrr syntax. Therefore, the slack package,
as well as the checks for slack-hook-url in the
drain and reboot functions, are removed.
Also added a unit test for flagCheck(); this
function also checks the (slack) URL syntax.
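The translation itself is mechanical: a Slack incoming-webhook URL carries three tokens in its path, and shoutrrr's slack service expects those same tokens in `slack://tokenA/tokenB/tokenC` form. A hedged sketch of such a conversion (helper name illustrative, not kured's exact code):

```go
package main

import (
	"fmt"
	"strings"
)

// slackHookToShoutrrr translates a legacy Slack incoming-webhook URL
// (https://hooks.slack.com/services/tokenA/tokenB/tokenC) into the
// equivalent shoutrrr URL (slack://tokenA/tokenB/tokenC).
func slackHookToShoutrrr(hookURL string) (string, error) {
	parts := strings.Split(strings.TrimSuffix(hookURL, "/"), "/services/")
	if len(parts) != 2 || parts[1] == "" {
		return "", fmt.Errorf("unrecognised slack hook URL: %s", hookURL)
	}
	return "slack://" + parts[1], nil
}
```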
Currently, kured issues the system reboot command immediately after
kubectl drain finishes.
This is a problem for processes that need extra time to finish but aren't
running on pods and therefore aren't controlled by kubectl drain (e.g.
de-registering nodes from external load balancers).
This patch solves the problem by introducing a `reboot-delay` command
line argument that can be used to add a delay after kubectl drain
finishes but before the reboot command is issued.
* prometheus labels incl tests
* enable label in main, add log, docs
* revert the option to query by label
* revert the option to query by label
* PromClient instantiated by func, whitespace removal
* revert whitespace fix for readability.
* revert removal of newlines for readability
* rename New to NewPromClient to improve readability
Co-authored-by: simp <simp@saxobank.com>
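For reference, the renamed constructor is roughly shaped like this (a sketch based on the Prometheus client_golang API; field names may differ from the actual kured code):

```go
package alerts

import (
	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// PromClient wraps the Prometheus query API used to check for active alerts.
type PromClient struct {
	api v1.API
}

// NewPromClient is the renamed constructor: returning the concrete type from a
// named constructor (rather than a bare New) keeps call sites readable.
func NewPromClient(conf api.Config) (*PromClient, error) {
	client, err := api.NewClient(conf)
	if err != nil {
		return nil, err
	}
	return &PromClient{api: v1.NewAPI(client)}, nil
}
```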
shoutrrr now has versioned docs, allowing direct linking to the version that matches the one you use.
Changes should always be backwards compatible, but not the other way around.
Without this, we get multiple questions about our testing.
This should help clarify the tests and our coverage by:
- Simplifying our coverage
- Documenting better the purpose of each workflow file
- Documenting our testing and development activities better.
We are relying on master, which might break anytime (or, in this
case, has moved to another branch).
Instead we should rely on a stable version, and unfreeze if
necessary. Dependabot helps us maintain those releases anyway.
Without this patch, it's not clear that we added command line
arguments recently. This should expose our latest changes in the
future released manifest.
Without this change, the "Test helm chart (install) action" will
rightfully succeed when our helm chart gets installed and has
no syntax issues. However, it doesn't test if kured is properly
installed. For example, the helm chart can try to install a
yet unpublished image, and our test will succeed, as the syntax
is still valid.
This is a problem, as everything looks green, but it's not
effectively working. Our other jobs are focusing on code changes,
so they rightfully override the image tag, which is not what
we want in this "Test helm chart" action.
This fixes it by adding an extra job in the workflow, depending
on the chart testing.
Without this, we can't know if the exposed prometheus metrics
behave properly.
This is a problem, as the only way we can evaluate the success
(right now), is a compilation success or failure from kured.
While this is a good start, it doesn't translate to what we
claim to offer: A boolean showing if a reboot is required.
This fixes it by creating a new github action workflow testing
if the float64 gauge is properly showing 0 for no reboot, 1 for reboot.
This is done by exposing the metrics endpoint through a node port.
A helm chart change was required to have the ability to expose
the service on a node port. We connect to the kind node through
docker in the `tests/test-metrics.sh`, where we curl the nodeport,
extract the only relevant metric, and compare it to the expected result.
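The gauge that the new workflow asserts on is, in essence, a client_golang GaugeVec set to 0 or 1 per node. A minimal sketch (metric name and help text approximate kured's, everything else is illustrative):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var rebootRequiredGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "kured",
	Name:      "reboot_required",
	Help:      "OS requires reboot due to software updates.",
}, []string{"node"})

func main() {
	prometheus.MustRegister(rebootRequiredGauge)

	// 1 when the sentinel says a reboot is needed, 0 otherwise.
	rebootRequiredGauge.WithLabelValues("node-a").Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```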
This supports throttling of reboots across the cluster
and allows rebooted nodes to reschedule pods, e.g.
to synchronize replicated state before rebooting the next node.
Without this patch, chart-testing is using the branch named
"master" by default.
This is a problem, as we just renamed our development branch
"main" instead of "master".
This should fix it by pointing to the right branch.
- Make markdownlint happier in a couple of places.
- Rename '*-master-*' files
- Change default branches of some other projects
we rely on. They moved to 'main' as well.
- Standardise version of actions/checkout.
- Update last release in README to 1.6.1.
- Bump chart version.
Eventually closes: #252
Signed-off-by: Daniel Holbach <daniel@weave.works>
Without this patch, the rebootCommand passed to invokeReboot is
ignored, and the command used for reboot is always systemctl reboot.
This is a problem, as we are aiming for flexible commands for this
release.
This fixes it by restoring the previous behaviour before commit
[1] happened.
[1]: 694957d56e
This patch gives the possibility to send notifications
across different technologies. Also, this patch deprecates
slack-hook-url, slack-username and slack-channel
(a warning informs the user).
Also updated the documentation (README).
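With shoutrrr the notification path collapses to a single call that works for Slack and every other supported service; a sketch (the helper name is illustrative, not kured's exact code):

```go
package main

import (
	"fmt"

	"github.com/containrrr/shoutrrr"
	log "github.com/sirupsen/logrus"
)

// notify sends a message through any shoutrrr-supported service
// (slack, teams, email, ...) identified by the --notify-url value.
func notify(notifyURL, messageTemplate, nodeID string) {
	if notifyURL == "" {
		return
	}
	if err := shoutrrr.Send(notifyURL, fmt.Sprintf(messageTemplate, nodeID)); err != nil {
		log.Warnf("Error notifying: %v", err)
	}
}
```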
Without this, go test will rightfully fail.
This is a problem, as we don't have go test enabled, but we want
to have this in the future.
This should fix it.
Without this patch, you cannot configure the reboot
command to use, or use another command to trigger
a reboot.
This is a problem, as multiple users have asked for
it in the past, and we are lacking flexibility.
This fixes it by introducing two new parameters,
- one to provide a custom reboot command.
This should help people running kured on
a non-systemd OS.
- one to provide a custom sentinel command.
This should help people running a non-Ubuntu OS,
as they can directly use their own command instead of
generating a file (useful for CentOS/SUSE).
For this, several refactors had to be done, to
remove global state in some functions. Making those
functions closer to "pure functions" helps us
increase our test coverage here and later.
As commandReboot was very close to rebootCommand,
the function to reboot the node has been renamed
to invokeReboot.
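A rough sketch of what the renamed invokeReboot looks like under these assumptions (the real implementation runs the command in the host's namespaces and handles privileges, which is omitted here):

```go
package main

import (
	"os/exec"

	log "github.com/sirupsen/logrus"
)

// invokeReboot runs the (now configurable) reboot command for the given node.
func invokeReboot(nodeID string, rebootCommand []string) {
	log.Infof("Running command: %s for node: %s", rebootCommand, nodeID)
	if err := exec.Command(rebootCommand[0], rebootCommand[1:]...).Run(); err != nil {
		log.Fatalf("Error invoking reboot command: %v", err)
	}
}
```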
Without this patch, we rely on global state in many functions for
which we check the reboot blockers.
This is a problem, as it's harder to test.
This patch fixes it by refactoring the reboot blockers. This also
includes a first series of unit tests for our main.
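The refactor converges on small blocker values that carry their own dependencies and can be faked in tests; conceptually something like this (a sketch of the intent, not the exact kured types):

```go
package main

// RebootBlocker is anything that can veto a reboot: a Prometheus checker
// holding its own client, a pod checker holding its own clientset, or a
// test fake -- no package-level globals involved.
type RebootBlocker interface {
	Blocked() bool
}

// rebootBlocked reports whether any configured blocker currently vetoes a reboot.
func rebootBlocked(blockers ...RebootBlocker) bool {
	for _, blocker := range blockers {
		if blocker.Blocked() {
			return true
		}
	}
	return false
}
```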
Without this patch, the version of 1.20 is taken in jobs as 1.2.
This is a problem, as it breaks all jobs, because there is no
file to provision a cluster with kubernetes 1.2 (and we shouldn't
do this!)
This fixes it by ensuring there is no mangling of the version
strings, and therefore the right file is used.
DeleteLocalData was deprecated for users of kubectl in 0.20 [1].
At the same time as the deprecation, the relevant code was also
removed [2] without warning: the DeleteLocalData field of the helper
structure was simply renamed to DeleteEmptyDirData, without shims
in the exposed pkg.
This is a problem, as it completely breaks kured.
This should fix it, by using the new field name.
[1]:
56ea9621b7
[2]:
56ea9621b7 (diff-041bdcdedca650a38a8d82cf15ab6f3665b7b84a0fb44a8bb5dcdc5cd944c63d)
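On kured's side the fix is essentially a one-field rename when building the kubectl drain helper; a sketch of the relevant construction (the surrounding options are illustrative defaults, not kured's exact values):

```go
package main

import (
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// newDrainHelper shows the one-field change: DeleteLocalData became
// DeleteEmptyDirData in k8s.io/kubectl 0.20, with no deprecation shim.
func newDrainHelper(client kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true, // was: DeleteLocalData
		GracePeriodSeconds:  -1,
		Timeout:             time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```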
Without this patch, go.mod will lag behind for the kubernetes
packages, as it's not automatically tested by dependabot.
We should bump versions with each new minor release of kured.
This should fix it.
adds a new --annotate-nodes daemonset runtime argument, which does the following when enabled:
- adds a new node annotation "weave.works/kured-most-recent-reboot-needed" with a value of the current RFC3339 timestamp as soon as kured identifies that a node needs to be rebooted
- adds a new node annotation "weave.works/kured-reboot-in-progress" with a value of the current RFC3339 timestamp as soon as kured identifies that a node needs to be rebooted
- removes the annotation "weave.works/kured-reboot-in-progress" when kured has successfully rebooted the node
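A minimal sketch of how such an annotation update can be done with client-go (the real code may use a patch instead of a full node update):

```go
package main

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// annotateNode adds (or refreshes) one of the kured reboot annotations with an
// RFC3339 timestamp of "now".
func annotateNode(client kubernetes.Interface, node *v1.Node, key string) error {
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[key] = time.Now().Format(time.RFC3339)
	_, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{})
	return err
}
```

For example, kured would set `weave.works/kured-reboot-in-progress` this way just before draining, and remove the key again once the node has rebooted and been uncordoned.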
This changes the pre-reboot drain functionality so that it always runs, regardless of the value of the Unschedulable node property.
Because kubectl drain is idempotent, we shouldn't have to worry about whether the node has already been set to Unschedulable (perhaps due to a prior, unsuccessful loop of the kured reboot cycle): we can run it over and over again. And because this drain func actually does a cordon + drain (and it only performs the drain if a cordon is successful), we can be sure that we aren't going to be thrashing this node w/ respect to scheduled pods.
This also fixes an edge case: if the node has been marked Unschedulable out-of-band, but workloads remain Running on this node, kured will no longer reboot the node's underlying VM/machine while it is actively running pods.
Without this, it's possible that the helm chart documentation
contains the `image tag` version which might not be equal to
the version in the helm chart, as it's only an example.
This is confusing, so instead we should use make to edit the
application version everywhere.
This fixes it by updating the Makefile to modify text of the
chart's README using a regex looking for something similar to
a version; then I used the updated makefile to edit the README,
which in turn requires a bump of the version of the chart
itself.
Without this patch, the name of the image is not templated, which
causes the action to fail.
This should fix it, by ensuring the image scan action uses a
templated value, instead of incorrectly relying on shell templating,
which doesn't run in the action.
Without this patch, we are using outdated images in kind cluster
setup.
This should fix it, by removing 1.17 cluster (which is not tested
anymore), and updating 1.19 images.
Until a new alpine image is created, we should ensure the latest
packages are used, and therefore we should upgrade default
installed packages.
Without this patch, we'll have outdated and vulnerable packages
until a new 3.12 image is released.
This is a problem, as we'll publish broken images.
This should temporarily work around it, at the expense of larger
images (they contain the package cache).
Without this patch, dependabot will still try to bump some k8s
dependencies.
This is a problem, as we need to bump them together, manually.
This should fix it by removing them all from dependabot.
We are now testing the helm charts on each PR. They are now
ensured to be passing our tests and reviewed before merging.
This also means that the merged changes in the master branch
are reliable, and therefore can be consumed immediately.
Currently, we are waiting for a release to publish a helm
chart.
This is a problem as it means that the helm chart will
always lag behind, and we'll miss a few semantic versions,
if for example the helm chart is adapted multiple times
before the next release.
This should fix it by ensuring ALL the merged changes in
our helm chart will result in a new published helm chart.
Without this patch, chart linting will fail: more than two
spaces are needed before a comment in the helm chart values.
This fixes it by adding one more space, and moving the whole block
of comments for consistency.
This ensures we bump the code for 1.20.
It updates the testing to ensure kured works on a 1.20 cluster,
removes the testing on 1.17 (as it is now deprecated).
Libraries remain on 1.19, to avoid breaking 1.18 clusters.
Without this patch, the PR jobs are broken and no jobs are running.
This was a recently introduced typo in the last refactor of the
PR jobs.
This should fix it, and make the PR tests work again.
Without this, the golang version used is the golang version decided
by github.
This is a problem, as it might shift over time, without our control.
This fixes it by getting the golang version from the go.mod.
Without this patch, we'll get kubernetes updates.
This is not necessary, and could be even a problem on merge:
those kubernetes updates are done separately, knowingly,
to respect the life cycle of the kubernetes we need
(and stay one version below latest to have a larger coverage
of versions).
We could keep dependabot to update those on a lower frequency,
but that sounds clunky and not great. Instead disable them all,
and rely on the team to do this regular maintenance work.
There is a lot of duplicated code in this workflow.
This fixes it by making a unique job with parameters. The
matrix buys us the parallelisation and the fail-fast.
Without this patch, the lint action incorrectly returns everything
is fine.
This is a problem, as lint effectively is not running, and
therefore we could merge broken charts.
This fixes it by updating to the latest practices you can find
in the official chart-repo-actions.
(See the official example in
i1a9640d998/.github/workflows/lint-test.yaml)
- Made all the file extensions ".yaml"
- Regrouped actions together to make it easy to see when they
are useful: on-pr is useful at every PR, on-tag when we are
ready to tag next image, on-pr-chart when we have a PR to
modify the chart with the published image, on-release when
we have released and need to publish the final helm chart
- Regrouped periodic jobs together, to deal with stale prs/issues
and ensuring that our helm chart always works.
When a failure happens and the cluster doesn't manage to come
back up in time, we exit 1 and don't show the docker logs.
This is a problem, as we would benefit from a detailed docker
output on those cases, when debugging.
This fixes it by ensuring the logging is always done at the
exit of the script.
We don't need to test with kustomize, manifest testing is good
enough, as we just test that the manifests are correct, not that
they are functional (which would require a change in the poll time).
This extends our test coverage for kured-* manifest changes on PRs,
and any eventual changes in kubernetes/kubectl on periodics.
Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
In the past, we had lint issues which were merged into the code,
and/or lint changed without us adapting our code.
This should allow us to stay on top of linting issue by
highlighting them in PRs.
Without this patch, we might hold old issues and PRs for a long
time. Instead we should close them. People can reopen if necessary.
This would show that we have a proper triage process, and a proper
way to handle those.
Without this patch, we need to build a package cache and then remove it.
Since apk can work with --no-cache and won't leave artifacts,
we should use it.
This will make the dockle best practices scanner happier.
Without this patch, there is no way we can see, in the development
process, if the image we are about to publish is insecure.
This is a problem as we might be releasing new versions of kured
with outdated base image which contains vulnerabilities.
This fixes it by creating a job which will show any eventual
vulnerability.
This automates the manifest and helm chart version handling.
Just pass the organisation and version in the make command to
update the manifests/helm charts.
This does not automate the helm chart version and does not
create a manifest used in the release process.
Without this patch, we don't test on release whether kured actually
works and behaves well.
This is a problem, as a functional issue could have been hidden by
a recent change, as our testing is minimalist (only test the
usability, not the functionality).
Instead of testing manually, we should ensure this in CI.
This fixes it by adding a github action which tests the previously
built artifacts before publishing a release. The job consumes the helm
chart in our code tree (note: this relies on the last released image),
and runs a functional test triggering a coordinated restart of a
whole 5-node cluster deployed with kind, through github actions.
Note: The github action needs to reset docker configuration, else
the reboot of the node (a docker container in kind) will fail.
It will be correctly triggered, but the node will not come back up,
with its systemd log mentioning: "Failed to attach 1 to compat systemd cgroup".
# Stale by default waits for 60 days before marking PR/issues as stale, and closes them after 21 days.
# Do not expire the first issues that would allow the community to grow.
- uses: actions/stale@v5
  with:
    repo-token: ${{ secrets.GITHUB_TOKEN }}
    stale-issue-message: 'This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).'
    stale-pr-message: 'This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).'
    stale-issue-label: 'no-issue-activity'
    stale-pr-label: 'no-pr-activity'
    exempt-issue-labels: 'good first issue,keep'
    days-before-close: 21
check-docs-links:
  name: Check docs for incorrect links
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Link Checker
      id: lc
      uses: peter-evans/link-checker@v1
      with:
        args: -r *.md *.yaml */*/*.go -x .cluster.local
    - name: Fail if there were link errors
      run: exit ${{ steps.lc.outputs.exit_code }}
vuln-scan:
  name: Build image and scan it against known vulnerabilities
--drain-grace-period int time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default: -1)
--skip-wait-for-delete-timeout int when seconds is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node (default: 0)
--ds-name string name of daemonset on which to place lock (default "kured")
--ds-namespace string namespace containing daemonset on which to place lock (default "kube-system")
--end-time string schedule reboot only before this time of day (default "23:59:59")
--force-reboot bool force a reboot even if the drain is still running (default: false)
--drain-timeout duration timeout after which the drain is aborted (default: 0, infinite time)
-h, --help help for kured
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--lock-release-delay duration hold lock after reboot by this duration (default: 0, disabled)
--lock-ttl duration expire lock annotation after this duration (default: 0, disabled)
--message-template-uncordon string message template used to notify about a node being successfully uncordoned (default "Node %s rebooted & uncordoned successfully!")
--message-template-drain string message template used to notify about a node being drained (default "Draining node %s")
--message-template-reboot string message template used to notify about a node being rebooted (default "Rebooting node %s")
--notify-url url for reboot notifications (cannot be used together with the --slack-hook-url flag)
--period duration reboot check period (default 1h0m0s)
--prefer-no-schedule-taint string Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to "weave.works/kured-node-reboot" to enable tainting.
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-command string command to run when a reboot is required by the sentinel (default "/sbin/systemctl reboot")
--reboot-days strings schedule reboot on these days (default [su,mo,tu,we,th,fr,sa])
--reboot-delay duration add a delay after drain finishes but before the reboot command is issued (default 0, no time)
--reboot-sentinel string path to file whose existence signals need to reboot (default "/var/run/reboot-required")
--reboot-sentinel-command string command for which a successful run signals need to reboot (default ""). If non-empty, sentinel file will be ignored.
--slack-channel string slack channel for reboot notifications
--slack-hook-url string slack hook URL for reboot notifications [deprecated in favor of --notify-url]
--slack-username string slack username for reboot notifications (default "kured")
--start-time string schedule reboot only after this time of day (default "0:00")
--time-zone string use this timezone for schedule inputs (default "UTC")
--log-format string log format specified as text or json, defaults to "text"
```
### Reboot Sentinel File & Period
values with `--reboot-sentinel` and `--period`. Each replica of the
daemon uses a random offset derived from the period on startup so that
nodes don't all contend for the lock simultaneously.
### Reboot Sentinel Command
Alternatively, a reboot sentinel command can be used. If a reboot
sentinel command is used, the reboot sentinel file presence will be
ignored. When the command exits with code `0`, kured will assume
that a reboot is required.
For example, if you're using RHEL or its derivatives, you can
set the sentinel command to `sh -c "! needs-restarting --reboothint"`
(by default the command will return `1` if a reboot is required,
so we wrap it in `sh -c` and add `!` to negate the return value).
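Under the hood this check reduces to running the configured command and looking only at its exit status; a hedged Go sketch (kured actually executes the command inside the host's namespaces, which is omitted here):

```go
package main

import (
	"os/exec"

	log "github.com/sirupsen/logrus"
)

// sentinelCommandSignalsReboot returns true when the configured sentinel
// command exits 0, i.e. a reboot is required.
func sentinelCommandSignalsReboot(sentinelCommand []string) bool {
	if err := exec.Command(sentinelCommand[0], sentinelCommand[1:]...).Run(); err != nil {
		// A non-zero exit (or a failure to run) means no reboot is required.
		log.Debugf("Sentinel command returned: %v", err)
		return false
	}
	return true
}
```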
We recommend setting `--slack-username` to be the name of the
environment, e.g. `dev` or `prod`.
Alternatively you can use the `--message-template-drain`, `--message-template-reboot` and `--message-template-uncordon` to customize the text of the message, e.g.
```cli
--message-template-drain="Draining node %s part of *my-cluster* in region *xyz*"
```
Here is the syntax:
- slack: `slack://tokenA/tokenB/tokenC`
(`slack://<USERNAME>@tokenA/tokenB/tokenC` - in case you want to [respect username](https://github.com/weaveworks/kured/issues/482))
(`--slack-hook-url` is deprecated but possible to use)
> NB the `-` at the end of the command is important - it instructs
> `kubectl` to remove that annotation entirely.
### Automatic Unlock
In exceptional circumstances (especially when used with cluster-autoscaler) a node
which holds the lock might be killed, and the annotation would then stay there forever.
Using `--lock-ttl=30m` will allow other nodes to take over if the TTL has expired (in this case 30min) and continue the reboot process.
### Delaying Lock Release
Using `--lock-release-delay=30m` will cause nodes to hold the lock for the specified time frame (in this case 30min) before it is released and the reboot process continues. This can be used to throttle reboots across the cluster.
## Building
Kured now uses [Go
Modules](https://github.com/golang/go/wiki/Modules), so build
instructions vary depending on where you have checked out the
repository:
**Building outside $GOPATH:**
```console
make
```
**Building inside $GOPATH:**
```console
GO111MODULE=on make
```
You can find the current preferred version of Golang in the [go.mod file](go.mod).
If you are interested in contributing code to kured, please take a look at
| `configuration.preferNoScheduleTaint` | Taint name applied during pending node reboot | `""` |
| `configuration.preRebootNodeLabels` | Array of key-value-pairs to add to nodes before cordoning for multiple cli-parameters `--pre-reboot-node-labels` | `[]` |
| `configuration.postRebootNodeLabels` | Array of key-value-pairs to add to nodes after uncordoning for multiple cli-parameters `--post-reboot-node-labels` | `[]` |
| `rbac.create` | Create RBAC roles | `true` |
| `serviceAccount.create` | Create a service account | `true` |
| `serviceAccount.name` | Service account name to create (or use if `serviceAccount.create` is false) | (chart fullname) |
| `containerSecurityContext.allowPrivilegeEscalation` | Enables `allowPrivilegeEscalation` in container-specific security context. If not set, it won't be configured. | |
| `resources` | Resources requests and limits. | `{}` |
| `metrics.create` | Create a ServiceMonitor for prometheus-operator | `false` |
| `metrics.namespace` | The namespace to create the ServiceMonitor in | `""` |
| `metrics.labels` | Additional labels for the ServiceMonitor | `{}` |
| `metrics.interval` | Interval prometheus should scrape the endpoint | `60s` |
| `metrics.scrapeTimeout` | A custom scrapeTimeout for prometheus | `""` |
| `service.create` | Create a Service for the metrics endpoint | `false` |
| `service.name` | Service name for the metrics endpoint | `""` |
| `service.port` | Port of the service to expose | `8080` |
| `service.annotations` | Annotations to apply to the service (eg to add Prometheus annotations) | `{}` |
| `priorityClassName` | Priority Class to be used by the pods | `""` |
| `tolerations` | Tolerations to apply to the daemonset (eg to allow running on master) | `[{"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}]` for Kubernetes 1.24.0 and greater, otherwise `[{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}]`|
| `affinity` | Affinity for the daemonset (ie, restrict which nodes kured runs on) | `{}` |
| `nodeSelector` | Node Selector for the daemonset (ie, restrict which nodes kured runs on) | `{}` |
| `volumeMounts` | Maps of volumes mount to mount | `{}` |
| `volumes` | Maps of volumes to mount | `{}` |
See https://github.com/weaveworks/kured#configuration for the values (not contained in the `configuration` object) that can be passed via `extraArgs`. Note that
```yaml
extraArgs:
  foo: 1
  bar-baz: 2
```
becomes `/usr/bin/kured ... --foo=1 --bar-baz=2`.
## Prometheus Metrics
Kured exposes a single prometheus metric indicating whether a reboot is required or not (see the [kured docs](https://github.com/weaveworks/kured#prometheus-metrics) for details).
# endTime: "" # only reboot before this time of day (default "23:59")
# lockAnnotation: "" # annotation in which to record locking node (default "weave.works/kured-node-lock")
period:"1m"# reboot check period (default 1h0m0s)
# forceReboot: false # force a reboot even if the drain fails or times out (default: false)
# drainGracePeriod: "" # time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default: -1)
# drainTimeout: "" # timeout after which the drain is aborted (default: 0, infinite time)
# skipWaitForDeleteTimeout: "" # when time is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node (default: 0)
# prometheusUrl: "" # Prometheus instance to probe for active alerts
# rebootDays: [] # only reboot on these days (default [su,mo,tu,we,th,fr,sa])
# rebootSentinel: "" # path to file whose existence signals need to reboot (default "/var/run/reboot-required")
# rebootSentinelCommand: "" # command for which a successful run signals need to reboot (default ""). If non-empty, sentinel file will be ignored.
# slackChannel: "" # slack channel for reboot notifications
# slackHookUrl: "" # slack hook URL for reboot notifications
tag:""# will default to the appVersion in Chart.yaml
pullPolicy:IfNotPresent
pullSecrets:[]
updateStrategy:RollingUpdate
# requires RollingUpdate updateStrategy
maxUnavailable:1
podAnnotations:{}
dsAnnotations:{}
extraArgs:{}
extraEnvVars:
# - name: slackHookUrl
#   valueFrom:
#     secretKeyRef:
#       name: secret_name
#       key: secret_key
# - name: regularEnvVariable
#   value: 123
configuration:
  lockTtl: 0 # force clean annotation after this amount of time (default 0, disabled)
  alertFilterRegexp: "" # alert names to ignore when checking for active alerts
  alertFiringOnly: false # only consider firing alerts when checking for active alerts
  blockingPodSelector: [] # label selector identifying pods whose presence should prevent reboots
  endTime: "" # only reboot before this time of day (default "23:59")
  lockAnnotation: "" # annotation in which to record locking node (default "weave.works/kured-node-lock")
  period: "" # reboot check period (default 1h0m0s)
  forceReboot: false # force a reboot even if the drain fails or times out (default: false)
  drainGracePeriod: "" # time in seconds given to each pod to terminate gracefully, if negative, the default value specified in the pod will be used (default: -1)
  drainTimeout: "" # timeout after which the drain is aborted (default: 0, infinite time)
  skipWaitForDeleteTimeout: "" # when time is greater than zero, skip waiting for the pods whose deletion timestamp is older than N seconds while draining a node (default: 0)
  prometheusUrl: "" # Prometheus instance to probe for active alerts
  rebootDays: [] # only reboot on these days (default [su,mo,tu,we,th,fr,sa])
  rebootSentinel: "" # path to file whose existence signals need to reboot (default "/var/run/reboot-required")
  rebootSentinelCommand: "" # command for which a successful run signals need to reboot (default ""). If non-empty, sentinel file will be ignored.
  rebootCommand: "/bin/systemctl reboot" # command to run when a reboot is required by the sentinel
  rebootDelay: "" # add a delay after drain finishes but before the reboot command is issued
  slackChannel: "" # slack channel for reboot notifications
  slackHookUrl: "" # slack hook URL for reboot notifications
  slackUsername: "" # slack username for reboot notifications (default "kured")
  notifyUrl: "" # notification URL with the syntax as follows: https://containrrr.dev/shoutrrr/services/overview/
  messageTemplateDrain: "" # slack message template when notifying about a node being drained (default "Draining node %s")
  messageTemplateReboot: "" # slack message template when notifying about a node being rebooted (default "Rebooted node %s")
  startTime: "" # only reboot after this time of day (default "0:00")
  timeZone: "" # time-zone to use (valid zones from "time" golang package)
  annotateNodes: false # enable 'weave.works/kured-reboot-in-progress' and 'weave.works/kured-most-recent-reboot-needed' node annotations to signify kured reboot operations
  lockReleaseDelay: 0 # hold lock after reboot by this amount of time (default 0, disabled)
  preferNoScheduleTaint: "" # Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to "weave.works/kured-node-reboot" to enable tainting.
  logFormat: "text" # log format specified as text or json, defaults to text
  preRebootNodeLabels: [] # labels to add to nodes before cordoning (default [])
  postRebootNodeLabels: [] # labels to add to nodes after uncordoning (default [])
rbac:
  create: true
serviceAccount:
  create: true
  name:
podSecurityPolicy:
  create: false
containerSecurityContext:
  privileged: true # Give permission to nsenter /proc/1/ns/mnt
  # allowPrivilegeEscalation: true # Needed when using defaultAllowPrivilegedEscalation: false in psp
"Taint name applied during pending node reboot (to prevent receiving additional pods from other rebooting nodes). Disabled by default. Set e.g. to \"weave.works/kured-node-reboot\" to enable tainting.")
"if set, the annotations 'weave.works/kured-reboot-in-progress' and 'weave.works/kured-most-recent-reboot-needed' will be given to nodes undergoing kured reboots")
			// Prefer to not schedule pods onto this node to avoid draining the same pod multiple times.
			preferNoScheduleTaint.Enable()
			continue
		}

		err = drain(client, node)
		if err != nil {
			if !forceReboot {
				log.Errorf("Unable to cordon or drain %s: %v, will release lock and retry cordon and drain before rebooting when lock is next acquired", node.GetName(), err)
				release(lock)
				log.Infof("Performing a best-effort uncordon after failed cordon and drain")
				uncordon(client, node)
				continue
			}
		}

		if rebootDelay > 0 {
			log.Infof("Delaying reboot for %v", rebootDelay)
			time.Sleep(rebootDelay)
		}

		invokeReboot(nodeID, rebootCommand)
		for {
			log.Infof("Waiting for reboot")
			time.Sleep(time.Minute)
		}
	}
}
// buildSentinelCommand creates the shell command line which will need wrapping to escape