diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..843650f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,300 @@
+# Stakater Reloader Project Memory
+
+## Project Purpose
+
+Reloader is a Kubernetes operator that automatically triggers rolling restarts of workloads when the ConfigMaps or Secrets they reference are updated. Without it, Kubernetes does not restart pods when configuration changes; operators must do it manually or rely on GitOps pipelines.
+
+**What it watches**: ConfigMaps, Secrets, Namespaces, and (optionally) `SecretProviderClassPodStatus` (CSI-mounted secrets).
+
+**Workload types it can reload**: Deployment, StatefulSet, DaemonSet, CronJob, Job, Argo Rollout, and OpenShift DeploymentConfig.
+
+**How restarts are triggered**: Two strategies (selected via `--reload-strategy`):
+1. **env-vars** (default): injects an environment variable (`STAKATER_{NAME}_{TYPE}`) into every container with the SHA1 hash of the resource's data. A change in data changes the env var value, causing Kubernetes to restart pods.
+2. **annotations**: writes the SHA1 hash into the pod template's annotations, which also forces a rollout.
+
+**The core problem it solves**: ConfigMaps and Secrets are decoupled from pod lifecycle in Kubernetes. Applications reading config at startup see stale data after a config update unless pods are restarted. Reloader closes that gap automatically and selectively.
+
+**Potential improvements observed**:
+- **Duplicate reload suppression**: If a workload references both a ConfigMap and a Secret that are updated in the same controller reconcile cycle, it may get reloaded twice. Could be solved with a per-workload debounce map keyed by namespace/name/resourceVersion, flushed after a short TTL.
+- **CronJob/Job reload is destructive**: Jobs are deleted and recreated on change, which loses run history. Could instead only annotate the CronJob template without spawning a new Job.
+- **No per-resource reload rate limiting**: A rapid-fire ConfigMap update (e.g., from a CI pipeline) can trigger many restarts. A cooldown window per resource would help.
+- **CSI integration gap**: CSI volumes are watched at the `SecretProviderClassPodStatus` level but the link back to the workload is indirect and may miss edge cases. Needs a direct map from SecretProviderClass → workloads that mount it.
+
+---
+
+## Repo Map
+
+| Path | Owns | Inspect when |
+|---|---|---|
+| `main.go` | Entry point, delegates to `app.Run()` | Never needs changes |
+| `internal/pkg/app/` | `Run()` bootstrap, Cobra command wiring | Startup sequence changes |
+| `internal/pkg/cmd/` | CLI flags parsing, `startReloader()`, controller/HA wiring | Adding new flags or startup behavior |
+| `internal/pkg/controller/` | Informer/queue per resource type, event handlers (Add/Update/Delete) | Watching new resource types, queue tuning |
+| `internal/pkg/handler/` | Per-event handlers (create, update, delete), `doRollingUpgrade()`, pause deployment | Core reload logic changes |
+| `internal/pkg/callbacks/` | Workload-specific get/list/update/patch functions, `RollingUpgradeFuncs` struct | Adding new workload types |
+| `internal/pkg/options/` | All CLI flag variables, defaults, `ArgoRolloutStrategy` type | Adding or renaming flags |
+| `internal/pkg/constants/` | Constants: env var postfixes, annotation prefix, strategy names, HA lock name | Renaming global identifiers |
+| `internal/pkg/metrics/` | Prometheus `Collectors` struct, all metric registration and recording helpers | Adding metrics |
+| `internal/pkg/alerts/` | Slack/Teams/GChat/raw webhook alerting, env var config | Alert sink changes |
+| `internal/pkg/util/` | SHA generation via `crypto/sha.go`, env var name conversion, namespace/label utilities | Utility/hash changes |
+| `internal/pkg/crypto/` | `GenerateSHA(data)`: SHA1 hex digest | Hash algorithm changes |
+| `internal/pkg/leadership/` | Leader election via Kubernetes Lease, HA stop/start of controllers | HA behavior changes |
+| `internal/pkg/testutil/` | Fake Kubernetes objects for unit tests | Writing new tests |
+| `pkg/common/` | `ReloadCheckResult`, `ReloaderOptions`, `ShouldReload()` logic, `Config` struct | Reload decision logic, annotation precedence |
+| `pkg/kube/` | `Clients` struct (k8s + OpenShift + Argo + CSI), `GetKubernetesClient()`, `ResourceMap` | Client initialization, new CRD clients |
+| `deployments/` | Helm chart (`deployments/kubernetes/chart/reloader/`), Kustomize manifests | Helm values, RBAC, deployment config |
+| `docs/` | User-facing annotation documentation, architecture notes | Writing docs or confirming annotation behavior |
+| `scripts/` | Shell scripts used by CI and Makefile | Build/release pipeline |
+| `test/loadtest/` | Load test CLI (`cmd/loadtest`), 13 scenarios (S1–S13), Kind cluster setup | Performance testing, regression benchmarks |
+| `.github/` | CI workflows: lint, test, Kind e2e, multi-arch Docker build, release | CI changes |
+
+---
+
+## Core Runtime Flow
+
+**1. Entry**: `main.go:10` calls `app.Run()`.
+
+**2. CLI Init**: `internal/pkg/app/app.go` calls `cmd.NewReloaderCommand()` which registers all Cobra flags from `options/flags.go` and runs `startReloader()`.
+
+**3. Client Setup**: `pkg/kube/client.go` builds `kube.Clients` with:
+- `kubernetes.Interface`: standard k8s client
+- `appsclient.Interface`: OpenShift client (auto-detected by probing `deploymentconfigs`)
+- `argorollout.Interface`: if `--is-Argo-Rollouts=true`
+- `csiclient.Interface`: if `--enable-csi-integration`
+
+**4. Controller Creation**: `startReloader()` iterates `kube.ResourceMap` (configmaps, secrets, namespaces, and optionally secretproviderclasspodstatuses) and calls `controller.NewController()` for each resource in each watched namespace.
+
+**5. Informer/Queue** (`controller.NewController()`):
+- Creates a `cache.NewFilteredListWatchFromClient` with label/field selectors.
+- Registers `Add`, `Update`, `Delete` event handlers.
+- Creates a `workqueue.TypedRateLimitingQueue` for async processing.
+
+**6. Event Detection**:
+- `Add`: enqueues only if `ReloadOnCreate` is enabled (skips during initial sync unless `SyncAfterRestart`).
+- `Update`: compares SHA of old vs new object data; enqueues only on real changes.
+- `Delete`: enqueues only if `ReloadOnDelete` is enabled.
+- Namespace events update `selectedNamespacesCache` for namespace-selector filtering.
+
+**7. Handler Dispatch**: The queue worker calls `handler.Handle()` on the dequeued item. Three handler types:
+- `ResourceCreatedHandler` (`create.go`): fires `doRollingUpgrade` or sends webhook.
+- `ResourceUpdatedHandler` (`update.go`): fires `doRollingUpgrade` or sends webhook.
+- `ResourceDeleteHandler` (`delete.go`): calls `invokeDeleteStrategy` (removes env vars or clears annotation).
+
+**8. Workload Discovery**: `doRollingUpgrade()` (`upgrade.go:181`) calls `rollingUpgrade()` for each workload type. For each type, `ItemsFunc` lists all workloads in the namespace, then `pkg/common.ShouldReload()` checks annotations to decide which ones need reloading.
+
+**9. Reload Execution**: `invokeReloadStrategy()` either:
+- **env-vars**: mutates container env vars; uses JSON patch if `SupportsPatch=true`, full update otherwise.
+- **annotations**: writes SHA to pod template annotations; same patch/update split.
+
+**10. Post-reload**: optionally pauses the Deployment via `pause_deployment.go`, records Kubernetes Events via `recorder`, updates Prometheus metrics, sends alert webhooks.
+
+**HA Mode**: if `--enable-ha`, `internal/pkg/leadership/` runs Kubernetes Lease-based leader election. Only the leader runs controllers; losing leadership stops them and marks the pod unhealthy.
+
+**HTTP Server**: port `:9090` serves `/metrics` (Prometheus) and liveness/readiness probes.
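Steps 6 and 9 above both rest on hashing resource data. The sketch below shows one way the change check and the `STAKATER_{NAME}_{TYPE}` env var could be derived. `hashData`, `envVarName`, and `changed` are hypothetical names, and the sorted `key=value;` joining and the name sanitization rule are assumptions, not Reloader's actual code (that lives in `internal/pkg/util/` and `internal/pkg/crypto/`).

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// hashData returns a SHA1 hex digest over the key/value pairs.
// Keys are sorted first so the digest does not depend on Go's
// randomized map iteration order. The "k=v;k=v" layout is an assumption.
func hashData(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+data[k])
	}
	sum := sha1.Sum([]byte(strings.Join(parts, ";")))
	return hex.EncodeToString(sum[:])
}

// envVarName builds STAKATER_{NAME}_{TYPE}. Each character that is not
// A-Z or 0-9 after upper-casing becomes "_" (assumed sanitization rule).
func envVarName(name, kind string) string {
	var b strings.Builder
	for _, r := range strings.ToUpper(name) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteRune('_')
		}
	}
	return "STAKATER_" + b.String() + "_" + strings.ToUpper(kind)
}

// changed reports whether an Update event carries a real data change.
func changed(oldData, newData map[string]string) bool {
	return hashData(oldData) != hashData(newData)
}

func main() {
	oldCM := map[string]string{"LOG_LEVEL": "info"}
	newCM := map[string]string{"LOG_LEVEL": "debug"}
	if changed(oldCM, newCM) {
		// The new hash is what the env-vars strategy would write into each container.
		fmt.Printf("%s=%s\n", envVarName("app-config", "configmap"), hashData(newCM))
	}
}
```

Because the env var value is a digest of the data, a metadata-only update (labels, annotations on the ConfigMap itself) leaves the value unchanged and triggers no rollout in this sketch.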
+
+---
+
+## Reload Behavior And Annotations
+
+All annotation names are configurable via CLI flags; the values below are defaults.
+
+### Trigger Annotations (on workloads)
+
+| Annotation | Value | Behavior |
+|---|---|---|
+| `reloader.stakater.com/auto` | `"true"` | Reload on change to **any** ConfigMap or Secret referenced by the workload (via envFrom, env valueFrom, or volumes) |
+| `configmap.reloader.stakater.com/auto` | `"true"` | Reload on change to **any referenced ConfigMap** only |
+| `secret.reloader.stakater.com/auto` | `"true"` | Reload on change to **any referenced Secret** only |
+| `secretproviderclass.reloader.stakater.com/auto` | `"true"` | Reload on change to **any referenced SecretProviderClass** only |
+| `configmap.reloader.stakater.com/reload` | `"cm1,cm2"` | Reload only when the **named ConfigMaps** change (regex supported) |
+| `secret.reloader.stakater.com/reload` | `"sec1,sec2"` | Reload only when the **named Secrets** change (regex supported) |
+| `secretproviderclass.reloader.stakater.com/reload` | `"spc1"` | Reload only when the **named SecretProviderClass** changes |
+| `reloader.stakater.com/search` | `"true"` | Reload when any ConfigMap/Secret tagged with `reloader.stakater.com/match: "true"` changes |
+
+### Exclude Annotations (on workloads)
+
+| Annotation | Value | Behavior |
+|---|---|---|
+| `reloader.stakater.com/ignore` | `"true"` | Skip this workload entirely |
+| `configmaps.exclude.reloader.stakater.com/reload` | `"cm1,cm2"` | Exclude these named ConfigMaps from triggering reload |
+| `secrets.exclude.reloader.stakater.com/reload` | `"sec1,sec2"` | Exclude these named Secrets |
+| `secretproviderclasses.exclude.reloader.stakater.com/reload` | `"spc1"` | Exclude these named SecretProviderClasses |
+
+### Behavior Annotations (on workloads)
+
+| Annotation | Value | Behavior |
+|---|---|---|
+| `reloader.stakater.com/rollout-strategy` | `"restart"` or `"rollout"` | For Argo Rollouts: `"restart"` uses restartAt, `"rollout"` (default) uses full rollout update |
+| `deployment.reloader.stakater.com/pause-period` | Go duration e.g. `"30s"` | Pause Deployment for this duration after reload |
+| `deployment.reloader.stakater.com/paused-at` | RFC3339 timestamp | Set by Reloader to track pause start time; do not set manually |
+
+### Search/Match Pattern
+
+The `reloader.stakater.com/search` annotation on a workload pairs with `reloader.stakater.com/match: "true"` on a ConfigMap or Secret. Any workload with `search: true` will reload when any `match: true` resource changes.
+
+### Global Flag Overrides
+
+- `--auto-reload-all`: reload all workloads on any ConfigMap/Secret change; annotation not required.
+- `--resources-to-ignore=configMaps` or `=secrets`: skip one type entirely.
+- `--ignored-workload-types=jobs,cronjobs`: skip Job and CronJob reload.
+- `--namespaces-to-ignore`: comma-separated namespace names to skip.
+- `--namespace-selector`: only watch namespaces with matching labels.
+- `--resource-label-selector`: only watch ConfigMaps/Secrets with matching labels.
+
+### Precedence Rules
+
+1. `reloader.stakater.com/ignore: "true"` overrides everything: the workload is skipped.
+2. Exclude annotations override include annotations for specific named resources.
+3. Named annotations (`.../reload`) are checked before auto annotations.
+4. `--auto-reload-all` is the lowest-priority fallback (only applies if no annotation matches).
+5. Annotations are checked on both the workload and its pod template (pod template takes precedence in some paths; verify in `pkg/common/common.go:ShouldReload()`).
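The precedence rules above can be sketched as a small decision function for a single changed ConfigMap. This is not the real `ShouldReload()` in `pkg/common/common.go`: `shouldReloadForConfigMap` and its helpers are hypothetical, exact matching for exclude lists is an assumption, and so is requiring a reference for `--auto-reload-all`. Secrets, the search/match pattern, and pod-template annotations are left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Default annotation names taken from the tables above.
const (
	ignoreAnn  = "reloader.stakater.com/ignore"
	autoAnn    = "reloader.stakater.com/auto"
	cmAutoAnn  = "configmap.reloader.stakater.com/auto"
	cmNamedAnn = "configmap.reloader.stakater.com/reload"
	cmExclAnn  = "configmaps.exclude.reloader.stakater.com/reload"
)

// splitList returns the trimmed, non-empty entries of a comma-separated value.
func splitList(v string) []string {
	var out []string
	for _, s := range strings.Split(v, ",") {
		if s = strings.TrimSpace(s); s != "" {
			out = append(out, s)
		}
	}
	return out
}

// listed reports whether name appears verbatim in the list
// (assumed rule for exclude annotations).
func listed(list, name string) bool {
	for _, s := range splitList(list) {
		if s == name {
			return true
		}
	}
	return false
}

// matches reports whether name fully matches any entry treated as a regex.
// Entries that fail to compile are skipped.
func matches(list, name string) bool {
	for _, p := range splitList(list) {
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err == nil && re.MatchString(name) {
			return true
		}
	}
	return false
}

// shouldReloadForConfigMap applies the documented precedence for one changed
// ConfigMap. referenced says whether the workload uses the ConfigMap via
// envFrom, env valueFrom, or a volume.
func shouldReloadForConfigMap(ann map[string]string, cmName string, referenced, autoReloadAll bool) bool {
	if ann[ignoreAnn] == "true" {
		return false // rule 1: ignore overrides everything
	}
	if listed(ann[cmExclAnn], cmName) {
		return false // rule 2: excludes override includes
	}
	if matches(ann[cmNamedAnn], cmName) {
		return true // rule 3: named annotations before auto
	}
	if referenced && (ann[autoAnn] == "true" || ann[cmAutoAnn] == "true") {
		return true
	}
	return referenced && autoReloadAll // rule 4: lowest-priority fallback
}

func main() {
	ann := map[string]string{cmNamedAnn: "cm1,app-.*"}
	fmt.Println(shouldReloadForConfigMap(ann, "app-config", false, false))
	fmt.Println(shouldReloadForConfigMap(ann, "unrelated", true, false))
}
```

Writing the order down as early returns makes it easy to check a proposed annotation change against rules 1 to 4 before touching the real function.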
+
+---
+
+## Workload Support
+
+| Workload | SupportsPatch | Update Mechanism | Key files |
+|---|---|---|---|
+| **Deployment** | Yes | JSON patch or full update | `callbacks/rolling_upgrade.go`, `handler/upgrade.go:38` |
+| **StatefulSet** | Yes | JSON patch or full update | `callbacks/rolling_upgrade.go`, `handler/upgrade.go:109` |
+| **DaemonSet** | Yes | JSON patch or full update | `callbacks/rolling_upgrade.go`, `handler/upgrade.go:91` |
+| **CronJob** | No | Creates a new Job from CronJob spec (adds `cronjob.kubernetes.io/instantiate: manual`) | `callbacks.CreateJobFromCronjob`, `handler/upgrade.go:55` |
+| **Job** | No | Deletes old Job, creates new one (strips ResourceVersion, UID, Status, controller labels) | `callbacks.ReCreateJobFromjob`, `handler/upgrade.go:73` |
+| **Argo Rollout** | No | Full update via Argo Rollouts client | `callbacks.UpdateRollout`, `handler/upgrade.go:127`; requires `--is-Argo-Rollouts=true` |
+| **DeploymentConfig** | Yes | OpenShift DeploymentConfigs API | `callbacks/rolling_upgrade.go`; auto-detected by probing `deploymentconfigs` |
+
+**Reload flow per workload**: `doRollingUpgrade()` → `rollingUpgrade()` per type → `ItemsFunc` lists workloads → `ShouldReload()` filters → `invokeReloadStrategy()` patches or updates → optional pause + metrics + alert.
+
+---
+
+## CSI Support
+
+**Enabled by**: `--enable-csi-integration`
+
+**What is watched**: `SecretProviderClassPodStatus` resources (from `sigs.k8s.io/secrets-store-csi-driver`). Resource name constant: `constants.SecretProviderClassController = "secretproviderclasspodstatuses"`.
+
+**How it works**:
+1. The CSI driver injects secrets into pods as volume mounts and tracks injection state via `SecretProviderClassPodStatus` objects.
+2. Reloader watches these objects for version changes.
+3. When a version change is detected, it computes a SHA of the object's IDs and versions.
+4. It then looks up the referenced `SecretProviderClass` and treats the event like a Secret update, triggering workload reloads.
+
+**Workload annotation**: `secretproviderclass.reloader.stakater.com/reload: "my-spc"` or `secretproviderclass.reloader.stakater.com/auto: "true"`.
+
+**Required**: CSI CRDs must be installed in the cluster. Reloader auto-detects their presence at startup.
+
+**Env var postfix**: `STAKATER_{NAME}_SECRETPROVIDERCLASS`.
+
+**Known limitations**:
+- Only works for secrets mounted as volumes via CSI, not env-var-based CSI injection.
+- The link from `SecretProviderClassPodStatus` → workload is indirect; edge cases may be missed.
+- Requires the CSI driver CRDs to be pre-installed; Reloader won't start the CSI controller if the CRDs are absent.
+
+---
+
+## Build, Test, And Run Commands
+
+**Go version**: `go 1.26.2` (from `go.mod`)
+
+| Purpose | Command |
+|---|---|
+| Run locally | `go run ./main.go` |
+| Build binary | `make build` → `go build -o Reloader` |
+| Unit tests | `make test` → `go test -timeout 1800s -v ./...` |
+| Lint | `make lint` → `golangci-lint run ./...` (v2.6.1) |
+| Docker build (single arch) | `make build-image ARCH=amd64` |
+| Docker push | `make push` |
+| Full release (build+push+manifest) | `make release ARCH=amd64` |
+| Multi-arch release | `make release-all` |
+| Generate k8s manifests | `make k8s-manifests` (Kustomize v5.3.0) |
+| Load test (quick) | `make loadtest-quick LOADTEST_OLD_IMAGE=... LOADTEST_NEW_IMAGE=...` (runs S1, S4, S6) |
+| Load test (full) | `make loadtest-full LOADTEST_OLD_IMAGE=... LOADTEST_NEW_IMAGE=...` |
+| Load test (custom) | `make loadtest LOADTEST_SCENARIOS=S1,S3 LOADTEST_DURATION=120` |
+
+**Docker image**: `ghcr.io/stakater/reloader`, multi-arch (amd64, arm64, arm), distroless nonroot base.
+
+**Helm chart**: `deployments/kubernetes/chart/reloader/`; install via Helm or `kubectl apply -f deployments/kubernetes/reloader.yaml`.
+
+---
+
+## Coding Conventions
+
+**Package boundaries**: Each `internal/pkg/` package has a single clear responsibility. Cross-package access goes through exported types/functions only.
+
+**Error handling**: `logrus.Errorf(...)` for non-fatal, `logrus.Fatalf(...)` for startup failures. Errors are returned up the call stack and logged at the point of action, not at every layer. Retry uses `k8s.io/client-go/util/retry.RetryOnConflict`.
+
+**Logging**: `logrus` with structured fields. Format controlled by `--log-format=json` flag. Log level controlled by `--log-level`. Messages follow the pattern: `"Changes detected in '%s' of type '%s' in namespace '%s'"`.
+
+**Kubernetes client patterns**: All k8s operations go through the `kube.Clients` struct. Use `context.TODO()` for context (no request-scoped contexts). List/watch via informers, not polling.
+
+**Callback pattern**: Workload-specific logic is encapsulated in `callbacks.RollingUpgradeFuncs` structs returned by `handler.Get*RollingUpgradeFuncs()`. To add a new workload type, add a new `RollingUpgradeFuncs` factory function and call it in `doRollingUpgrade()`.
+
+**Test style**: Standard `testing.T`, `testify/assert`. Fake k8s objects via `testutil/kube.go`. Tests live alongside source in the same package. Large integration-style tests in `handler/upgrade_test.go`.
+
+**Naming patterns**:
+- Annotation variables: `XxxUpdateOnChangeAnnotation`, `XxxReloaderAutoAnnotation`
+- Callback funcs: `GetXxxItem`, `GetXxxItems`, `UpdateXxx`, `PatchXxx`
+- Handler factories: `GetXxxRollingUpgradeFuncs()`
+
+**Adding new behavior**: Add flag to `options/flags.go` + `common.ReloaderOptions` struct → wire in `cmd/reloader.go` → implement logic in `handler/` or `callbacks/` → add metrics recording → write tests in `*_test.go`.
+
+---
+
+## Gotchas And Risks
+
+**Duplicate reloads**: If a workload references multiple ConfigMaps/Secrets and all change simultaneously, each change event fires a separate reload. No deduplication exists within a reconcile window. This can cause unnecessary rolling restarts.
+
+**Controller init guard**: `secretControllerInitialized` and `configmapControllerInitialized` booleans in `controller/controller.go` prevent processing Add events during the initial list/sync (to avoid reloading everything on startup). If `--sync-after-restart` is set, both are pre-set to `true`, bypassing the guard. Be careful when this interacts with `--reload-on-create`.
+
+**Namespace filtering**: `--namespaces-to-ignore` does a name match; `--namespace-selector` watches namespaces by label and caches them in `selectedNamespacesCache`. The cache is updated on Namespace Add/Update/Delete events. A race between cache population and first ConfigMap event could cause missed reloads on startup in label-selected deployments.
+
+**RBAC**: Reloader requires get/list/watch on secrets and configmaps, and get/list/watch/update/patch on all workload types it manages. Missing RBAC silently causes no reloads (not an error, just empty lists). Check ClusterRole in `deployments/kubernetes/chart/reloader/templates/`.
+
+**GitOps drift**: If a GitOps tool (Flux, ArgoCD) manages the same Deployments, changes made by Reloader can be detected as drift and reverted. Both strategies mutate the pod template (`env-vars` edits container env vars, `annotations` edits pod template annotations), so neither is invisible to a sync tool. The upstream README recommends `--reload-strategy=annotations` for tools such as ArgoCD; confirm against the tool's diff and ignore settings before relying on either strategy.
+
+**Annotation precedence edge case**: Annotations are checked first on the workload object, then on the pod template. If both are set to conflicting values, the behavior depends on which path `ShouldReload()` hits first. Verify in `pkg/common/common.go`.
+
+**CronJob/Job destructive reload**: Job recreation deletes the old Job. Any in-flight pod from that Job will be terminated. This is intentional but surprising. There is no protection for long-running jobs.
+
+**OpenShift DeploymentConfig**: Auto-detected by probing for the `deploymentconfigs` resource. If the probe fails at startup, OpenShift support is silently disabled. Check `pkg/kube/client.go`.
+
+**Argo Rollouts**: Must be explicitly enabled via `--is-Argo-Rollouts=true`. Without it, Rollout objects are never listed. `SupportsPatch=false` means full object updates are used, so be aware of potential conflicts with Argo's own controller.
+
+**CSI rotation behavior**: `SecretProviderClassPodStatus` is updated by the CSI driver when secrets rotate. Reloader reacts to those updates. However, if the CSI driver updates the status in a way that doesn't change the versions Reloader tracks, the reload will be missed.
+
+**Backward compatibility**: Annotation names are configurable, so changing defaults would break existing clusters. Never change default annotation values without a migration path.
+
+**Tests to update for risky changes**: `handler/upgrade_test.go` (large suite covering all workload types), `controller/controller_test.go` (event handling), `pkg/common/common_test.go` (reload decision logic).
+
+---
+
+## Open Questions
+
+- **Exact `ShouldReload()` precedence**: The code in `pkg/common/common.go` checks annotations in a specific order. The exact tie-breaking when both workload-level and pod-template-level annotations are set should be verified by reading that function fully before making annotation behavior changes.
+- **CSI → workload mapping**: How exactly does Reloader map a `SecretProviderClassPodStatus` change back to workloads? Is it via the SecretProviderClass name matching an annotation on the workload, or via volume reference scanning? Needs confirmation before adding CSI-related features.
+- **`ContainerPatchPathFunc` field**: `RollingUpgradeFuncs` has a `ContainerPatchPathFunc` field but it is not documented; it is unclear if or how it differs from `ContainersFunc` in patch scenarios.
+- **Webhook vs alert**: `--webhook-url` replaces reloading with a POST request. The `ALERT_WEBHOOK_URL` env var sends an alert *after* reloading. These are two different mechanisms; the naming is confusing and easy to conflate.
+- **Load test scenarios S7–S13**: Only S1, S4, and S6 are confirmed from CI. The behavior and coverage of the remaining scenarios are unknown without reading `test/loadtest/` in full.
+- **`SyncAfterRestart` semantics**: Flag docs say it "syncs add events after restart" but only if `ReloadOnCreate` is also true. The interaction between these two flags in HA mode (where controllers restart on leader change) needs verification.
+
+---
+
+## Important Files
+
+| File | Description |
+|---|---|
+| `internal/pkg/cmd/reloader.go` | `startReloader()`: main wiring of clients, controllers, HA, and HTTP server |
+| `internal/pkg/handler/upgrade.go` | `doRollingUpgrade()` + all `Get*RollingUpgradeFuncs()` factories |
+| `internal/pkg/callbacks/rolling_upgrade.go` | All workload-specific get/update/patch implementations |
+| `pkg/common/common.go` | `ShouldReload()`: the annotation decision tree |
+| `internal/pkg/options/flags.go` | Every configurable option with defaults |
+| `internal/pkg/controller/controller.go` | Informer setup, queue, event handlers |
+| `pkg/kube/client.go` | Multi-client initialization and OpenShift/CSI detection |
+| `internal/pkg/handler/pause_deployment.go` | Pause/resume deployment logic with timers |
+| `internal/pkg/leadership/leadership.go` | HA leader election |
+| `internal/pkg/metrics/prometheus.go` | All Prometheus collector definitions |
+| `internal/pkg/alerts/alert.go` | Slack/Teams/GChat alerting |
+| `internal/pkg/constants/constants.go` | Global constants (env var prefixes, annotation prefix, strategy names) |
+| `deployments/kubernetes/chart/reloader/values.yaml` | Helm chart defaults; source of truth for production config |
+| `internal/pkg/handler/upgrade_test.go` | Largest test suite; must be updated for any reload logic change |
+| `Makefile` | All build/test/release/loadtest commands |