When a ManifestWorkReplicaSet's placementRef was changed, the
ManifestWorks created for the old placement were not deleted,
causing orphaned resources.
The deployReconciler only processed placements currently in the spec
and never cleaned up ManifestWorks from removed placements.
This commit adds cleanup logic that:
- Builds a set of current placement names from the spec
- Lists all ManifestWorks belonging to the ManifestWorkReplicaSet
- Deletes any ManifestWorks with placement labels not in current spec
Also adds comprehensive tests:
- Integration test verifying placement change cleanup
- Unit tests for single and multiple placement change scenarios
Fixes#1203🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Jian Qiu <jqiu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
* Upgrade addon framework
Signed-off-by: zhujian <jiazhu@redhat.com>
* Use specific addon template instead of default in CSR functions
- Pass real ManagedClusterAddOn to GetDesiredAddOnTemplate instead of nil
- Enable per-addon template selection using addon.Status.ConfigReferences
- Replace utilruntime.HandleError with explicit error returns
- Update CSRConfigurationsFunc to return ([]RegistrationConfig, error)
- Update CSRSignerFunc to return ([]byte, error)
- Add addon parameter to CSR functions for better context
- Convert runtime errors to structured logging with cluster/addon context
- Update tests to verify error conditions
This allows each ManagedClusterAddOn instance to use its specific template
configuration rather than falling back to the ClusterManagementAddon default.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Fix error assertion logic in registration tests and improve error handling
- Fix inverted error assertion logic in TestTemplateCSRConfigurationsFunc and TestTemplateCSRSignFunc
- Change tests to properly check if expectedErr is empty vs non-empty
- When no error expected, assert err == nil; when error expected, assert err != nil and contains substring
- Fix strings.Contains argument order to check if actual error contains expected substring
- Add nil template checks with proper error messages in CSRSign and PermissionConfig functions
- Improve logging consistency with clusterName/addonName format across CSR functions
- Guard against nil pointer access by checking err == nil before calling err.Error()
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
---------
Signed-off-by: zhujian <jiazhu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Post / images (amd64, addon-manager) (push) Failing after 33s
Post / images (amd64, placement) (push) Failing after 41s
Post / images (amd64, registration) (push) Failing after 40s
Post / images (amd64, registration-operator) (push) Failing after 38s
Post / images (amd64, work) (push) Failing after 36s
Post / images (arm64, addon-manager) (push) Failing after 35s
Post / images (arm64, placement) (push) Failing after 39s
Post / images (arm64, registration) (push) Failing after 34s
Post / images (arm64, registration-operator) (push) Failing after 33s
Post / images (arm64, work) (push) Failing after 35s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Scorecard supply-chain security / Scorecard analysis (push) Failing after 41s
Close stale issues and PRs / stale (push) Failing after 27s
* Fix ManagedClusterAddons not removed when ClusterManagementAddon is deleted
The addon template controller was stopping addon managers immediately when
ClusterManagementAddon was deleted, without waiting for pre-delete jobs
to complete or ManagedClusterAddons to be cleaned up via owner reference
cascading deletion.
This change implements the TODO at line 105 by checking if all
ManagedClusterAddons are deleted before stopping the manager. The controller
now uses field selectors to efficiently query for remaining ManagedClusterAddons
and requeues after 10 seconds if any still exist, allowing time for proper
cleanup.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* add e2e test
Signed-off-by: zhujian <jiazhu@redhat.com>
* return err when stopUnusedManagers failed
Signed-off-by: zhujian <jiazhu@redhat.com>
* Address review comments for addon manager deletion fix
- Use lister instead of API client for better performance
- Add named constant for requeue delay
- Fix test cache synchronization issues
- Improve test coverage from 74.7% to 75.6%
Addresses review feedback from Qiujian16 and CodeRabbit.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Fix e2e test timeout for configmap deletion check
Add explicit 180s timeout for pre-delete job configmap cleanup.
The default 90s timeout was insufficient for the deletion workflow.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Improve error logging in template agent
- Replace utilruntime.HandleError with structured logging in CSR functions
- Add more context to error messages for better debugging
- Use logger.Info for template retrieval errors to provide better visibility
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Use ManagedClusterAddonByName index for efficient lookup
- Replace inefficient list-and-filter with indexed lookup
- Add managedClusterAddonIndexer field to controller struct
- Update comment to accurately describe functionality
- Fix unit tests to properly set up the required index
This addresses the PR review feedback to use the existing index
instead of listing all ManagedClusterAddOns and filtering by name.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Remove unused mcaLister field
Since we now use managedClusterAddonIndexer for efficient lookup,
the mcaLister field is no longer needed. This cleanup reduces
memory usage and simplifies the controller structure.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Replace inefficient list-and-filter with indexed lookup in runController
Use managedClusterAddonIndexer.ByIndex() instead of listing all ManagedClusterAddOns
and filtering by name. This provides O(1) indexed lookup instead of O(n) linear scan.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Fix review comments for addon manager deletion
- Fix closure capture bug in controller test by using captured variables
- Fix typo 'copyiedConfig' to 'copiedConfig' in e2e tests
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Optimize ManagedClusterAddOn event handling in addon template controller
Replace filtered event handling with custom event handlers that only trigger
reconciliation when AddOnTemplate configReferences actually change. This
reduces unnecessary reconciliation cycles by using reflect.DeepEqual to
compare config references between old and new objects.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Revert "Optimize ManagedClusterAddOn event handling in addon template controller"
This reverts commit 4649d1b9ac.
Signed-off-by: zhujian <jiazhu@redhat.com>
---------
Signed-off-by: zhujian <jiazhu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
* Fix helm command syntax in e2e test deployment
Remove unnecessary commas from --set flags in helm commands.
Multiple --set flags don't require commas between them.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Add comprehensive image tag validation to e2e tests
- Add configurable expected-image-tag parameter to e2e test suite
- Create image_tag_validation_test.go with comprehensive validation:
* Validates all container images in cluster-manager and klusterlet deployments use expected tag
* Validates test image variables (registrationImage, workImage, singletonImage) end with expected tag
* Validates ClusterManager spec imagePullSpec fields use expected tag
- Pass IMAGE_TAG to e2e tests via expected-image-tag parameter
- Use BeforeEach to avoid duplicate validation checks
This ensures IMAGE_TAG=e2e properly propagates to all OCM components.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Move image validation from separate test to BeforeSuite
- Move all image validation logic from image_tag_validation_test.go to BeforeSuite in e2e_suite_test.go
- Validates test image variables, ClusterManager spec, and deployment containers in setup phase
- Provides early failure detection if image configuration is incorrect
- Removes duplicate test file and function declarations
This ensures image validation runs once during test setup rather than as separate test cases.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
---------
Signed-off-by: zhujian <jiazhu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 54s
Post / coverage (push) Failing after 28s
Post / images (amd64, addon-manager) (push) Failing after 41s
Post / images (amd64, placement) (push) Failing after 23s
Post / images (amd64, registration) (push) Failing after 22s
Post / images (amd64, registration-operator) (push) Failing after 24s
Post / images (amd64, work) (push) Failing after 28s
Post / images (arm64, addon-manager) (push) Failing after 24s
Post / images (arm64, placement) (push) Failing after 26s
Post / images (arm64, registration) (push) Failing after 35s
Post / images (arm64, registration-operator) (push) Failing after 30s
Post / images (arm64, work) (push) Failing after 24s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Close stale issues and PRs / stale (push) Successful in 1m12s
The key queue for clustermanagementaddon informer is not correct for
several controllers, fix it by introducing a new queuekey func
Signed-off-by: Jian Qiu <jqiu@redhat.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 43s
Post / coverage (push) Failing after 47s
Post / images (amd64, addon-manager) (push) Failing after 33s
Post / images (amd64, placement) (push) Failing after 39s
Post / images (amd64, registration) (push) Failing after 32s
Post / images (amd64, registration-operator) (push) Failing after 37s
Post / images (amd64, work) (push) Failing after 39s
Post / images (arm64, addon-manager) (push) Failing after 42s
Post / images (arm64, placement) (push) Failing after 42s
Post / images (arm64, registration) (push) Failing after 36s
Post / images (arm64, registration-operator) (push) Failing after 34s
Post / images (arm64, work) (push) Failing after 27s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
In integration test, there is change that creating cluster fails
since the cluster is created in the test. The alreadyExist
error should be ignored
Signed-off-by: Jian Qiu <jqiu@redhat.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m58s
Post / coverage (push) Failing after 36m24s
Post / images (amd64) (push) Failing after 9m7s
Post / images (arm64) (push) Failing after 8m30s
Post / image manifest (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Close stale issues and PRs / stale (push) Successful in 57s
* Skip manifests in work reconcile that are marked Complete
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Aggregate Complete condition to work from manifests
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Delete work that is complete and satisfies configured TTL
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* tests
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* lint
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* go.mod
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Helper funcs for conditions
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Generic condition aggregation
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Support integration test args
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Remove work deletion from spoke, will be moved to hub GC
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Cleanup
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* update api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Wait for NS to exist before testing
Signed-off-by: Ben Perry <bhperry94@gmail.com>
---------
Signed-off-by: Ben Perry <bhperry94@gmail.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m40s
Post / coverage (push) Failing after 35m43s
Post / images (amd64) (push) Failing after 8m36s
Post / images (arm64) (push) Failing after 8m8s
Post / image manifest (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Close stale issues and PRs / stale (push) Successful in 48s
* Import OCM API changes for workload conditions
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Implement condition rule evaluator
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Evaluate manifest condition rules after apply
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* note to self
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Cleanup
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Return config option if rules are set
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* update api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Always return an error to inform user about the state of their condition rule
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Condition rule errors should not result in retrying apply
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Test condition rule reconciliation
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Return condition status Unknown when an internal CEL error occurs
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Switch to common CEL lib
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update to simplified celExpressions format
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Formatting
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* tidy
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update ocm api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update sdk-go
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Switch to sdk-go ConditionLib
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update API
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Switch to WellKnownConditions with required Condition field
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Support CEL evaluation budget
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update sdk-go
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update API
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* lint
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update go.mod
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Tests and comments
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Move condition reader to status controller for more frequent updates
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Ignore missing WellKnownCondition
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Fix test
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update condition tests
Signed-off-by: Ben Perry <bhperry94@gmail.com>
---------
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Code refactor on registration driver
1. Move driver's options to each drivers implemenation
2. Add BuildClients interface so it is possible
to support other driver later.
3. create a general option to simplify driver init
4. remove option with any type
Signed-off-by: Jian Qiu <jqiu@redhat.com>
* Change to bootstrapKubeConfigFile
Signed-off-by: Jian Qiu <jqiu@redhat.com>
---------
Signed-off-by: Jian Qiu <jqiu@redhat.com>