* Add watch-based feedback with dynamic informer lifecycle management
Implements dynamic informer registration and cleanup for resources
configured with watch-based status feedback (FeedbackScrapeType=Watch).
This enables real-time status updates for watched resources while
efficiently managing resource lifecycle.
Features:
- Automatically register informers for resources with FeedbackWatchType
- Skip informer registration for FeedbackPollType or when not configured
- Clean up informers when resources are removed from manifestwork
- Clean up informers during applied manifestwork finalization
- Clean up informers when feedback type changes from watch to poll
Implementation:
- Refactored ObjectReader to interface for better modularity
- Added UnRegisterInformerFromAppliedManifestWork helper for bulk cleanup
- Enhanced AvailableStatusController to conditionally register informers
- Updated finalization controllers to unregister informers on cleanup
- Added nil safety checks to prevent panics during cleanup
Testing:
- Unit tests for informer registration based on feedback type
- Unit tests for bulk unregistration and nil safety
- Integration test for end-to-end watch-based feedback workflow
- Integration test for informer cleanup on manifestwork deletion
- All existing tests updated and passing
This feature improves performance by using watch-based updates for
real-time status feedback while maintaining efficient resource cleanup.
Signed-off-by: Jian Qiu <jqiu@redhat.com>
* Fallback to get from client when informer is not synced
Signed-off-by: Jian Qiu <jqiu@redhat.com>
---------
Signed-off-by: Jian Qiu <jqiu@redhat.com>
This commit adds validation to detect and reject duplicate manifests
in ManifestWork resources. A manifest is considered duplicate when
it has the same apiVersion, kind, namespace, and name as another
manifest in the same ManifestWork.
This prevents issues where duplicate manifests with different specs
can cause state inconsistency, as the Work Agent applies manifests
sequentially and later entries would overwrite earlier ones.
The validation returns a clear error message indicating the duplicate
manifest's index and the index of its first occurrence.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: xuezhaojun <zxue@redhat.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m3s
Post / images (amd64, addon-manager) (push) Failing after 7m31s
Post / coverage (push) Failing after 9m30s
Post / images (amd64, registration-operator) (push) Failing after 57s
Post / images (amd64, work) (push) Failing after 52s
Post / images (arm64, addon-manager) (push) Failing after 50s
Post / images (arm64, placement) (push) Failing after 52s
Post / images (arm64, registration) (push) Failing after 50s
Post / images (arm64, registration-operator) (push) Failing after 52s
Post / images (arm64, work) (push) Failing after 49s
Post / images (amd64, registration) (push) Failing after 7m6s
Post / images (amd64, placement) (push) Failing after 27m47s
Post / image manifest (addon-manager) (push) Has been cancelled
Post / image manifest (placement) (push) Has been cancelled
Post / image manifest (registration) (push) Has been cancelled
Post / image manifest (registration-operator) (push) Has been cancelled
Post / image manifest (work) (push) Has been cancelled
Post / trigger clusteradm e2e (push) Has been cancelled
Close stale issues and PRs / stale (push) Successful in 3s
Fixed a bug where AppliedManifestWorks were not evicted immediately
after the appliedmanifestwork-eviction-grace-period expired.
Root cause: The controller used an exponential backoff rate limiter
to schedule requeue delays, which caused:
1. Exponentially increasing delays during grace period (1min -> 2min -> 4min...)
2. Unpredictable delays after grace period expired
Solution: Replace rate limiter with direct time calculation. Now the
controller calculates the exact remaining time until eviction and
schedules the next sync for that precise moment:
remainingTime := evictionTime.Sub(now)
Changes:
- Removed rateLimiter field and workqueue import
- Calculate exact remaining time instead of using exponential backoff
- Added V(4) logging to show scheduled eviction time and remaining time
- Updated unit test expectations (queue length 0 for delayed items)
Impact: AppliedManifestWorks are now evicted immediately when the
grace period expires, instead of being delayed by minutes due to
exponential backoff.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: zhujian <jiazhu@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 20s
Post / images (amd64, placement) (push) Failing after 45s
Post / images (amd64, registration) (push) Failing after 42s
Post / images (amd64, registration-operator) (push) Failing after 40s
Post / images (amd64, work) (push) Failing after 41s
Post / images (arm64, addon-manager) (push) Failing after 41s
Post / images (arm64, placement) (push) Failing after 40s
Post / images (arm64, registration) (push) Failing after 39s
Post / images (arm64, registration-operator) (push) Failing after 39s
Post / images (arm64, work) (push) Failing after 41s
Post / images (amd64, addon-manager) (push) Failing after 7m30s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Post / coverage (push) Failing after 9m44s
Update with success count
Remove status references
Add unit tests
Fix unit tests
Update unit tests
Test fix
Fix tests for lastTransitionTime
Fix integration tests
Signed-off-by: annelau <annelau@salesforce.com>
Co-authored-by: annelau <annelau@salesforce.com>
Post / images (amd64, addon-manager) (push) Failing after 43s
Post / images (amd64, placement) (push) Failing after 36s
Post / images (amd64, registration) (push) Failing after 36s
Post / images (amd64, registration-operator) (push) Failing after 36s
Post / images (amd64, work) (push) Failing after 38s
Post / images (arm64, placement) (push) Failing after 37s
Post / images (arm64, registration) (push) Failing after 37s
Post / images (arm64, registration-operator) (push) Failing after 38s
Post / images (arm64, work) (push) Failing after 38s
Post / images (arm64, addon-manager) (push) Failing after 14m20s
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m28s
Post / image manifest (addon-manager) (push) Has been cancelled
Post / image manifest (placement) (push) Has been cancelled
Post / image manifest (registration) (push) Has been cancelled
Post / image manifest (registration-operator) (push) Has been cancelled
Post / image manifest (work) (push) Has been cancelled
Post / trigger clusteradm e2e (push) Has been cancelled
Close stale issues and PRs / stale (push) Successful in 4s
Update code changes to only update observed generation without lastTransitionTime
Update with simple tests
Update with the latest PR changes
Add unit test changes
Add integration test generated by cursor
Fix unit tests
Signed-off-by: annelau <annelau@salesforce.com>
Co-authored-by: annelau <annelau@salesforce.com>
Post / images (amd64, addon-manager) (push) Failing after 46s
Post / images (amd64, placement) (push) Failing after 41s
Post / images (amd64, registration-operator) (push) Failing after 39s
Post / images (amd64, work) (push) Failing after 42s
Post / images (arm64, addon-manager) (push) Failing after 39s
Post / images (arm64, placement) (push) Failing after 39s
Post / images (arm64, registration) (push) Failing after 40s
Post / images (arm64, registration-operator) (push) Failing after 42s
Post / images (arm64, work) (push) Failing after 39s
Post / images (amd64, registration) (push) Failing after 7m46s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Post / coverage (push) Failing after 14m33s
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m25s
Close stale issues and PRs / stale (push) Successful in 46s
* Add addon conversion webhook for v1alpha1/v1beta1 API migration
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Qing Hao <qhao@redhat.com>
* Fix GroupVersion compatibility issues after API dependency update
This commit fixes compilation and test errors introduced by updating
the API dependency to use native conversion functions from PR #411.
Changes include:
1. Fix GroupVersion type mismatches across the codebase:
- Updated OwnerReference creation to use schema.GroupVersion
- Fixed webhook scheme registration to use proper GroupVersion type
- Applied fixes to addon, placement, migration, work, and registration controllers
2. Enhance addon conversion webhook:
- Use native API conversion functions from addon/v1beta1/conversion.go
- Fix InstallNamespace annotation key to match expected format
- Add custom logic to populate deprecated ConfigReferent field in ConfigReferences
- Properly preserve annotations during v1alpha1 <-> v1beta1 conversion
3. Remove duplicate conversion code:
- Deleted pkg/addon/webhook/conversion/ directory (~500 lines)
- Now using native conversion functions from the API repository
4. Patch vendored addon-framework:
- Fixed GroupVersion errors in agentdeploy utils
All unit tests pass successfully (97 packages, 0 failures).
Signed-off-by: Qing Hao <qhao@redhat.com>
---------
Signed-off-by: Qing Hao <qhao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Skip garbage collection for ManifestWorks that have the
ManifestWorkReplicaSet controller label, as these should be
managed exclusively by the ManifestWorkReplicaSet controller.
Changes:
- Fix logic bug in controller to properly check for ReplicaSet label
- Add unit tests for label-based GC skip behavior
- Add integration test to verify GC skip for ReplicaSet-managed works
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Jian Qiu <jqiu@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This change adds log tracing support to the work agent controllers by:
- Upgrading SDK to version with logging.SetLogTracingByObject helper
- Setting tracing keys from ManifestWork objects in all work controllers
- Adding clusterName to the base logger for better log context
- Propagating tracing context through cloud events
The tracing keys enable better correlation of logs across the work
lifecycle from source to agent.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Jian Qiu <jqiu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m11s
Post / coverage (push) Failing after 37m30s
Post / images (amd64, addon-manager) (push) Failing after 7m29s
Post / images (amd64, placement) (push) Failing after 6m57s
Post / images (amd64, registration) (push) Failing after 7m5s
Post / images (amd64, registration-operator) (push) Failing after 7m5s
Post / images (amd64, work) (push) Failing after 7m2s
Post / images (arm64, addon-manager) (push) Failing after 7m18s
Post / images (arm64, placement) (push) Failing after 7m7s
Post / images (arm64, registration) (push) Failing after 7m13s
Post / images (arm64, registration-operator) (push) Failing after 7m6s
Post / images (arm64, work) (push) Failing after 7m2s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Close stale issues and PRs / stale (push) Successful in 45s
* Use base controller in sdk-go
We can leverage contextual logger in base controller.
Signed-off-by: Jian Qiu <jqiu@redhat.com>
* Fix integration test error
Signed-off-by: Jian Qiu <jqiu@redhat.com>
---------
Signed-off-by: Jian Qiu <jqiu@redhat.com>
When a ManifestWorkReplicaSet's placementRef was changed, the
ManifestWorks created for the old placement were not deleted,
causing orphaned resources.
The deployReconciler only processed placements currently in the spec
and never cleaned up ManifestWorks from removed placements.
This commit adds cleanup logic that:
- Builds a set of current placement names from the spec
- Lists all ManifestWorks belonging to the ManifestWorkReplicaSet
- Deletes any ManifestWorks with placement labels not in current spec
Also adds comprehensive tests:
- Integration test verifying placement change cleanup
- Unit tests for single and multiple placement change scenarios
Fixes#1203🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Jian Qiu <jqiu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 50s
Post / coverage (push) Failing after 55s
Post / images (amd64, addon-manager) (push) Failing after 36s
Post / images (amd64, placement) (push) Failing after 40s
Post / images (amd64, registration) (push) Failing after 28s
Post / images (amd64, registration-operator) (push) Failing after 20s
Post / images (amd64, work) (push) Failing after 29s
Post / images (arm64, addon-manager) (push) Failing after 33s
Post / images (arm64, placement) (push) Failing after 32s
Post / images (arm64, registration) (push) Failing after 25s
Post / images (arm64, registration-operator) (push) Failing after 33s
Post / images (arm64, work) (push) Failing after 29s
Post / image manifest (addon-manager) (push) Has been skipped
Post / image manifest (placement) (push) Has been skipped
Post / image manifest (registration) (push) Has been skipped
Post / image manifest (registration-operator) (push) Has been skipped
Post / image manifest (work) (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
* 🧹 Remove resolved TODO comments
- Remove TODO comment about confirming subject in CustomSignerConfigurations
- Remove TODO comment about namespace value in manifestwork validating webhook
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
* Ensure addon namespace and image pull secret sync
- Remove outdated TODO comment about addon deployment in Hosted mode
- Namespace creation is responsibility of addon developers
- Image pull secrets are synced to namespaces with addon.open-cluster-management.io/namespace label by the AddonPullImageSecretController
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: zhujian <jiazhu@redhat.com>
---------
Signed-off-by: zhujian <jiazhu@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m58s
Post / coverage (push) Failing after 36m24s
Post / images (amd64) (push) Failing after 9m7s
Post / images (arm64) (push) Failing after 8m30s
Post / image manifest (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Close stale issues and PRs / stale (push) Successful in 57s
* Skip manifests in work reconcile that are marked Complete
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Aggregate Complete condition to work from manifests
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Delete work that is complete and satisfies configured TTL
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* tests
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* lint
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* go.mod
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Helper funcs for conditions
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Generic condition aggregation
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Support integration test args
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Remove work deletion from spoke, will be moved to hub GC
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Cleanup
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* update api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Wait for NS to exist before testing
Signed-off-by: Ben Perry <bhperry94@gmail.com>
---------
Signed-off-by: Ben Perry <bhperry94@gmail.com>
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m40s
Post / coverage (push) Failing after 35m43s
Post / images (amd64) (push) Failing after 8m36s
Post / images (arm64) (push) Failing after 8m8s
Post / image manifest (push) Has been skipped
Post / trigger clusteradm e2e (push) Has been skipped
Close stale issues and PRs / stale (push) Successful in 48s
* Import OCM API changes for workload conditions
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Implement condition rule evaluator
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Evaluate manifest condition rules after apply
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* note to self
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Cleanup
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Return config option if rules are set
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* update api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Always return an error to inform user about the state of their condition rule
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Condition rule errors should not result in retrying apply
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Test condition rule reconciliation
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Return condition status Unknown when an internal CEL error occurs
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Switch to common CEL lib
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update to simplified celExpressions format
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Formatting
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* tidy
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update ocm api
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update sdk-go
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Switch to sdk-go ConditionLib
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update API
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Switch to WellKnownConditions with required Condition field
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Support CEL evaluation budget
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update sdk-go
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update API
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* lint
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update go.mod
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Tests and comments
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Move condition reader to status controller for more frequent updates
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Ignore missing WellKnownCondition
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Fix test
Signed-off-by: Ben Perry <bhperry94@gmail.com>
* Update condition tests
Signed-off-by: Ben Perry <bhperry94@gmail.com>
---------
Signed-off-by: Ben Perry <bhperry94@gmail.com>
Post / trigger clusteradm e2e (push) Has been skipped
Scorecard supply-chain security / Scorecard analysis (push) Failing after 1m9s
Close stale issues and PRs / stale (push) Successful in 40s
Instead of requeue all each resyncInterval, we requeue
for each item separately with a jitter to avoud bursty request
Signed-off-by: Jian Qiu <jqiu@redhat.com>