docs: refine sbctl bundle contract design

Design clarifications:
- sbctl remains separate binary, imports shared libs from troubleshoot
- Add bundle schema versioning to discovery.json
- Add CRD table collection as explicit requirement
- Add failsafe collection guarantee (table failures never block raw)
- Add hybrid field selector discovery (runtime + static fallback)
- Add Limitations section (watch, subresources, writes, pagination)
- Add Testing Strategy section

Copyediting:
- Apply Strunk's Elements of Style throughout
- Use active voice consistently
- Remove hedging and redundant phrases
- Tighten prose for clarity
ada mancini
2026-01-22 16:18:13 -05:00
parent a1991ae953
commit 0c4bf958ca


@@ -1,4 +1,4 @@
# Design: Shared Bundle Contract for sbctl and Troubleshoot
**Date:** 2026-01-22
**Status:** Draft
@@ -6,34 +6,35 @@
## Problem Statement
sbctl is a companion tool to troubleshoot that creates a local Kubernetes API server from support bundles, enabling users to run `kubectl` commands against captured cluster state. Currently:
1. **Maintenance burden**: sbctl requires constant updates to stay in sync with troubleshoot's collector changes
2. **No formal contract**: No schema defines how troubleshoot writes resources or how sbctl reads them
3. **Type-specific code**: sbctl has ~1800 lines of switch statements handling each resource type individually
4. **Drift risk**: As troubleshoot adds capabilities, sbctl may not reflect them, causing inconsistent behavior
## Goals
1. **Eliminate maintenance burden** - sbctl should not require updates when troubleshoot adds collectors or changes file structure
2. **Enable dynamic resource support** - sbctl should serve any Kubernetes resource troubleshoot collects, without code changes
3. **Perfect fidelity** - sbctl output must match what a real Kubernetes API server would return; debuggers must trust the output
4. **Backwards compatibility** - New and old versions of bundles and sbctl must interoperate
## Non-Goals
- Modifying the sbctl CLI interface
- Supporting write operations
- Connecting to live clusters
## Solution Overview
A new `pkg/bundle` package in troubleshoot defines the contract between collection and serving. sbctl remains a separate binary but imports shared libraries from troubleshoot. Key changes:
1. **Pre-compute table representations** at collection time, when cluster access exists
2. **Store API discovery metadata** alongside resources
3. **Discover field selectors** at runtime with static fallback
4. **Serve resources generically** using unstructured objects with metadata-driven behavior
5. **Collect failsafe** - table and metadata failures never block raw resource collection
## Detailed Design
@@ -100,10 +101,13 @@ This is stored as `{resource}/{namespace}.table.json` alongside the raw `{namesp
### Discovery Metadata
`_meta/discovery.json` captures API server discovery information and bundle schema version:
```json
{
  "bundleSchemaVersion": "1.0",
  "collectedAt": "2026-01-22T10:30:00Z",
  "kubernetesVersion": "v1.28.4",
  "apiVersion": "v1",
  "groups": [
    {
@@ -129,6 +133,11 @@ This is stored as `{resource}/{namespace}.table.json` alongside the raw `{namesp
}
```
The `bundleSchemaVersion` field enables future schema evolution:
- sbctl checks the version and selects the appropriate parsing logic (see the sketch below)
- Unknown versions fall back to legacy behavior
- Semantic versioning: minor versions are backwards compatible
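A minimal sketch of that version gate, using only the standard library; `SchemaCompatible` and `SupportedSchemaMajor` are illustrative names for `pkg/bundle`, not existing code:
```go
// SupportedSchemaMajor is the bundle schema major version this build of sbctl understands.
const SupportedSchemaMajor = 1

// SchemaCompatible reports whether a bundle's bundleSchemaVersion can be parsed
// by this build. An empty version means a legacy bundle (no versioned
// _meta/discovery.json), which callers should route through the legacy path.
func SchemaCompatible(version string) (bool, error) {
	if version == "" {
		return false, nil // legacy bundle: use legacy behavior
	}
	major, _, _ := strings.Cut(version, ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return false, fmt.Errorf("invalid bundleSchemaVersion %q: %w", version, err)
	}
	// Minor versions within the same major are backwards compatible.
	return n == SupportedSchemaMajor, nil
}
```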
### Shared Package: `pkg/bundle`
A new package shared between collectors and sbctl defines the bundle contract.
@@ -161,6 +170,16 @@ func TablePath(resource, namespace string) string {
	}
	return fmt.Sprintf("%s/%s.table.json", resource, namespace)
}

// SelectableFieldsPath returns the path for discovered selectable fields
func SelectableFieldsPath(gvr schema.GroupVersionResource) string {
	group := gvr.Group
	if group == "" {
		group = "core"
	}
	return filepath.Join(MetaDir, SelectableFieldsDir,
		fmt.Sprintf("%s.%s.%s.json", group, gvr.Version, gvr.Resource))
}
```
#### `pkg/bundle/fields.go`
@@ -249,6 +268,24 @@ func GetSelectableFields(group, version, resource string) []string {
	// Default: all resources support metadata selectors
	return []string{"metadata.name", "metadata.namespace"}
}

// LoadSelectableFields loads selectable fields using a hybrid approach:
// 1. Check for discovered fields in bundle (preferred)
// 2. Fall back to static SelectableFields map
// 3. Fall back to metadata-only fields
func LoadSelectableFields(bundlePath string, gvr schema.GroupVersionResource) []string {
	// Try discovered fields first
	discoveredPath := filepath.Join(bundlePath, ClusterResourcesDir, SelectableFieldsPath(gvr))
	if data, err := os.ReadFile(discoveredPath); err == nil {
		var fields []string
		if json.Unmarshal(data, &fields) == nil && len(fields) > 0 {
			return fields
		}
	}

	// Fall back to static map
	return GetSelectableFields(gvr.Group, gvr.Version, gvr.Resource)
}
```
#### `pkg/bundle/extract.go`
@@ -311,26 +348,47 @@ func formatSlice(slice []interface{}) string {
### Collection Changes
The `clusterResources` collector gains additional collection logic. **Critical requirement: collection must be failsafe.** Failures in table or metadata collection must never prevent raw resource collection from succeeding.
```go
func (c *CollectClusterResources) Collect(progressChan chan<- interface{}) (CollectorResult, error) {
	// ... existing collection logic ...

	// For each resource type collected (including CRDs):
	for _, gvr := range collectedResources {
		// 1. Collect raw resources (existing) - this MUST succeed
		rawResult, err := collectRawResources(client, gvr, namespace)
		if err != nil {
			// Handle error per existing behavior
			continue
		}
		output.SaveResult(bundle.ResourcePath(gvr.Resource, namespace), rawResult)

		// 2. Collect table representation (NEW) - failsafe, errors logged but not fatal
		tableResult, err := collectTableRepresentation(client, gvr, namespace)
		if err != nil {
			// Log warning but continue - some resources (especially CRDs) may not support table format
			log.Warnf("Failed to collect table for %s: %v", gvr, err)
		} else {
			output.SaveResult(bundle.TablePath(gvr.Resource, namespace), tableResult)
		}

		// 3. Discover selectable fields (NEW) - failsafe
		fields, err := discoverSelectableFields(client, gvr)
		if err != nil {
			log.Debugf("Could not discover selectable fields for %s: %v", gvr, err)
		} else if len(fields) > 0 {
			output.SaveResult(bundle.SelectableFieldsPath(gvr), fields)
		}
	}

	// Collect discovery metadata (NEW) - failsafe
	discovery, err := collectDiscoveryMetadata(client)
	if err != nil {
		log.Warnf("Failed to collect API discovery metadata: %v", err)
	} else {
		output.SaveResult(path.Join(bundle.MetaDir, bundle.DiscoveryFile), discovery)
	}

	return output, nil
}
@@ -343,6 +401,30 @@ func collectTableRepresentation(client rest.Interface, gvr schema.GroupVersionRe
	return req.DoRaw(ctx)
}

// discoverSelectableFields probes the API server to discover which fields are selectable
func discoverSelectableFields(client rest.Interface, gvr schema.GroupVersionResource) ([]byte, error) {
	// Make request with invalid field selector to get error listing valid fields
	req := client.Get().
		Resource(gvr.Resource).
		Param("fieldSelector", "__invalid_field__=x")
	_, err := req.DoRaw(ctx)
	if err == nil {
		// Unexpected success - return empty
		return nil, nil
	}

	// Parse error message for field list
	// Error format: "field label not supported: __invalid_field__"
	// or lists valid fields in some K8s versions
	fields := parseSelectableFieldsFromError(err)
	if len(fields) == 0 {
		return nil, fmt.Errorf("could not parse selectable fields from error")
	}
	return json.Marshal(fields)
}
```
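The helper `parseSelectableFieldsFromError` is referenced above but not defined in this design. A minimal sketch, assuming the rejection message enumerates valid fields after the word "only" (some Kubernetes versions do this for some resources, as the code comment above notes); messages in other formats yield nil and trigger the static fallback:
```go
// parseSelectableFieldsFromError extracts field names from an error such as:
//   `"__invalid_field__" is not a known field selector: only "metadata.name", "metadata.namespace", "spec.nodeName"`
// Messages that do not enumerate valid fields yield nil, and the caller
// falls back to the static map.
func parseSelectableFieldsFromError(err error) []string {
	msg := err.Error()
	// Only look at the portion after "only", where valid fields are listed.
	idx := strings.Index(msg, "only")
	if idx < 0 {
		return nil
	}
	var fields []string
	// Quoted, dot-separated identifiers such as "spec.nodeName".
	re := regexp.MustCompile(`"([a-zA-Z0-9.]+)"`)
	for _, m := range re.FindAllStringSubmatch(msg[idx:], -1) {
		if strings.Contains(m[1], ".") {
			fields = append(fields, m[1])
		}
	}
	return fields
}
```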
### sbctl Serving Changes
@@ -426,26 +508,21 @@ This works because:
## Migration Path
sbctl remains a separate binary with its own repository, but imports shared libraries from troubleshoot.
| Phase | Changes |
|-------|---------|
| Phase 1 | Add `pkg/bundle` to troubleshoot with schema definitions |
| Phase 2 | Update collectors to write `_meta/` and `.table.json` files (failsafe) |
| Phase 3 | sbctl adds troubleshoot as a Go module dependency |
| Phase 4 | Refactor sbctl to use `pkg/bundle` with fallback to legacy |
| Phase 5 | (Later) Remove legacy fallback code from sbctl |
## Bundle Size Impact
Pre-computed table files add minimal storage overhead:
- The `cluster-resources/` directory rarely exceeds 5MB
- Table JSON compresses well within the tar.gz bundle
## Alternatives Considered
@@ -454,35 +531,133 @@ This tradeoff is acceptable because:
Generate table output at serve time using Kubernetes' internal printers.
**Rejected because:**
- Typed objects for each resource preserve switch statements
- Complex columns (e.g., Pod "Ready" = running/total containers) demand type-specific logic
- Perfect fidelity requires cluster access
### Alternative 2: Unstructured with Basic Tables
Serve all resources as unstructured, provide only NAME/NAMESPACE/AGE columns.
**Rejected because:**
- Violates the fidelity requirement
- Strips resource-specific columns that debuggers rely on
- Degrades the troubleshooting experience unacceptably
### Alternative 3: Static-Only Selectable Fields Mapping
Maintain only a static mapping of selectable fields without runtime discovery.
**Rejected because:**
- Map becomes stale as Kubernetes adds new selectable fields
- No way to know selectable fields for CRDs
- Requires manual updates to track Kubernetes releases
The chosen hybrid approach (runtime discovery at collection time + static fallback) provides better coverage while maintaining offline functionality.
## Design Decisions
### CRD Table Collection
**Decision: Yes, collect table representations for CRDs.**
Rationale:
- CRDs often provide the most valuable custom printer columns (e.g., Certificate expiry, VolumeSnapshot ready state)
- Users expect consistent behavior between core resources and CRDs
- Failsafe collection ensures raw data even when a CRD lacks table format support
### Failsafe Collection
**Decision: Table and metadata collection failures must never block raw resource collection.**
Rationale:
- The cluster-resources collector exists primarily to capture raw resource data
- Some resources, especially CRDs, lack table format support
- Degraded clusters may fail API server discovery
- Users must always receive at least the raw resource data
### Field Selector Discovery
**Decision: Hybrid approach - runtime discovery at collection time with static fallback.**
At collection time:
1. Probe API server with invalid field selector
2. Parse error message to extract valid field list
3. Store discovered fields in `_meta/selectable-fields/`
At serve time (sbctl):
1. Check for discovered fields in bundle
2. Fall back to static `SelectableFields` map
3. Fall back to metadata-only (`metadata.name`, `metadata.namespace`)
**Note:** Field discovery via error parsing is inherently fragile as error message formats vary between Kubernetes versions. The static fallback ensures functionality even when discovery fails.
### Bundle Schema Versioning
**Decision: Include `bundleSchemaVersion` in `_meta/discovery.json`.**
Rationale:
- Enables future schema changes without breaking compatibility
- sbctl can adapt behavior based on schema version
- Unknown versions trigger fallback to legacy behavior
- Semantic versioning: `1.x` versions are backwards compatible with `1.0`
## Limitations
The following features are explicitly **not supported** by sbctl and are out of scope for this design:
### Watch Operations
sbctl serves static bundle data and cannot support watch operations. Commands like `kubectl get pods -w`:
- Return the initial list
- Immediately close the watch stream
This is inherent to sbctl's design as a snapshot-based tool.
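A minimal sketch of what this looks like at the HTTP layer, assuming a hypothetical `handleWatch` handler (an illustration, not existing sbctl code): the initial list is answered by the ordinary list handler, and the watch request closes without emitting events.
```go
// handleWatch answers `?watch=true` requests against static bundle data.
// kubectl already received the initial list from the ordinary list handler;
// the watch stream itself carries no events because a snapshot never changes.
func handleWatch(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	// Returning without writing any watch events ends the response body,
	// which kubectl observes as the watch stream closing.
}
```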
### Subresources
Subresources that require live cluster interaction are not supported:
| Subresource | Reason |
|-------------|--------|
| `pods/exec` | Requires running container |
| `pods/attach` | Requires running container |
| `pods/portforward` | Requires running container |
| `pods/log` | Log data not collected by default (separate collector) |
| `*/scale` | Write operation |
| `*/status` | Could be supported if status is in collected data |
### Write Operations
All mutating operations return `405 Method Not Allowed`:
- `kubectl create`, `kubectl apply`, `kubectl delete`
- `kubectl edit`, `kubectl patch`
- `kubectl scale`, `kubectl rollout`
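One way to enforce this uniformly is a middleware in front of every resource handler; a minimal sketch, where `rejectWrites` is an illustrative name rather than existing sbctl code:
```go
// rejectWrites turns every mutating verb into 405 Method Not Allowed,
// so the handlers behind it only ever see read requests.
func rejectWrites(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet, http.MethodHead:
			next.ServeHTTP(w, r)
		default: // POST (create), PUT (update), PATCH, DELETE
			w.Header().Set("Allow", "GET, HEAD")
			http.Error(w, "support bundle data is read-only", http.StatusMethodNotAllowed)
		}
	})
}
```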
### Real-time Data
sbctl serves point-in-time snapshot data. It cannot reflect:
- Current pod status
- Live metrics
- Recent events (only events collected at bundle time)
- Changes since bundle collection
### List Pagination
Kubernetes `continue` tokens for paginated lists are not implemented. sbctl returns all matching resources in a single response. This design works because:
- Bundles contain finite resources, already filtered at collection time
- Typical bundle sizes fit comfortably in single responses
### resourceVersion Semantics
sbctl does not implement proper `resourceVersion` semantics:
- All resources return a static or omitted `resourceVersion`
- Optimistic concurrency checks are not meaningful
- Watch `resourceVersion` parameters are ignored
## Open Questions
None remaining. All questions have been resolved in the Design Decisions section.
## Success Criteria
@@ -491,6 +666,97 @@ Discover selectable fields by making invalid requests and parsing error messages
3. Field selectors work for documented selectable fields
4. Old bundles continue to work with new sbctl
5. No switch statements for resource types in sbctl serving code
6. Table/metadata collection failures never prevent raw resource collection
7. CRD resources have table representations when the CRD supports them
## Testing Strategy
### Unit Tests
**pkg/bundle tests:**
- `ResourcePath()`, `TablePath()`, `SelectableFieldsPath()` return correct paths
- `GetSelectableFields()` returns correct fields for known GVRs
- `GetSelectableFields()` returns metadata-only fallback for unknown GVRs
- `LoadSelectableFields()` prefers discovered fields over static map
- `ExtractSelectableFields()` correctly extracts nested field values
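A minimal sketch of one of these tests - the fallback-ordering check for `LoadSelectableFields` - assuming the `pkg/bundle` constants and path helpers shown earlier (standard `testing`, `os`, `path/filepath`, and `schema` imports assumed):
```go
func TestLoadSelectableFields_FallbackOrder(t *testing.T) {
	gvr := schema.GroupVersionResource{Version: "v1", Resource: "pods"}

	// Empty bundle: no discovered fields on disk, so the static map answers.
	emptyBundle := t.TempDir()
	static := bundle.LoadSelectableFields(emptyBundle, gvr)
	if len(static) == 0 {
		t.Fatalf("expected static fallback fields for pods, got none")
	}

	// Bundle with a discovered-fields file: discovered fields take precedence.
	withMeta := t.TempDir()
	p := filepath.Join(withMeta, bundle.ClusterResourcesDir, bundle.SelectableFieldsPath(gvr))
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(p, []byte(`["spec.nodeName","status.phase"]`), 0o644); err != nil {
		t.Fatal(err)
	}
	got := bundle.LoadSelectableFields(withMeta, gvr)
	if len(got) != 2 || got[0] != "spec.nodeName" {
		t.Fatalf("expected discovered fields to take precedence, got %v", got)
	}
}
```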
**Collection tests:**
- Table collection failure doesn't prevent raw collection
- Discovery metadata collection failure doesn't prevent resource collection
- Field selector discovery failure logs warning and continues
- CRDs without table support are collected as raw-only
### Integration Tests
**Fidelity tests (compare sbctl vs live cluster):**
```bash
# Collect bundle from live cluster
support-bundle --spec=test-spec.yaml -o test-bundle.tar.gz
# For each resource type:
# 1. Query live cluster
kubectl get pods -o json > live-pods.json
kubectl get pods > live-pods-table.txt
# 2. Query sbctl
sbctl serve test-bundle.tar.gz &
kubectl --server=https://localhost:8443 get pods -o json > sbctl-pods.json
kubectl --server=https://localhost:8443 get pods > sbctl-pods-table.txt
# 3. Compare (ignoring dynamic fields like resourceVersion)
diff <(jq 'del(.metadata.resourceVersion, .items[].metadata.resourceVersion)' live-pods.json) \
     <(jq 'del(.metadata.resourceVersion, .items[].metadata.resourceVersion)' sbctl-pods.json)
diff live-pods-table.txt sbctl-pods-table.txt
```
**Backwards compatibility tests:**
```bash
# Test new sbctl with old bundle (no _meta/)
sbctl serve old-bundle-without-meta.tar.gz
kubectl --server=https://localhost:8443 get pods # Should work via legacy path
# Test old sbctl with new bundle (has _meta/)
old-sbctl serve new-bundle-with-meta.tar.gz
kubectl --server=https://localhost:8443 get pods # Should work, ignoring _meta/
```
**Failsafe tests:**
```bash
# Simulate table collection failure (mock API server returning errors)
# Verify raw resources still collected
# Verify bundle is valid and usable
# Simulate discovery metadata failure
# Verify resources still collected
# Verify sbctl falls back to legacy behavior
```
### Field Selector Tests
```bash
# Test field selectors work correctly
kubectl --server=https://localhost:8443 get pods --field-selector=status.phase=Running
kubectl --server=https://localhost:8443 get pods --field-selector=spec.nodeName=node-1
kubectl --server=https://localhost:8443 get events --field-selector=involvedObject.name=my-pod
# Compare results with live cluster output
```
### CRD Tests
```bash
# Collect bundle from cluster with CRDs (e.g., cert-manager Certificates)
# Verify CRD resources have table representations
# Verify kubectl get certificates shows custom columns (Ready, Secret, Age)
# Verify CRDs without additionalPrinterColumns still work (basic columns)
```
### Performance Tests
For large bundles (10k+ resources):
- Measure response time for list operations
- Measure memory usage during filtering
- Verify no timeouts or OOM conditions
## References