Person 2 PRD: Collectors, Redaction, Analysis, Diff, Remediation

CRITICAL CODEBASE ANALYSIS UPDATE

This PRD has been updated based on comprehensive analysis of the current troubleshoot codebase. Key findings:

Current State Analysis

  • API Schema: Current API group is troubleshoot.replicated.com (not troubleshoot.sh), with v1beta1 and v1beta2 available
  • Binary Structure: Multiple binaries already exist (preflight, support-bundle, collect, analyze)
  • CLI Structure: support-bundle root command exists with analyze and redact subcommands
  • Collection System: Comprehensive collection framework in pkg/collect/ with 15+ collector types
  • Redaction System: Functional redaction system in pkg/redact/ with multiple redactor types
  • Analysis System: Mature analysis system in pkg/analyze/ with 60+ built-in analyzers
  • Support Bundle: Complete support bundle system in pkg/supportbundle/ with archiving and processing

Implementation Strategy

This PRD now focuses on EXTENDING existing systems rather than building from scratch:

  • Auto-collectors: NEW package pkg/collect/autodiscovery/ extending existing collection
  • Redaction tokenization: ENHANCE existing pkg/redact/ system
  • Agent-based analysis: WRAP existing pkg/analyze/ system with agent abstraction
  • Bundle differencing: COMPLETELY NEW pkg/supportbundle/diff/ capability

Overview

Person 2 is responsible for the core data collection, processing, and analysis capabilities of the troubleshoot project. This involves implementing auto-collectors, advanced redaction with tokenization, agent-based analysis, support bundle differencing, and remediation suggestions.

Scope & Responsibilities

  • Auto-collectors (namespace-scoped, RBAC-aware), include image digests & tags
  • Redaction with tokenization (optional local LLM-assisted pass), emit redaction-map.json
  • Analyzer via agents (local/hosted) and "generate analyzers from requirements"
  • Support bundle diffs and remediation suggestions

Primary Code Areas

  • pkg/collect - Collection engine and auto-collectors (extending existing collection system)
  • pkg/redact - Redaction engine with tokenization (enhancing existing redaction system)
  • pkg/analyze - Analysis engine and agent integration (extending existing analysis system)
  • pkg/supportbundle - Bundle readers/writers and artifact management (extending existing support bundle system)
  • examples/* - Reference implementations and test cases

Critical API Contract: All implementations must use ONLY the current API group troubleshoot.replicated.com/v1beta2 types and be prepared for future migration to Person 1's planned schema updates. No schema modifications allowed.

Deliverables

Core Deliverables (Based on Current CLI Structure)

  1. support-bundle --namespace ns --auto - enhance existing root command with auto-discovery capabilities
  2. Redaction/tokenization profiles - streaming integration in collection path, emit redaction-map.json
  3. support-bundle analyze --agent claude|local --bundle bundle.tgz - enhance existing analyze subcommand with agent support
  4. support-bundle diff old.tgz new.tgz - NEW subcommand with structured diff.json output
  5. "Generate analyzers from requirements" - create analyzers from requirement specifications
  6. Remediation blocks - surfaced in analysis outputs with actionable suggestions

Note: The current CLI structure has support-bundle as the root collection command, with analyze and redact as subcommands. The diff subcommand will be newly added.

Critical Implementation Constraints

  • NO schema alterations: Person 2 consumes but never modifies schemas/types from Person 1
  • Streaming redaction: Must run as streaming step during collection (per IO flow contract)
  • Exact CLI compliance: Implement commands exactly as specified in CLI contracts
  • Artifact format compliance: Follow exact naming conventions for all output files

Component 1: Auto-Collectors

Objective

Implement intelligent, namespace-scoped auto-collectors that enhance the current YAML-driven collection system with automatic foundational data discovery. This creates a dual-path collection strategy that ensures comprehensive troubleshooting data is always gathered.

Dual-Path Collection Strategy

Current System (YAML-only):

  • Collects only what vendors specify in YAML collector specs
  • Limited to predefined collector configurations
  • May miss critical cluster state information

New Auto-Collectors System:

  • Path 1 - No YAML: Automatically discover and collect foundational cluster data (logs, deployments, services, configmaps, secrets, events, etc.)
  • Path 2 - With YAML: Collect vendor-specified YAML collectors and automatically add the foundational data alongside them
  • Always ensures comprehensive baseline data collection for effective troubleshooting

Requirements

  • Foundational collection: Always collect essential cluster resources (pods, deployments, services, configmaps, events, logs)
  • Namespace-scoped collection: Respect namespace boundaries and permissions
  • RBAC-aware: Only collect data the user has permission to access
  • Image metadata: Include digests, tags, and repository information for discovered containers
  • Deterministic expansion: Same cluster state should produce consistent foundational collection
  • YAML augmentation: When YAML specs provided, add foundational collection to vendor-specified collectors
  • Streaming integration: Work with redaction pipeline during collection

Technical Specifications

1.1 Auto-Discovery Engine

Location: pkg/collect/autodiscovery/

Components:

  • discoverer.go - Main discovery orchestrator
  • rbac_checker.go - Permission validation
  • namespace_scanner.go - Namespace-aware resource enumeration
  • resource_expander.go - Convert discovered resources to collector specs

API Contract:

type AutoCollector interface {
    // Discover foundational collectors based on cluster state
    DiscoverFoundational(ctx context.Context, opts DiscoveryOptions) ([]CollectorSpec, error)
    // Augment existing YAML collectors with foundational collectors
    AugmentWithFoundational(ctx context.Context, yamlCollectors []CollectorSpec, opts DiscoveryOptions) ([]CollectorSpec, error)
    // Validate permissions for discovered resources
    ValidatePermissions(ctx context.Context, resources []Resource) ([]Resource, error)
}

type DiscoveryOptions struct {
    Namespaces         []string
    IncludeImages      bool
    RBACCheck          bool
    MaxDepth           int
    FoundationalOnly   bool   // Path 1: Only collect foundational data
    AugmentMode        bool   // Path 2: Add foundational to existing YAML specs
}

type FoundationalCollectors struct {
    // Core Kubernetes resources always collected
    Pods           []PodCollector
    Deployments    []DeploymentCollector
    Services       []ServiceCollector
    ConfigMaps     []ConfigMapCollector
    Secrets        []SecretCollector
    Events         []EventCollector
    Logs           []LogCollector
    // Container image metadata
    ImageFacts     []ImageFactsCollector
}
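
A minimal caller-side sketch of the dual-path dispatch described above is shown below. NewDiscoverer is an assumed constructor (not an existing API) and error handling is simplified; it is intended only to illustrate how the two paths map onto the interface.

// Hypothetical wiring of the dual-path logic; NewDiscoverer and the exact
// option fields set here are illustrative assumptions.
func buildCollectors(ctx context.Context, yamlCollectors []CollectorSpec, opts DiscoveryOptions) ([]CollectorSpec, error) {
    var ac AutoCollector = NewDiscoverer() // assumed constructor in pkg/collect/autodiscovery

    // Path 1: no YAML specs supplied; --auto alone collects foundational data.
    if len(yamlCollectors) == 0 {
        opts.FoundationalOnly = true
        return ac.DiscoverFoundational(ctx, opts)
    }

    // Path 2: YAML specs supplied; augment them with foundational collectors.
    opts.AugmentMode = true
    return ac.AugmentWithFoundational(ctx, yamlCollectors, opts)
}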

1.2 Image Metadata Collection

Location: pkg/collect/images/

Components:

  • registry_client.go - Registry API integration
  • digest_resolver.go - Convert tags to digests
  • manifest_parser.go - Parse image manifests
  • facts_builder.go - Build structured image facts

Data Structure:

type ImageFacts struct {
    Repository string            `json:"repository"`
    Tag        string            `json:"tag"`
    Digest     string            `json:"digest"`
    Registry   string            `json:"registry"`
    Size       int64             `json:"size"`
    Created    time.Time         `json:"created"`
    Labels     map[string]string `json:"labels"`
    Platform   Platform          `json:"platform"`
}

type Platform struct {
    Architecture string `json:"architecture"`
    OS           string `json:"os"`
    Variant      string `json:"variant,omitempty"`
}
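
As a rough illustration of what digest_resolver.go and facts_builder.go need to handle, the sketch below splits an image reference into the fields used by ImageFacts. It is not the project's actual parser; the docker.io default and the overall splitting rules are assumptions noted inline.

// Illustrative only: split "registry.example.com:5000/team/app:v1.2.3@sha256:abcd..."
// into the registry, repository, tag, and digest components used by ImageFacts.
func parseImageRef(ref string) (registry, repository, tag, digest string) {
    if i := strings.Index(ref, "@"); i >= 0 {
        digest = ref[i+1:]
        ref = ref[:i]
    }
    // A tag can only follow the last "/"; a ":" before that is a registry port.
    if i := strings.LastIndex(ref, ":"); i > strings.LastIndex(ref, "/") {
        tag = ref[i+1:]
        ref = ref[:i]
    }
    if i := strings.Index(ref, "/"); i >= 0 && strings.ContainsAny(ref[:i], ".:") {
        registry = ref[:i]
        repository = ref[i+1:]
    } else {
        registry = "docker.io" // assumed default registry when none is specified
        repository = ref
    }
    return registry, repository, tag, digest
}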

Implementation Checklist

Phase 1: Core Auto-Discovery (Week 1-2)

  • Discovery Engine Setup

    • Create pkg/collect/autodiscovery/ package structure
    • Implement Discoverer interface and base implementation
    • Add Kubernetes client integration for resource enumeration
    • Create namespace filtering logic
    • Add discovery configuration parsing
  • RBAC Integration

    • Implement RBACChecker for permission validation
    • Add SelfSubjectAccessReview integration
    • Create permission caching layer for performance (5min TTL)
    • Add fallback strategies for limited permissions
  • Resource Expansion

    • Implement resource-to-collector mapping via ResourceExpander
    • Add standard resource patterns (pods, deployments, services, configmaps, secrets, events)
    • Create expansion rules configuration with priority system
    • Add dependency graph resolution and deduplication
  • Unit Testing (all tests passing)

    • Test Discoverer.DiscoverFoundational() with mock Kubernetes clients
    • Test RBACChecker.FilterByPermissions() with various permission scenarios
    • Test namespace enumeration and filtering with different configurations
    • Test ResourceExpander with all foundational resource types
    • Test collector deduplication and conflict resolution (YAML overrides foundational)
    • Test error handling and graceful degradation scenarios
    • Test permission caching and RBAC integration
    • Test collector priority sorting and dual-path logic

Phase 2: Image Metadata Collection (Week 3)

  • Registry Integration

    • Create pkg/collect/images/ package
    • Implement registry client with authentication support (Docker Hub, ECR, GCR, Harbor, etc.)
    • Add manifest parsing for Docker v2 and OCI formats
    • Create digest resolution from tags
  • Facts Generation

    • Implement ImageFacts data structure with comprehensive metadata
    • Add image scanning and metadata extraction (platform, layers, config)
    • Create facts serialization to JSON with FactsBundle format
    • Add error handling and fallback modes with ContinueOnError
  • Integration

    • Integrate image collection into auto-discovery system
    • Add image facts to foundational collectors
    • Create facts.json output specification with summary statistics
    • Add Kubernetes image extraction from pods, deployments, daemonsets, statefulsets
  • Unit Testing (all tests passing)

    • Test registry client authentication and factory patterns for different registry types
    • Test manifest parsing for Docker v2, OCI, and legacy v1 image formats
    • Test digest resolution and validation with various formats
    • Test ImageFacts data structure serialization/deserialization
    • Test image metadata extraction with comprehensive validation
    • Test error handling for network failures and authentication
    • Test concurrent collection with rate limiting and semaphores
    • Test image facts caching and deduplication logic with LRU cleanup

Phase 3: CLI Integration (Week 4)

Note: The current CLI structure already provides --namespace; the --auto flag and related options have been added alongside it.

CLI Usage Patterns for Dual-Path Approach

Path 1 - Foundational Only (No YAML):

# Collect foundational data for default namespace
support-bundle --auto

# Collect foundational data for specific namespace(s)  
support-bundle --auto --namespace myapp

# Include container image metadata
support-bundle --auto --namespace myapp --include-images

# Use comprehensive discovery profile
support-bundle --auto --discovery-profile comprehensive --include-images

Path 2 - YAML + Foundational (Augmented):

# Collect vendor YAML specs + foundational data
support-bundle vendor-spec.yaml --auto

# Multiple YAML specs + foundational data  
support-bundle spec1.yaml spec2.yaml --auto --namespace myapp

# Exclude system namespaces from foundational collection
support-bundle vendor-spec.yaml --auto --exclude-namespaces "kube-*,cattle-*"

Current Behavior (Preserved):

# Only collect what's in YAML (no foundational data added)
support-bundle vendor-spec.yaml

New Diff Command:

# Compare two support bundles
support-bundle diff old-bundle.tgz new-bundle.tgz

# Output to JSON file
support-bundle diff old.tgz new.tgz --output json -f diff-report.json

# Generate HTML report with remediation
support-bundle diff old.tgz new.tgz --output html --include-remediation

  • Command Enhancement

    • Add --auto flag to support-bundle root command
    • Implement dual-path logic: no args+--auto = foundational only
    • Implement augmentation logic: YAML args+--auto = YAML + foundational
    • Integrate with existing --namespace filtering
    • Add --include-images option for container image metadata collection
    • Create --rbac-check validation mode (enabled by default)
    • Add support-bundle diff subcommand with full flag set
  • Configuration

    • Add discovery profiles (minimal, standard, comprehensive, paranoid)
    • Add namespace exclusion/inclusion patterns with glob support
    • Implement dry-run mode integration for auto-discovery
    • Create discovery configuration file support with JSON format
    • Add profile-based timeout and collection behavior configuration
  • Unit Testing (all tests passing)

    • Test CLI flag parsing and validation for all auto-discovery options
    • Test discovery profile loading and validation logic
    • Test dry-run mode integration and output
    • Test namespace filtering with glob patterns
    • Test command help text and flag descriptions
    • Test error handling for invalid CLI flag combinations
    • Test configuration file loading, validation, and fallbacks
    • Test dual-path mode detection and routing logic

Testing Strategy

  • Unit Tests (all passing)

    • RBAC checker with mock Kubernetes API
    • Resource expansion logic and deduplication
    • Image metadata parsing and registry integration
    • Discovery configuration validation and pattern matching
    • CLI flag validation and profile loading
    • Bundle diff validation and output formatting
  • Integration Tests (implemented)

    • End-to-end auto-discovery workflow testing
    • Permission boundary validation with mock RBAC
    • Image registry integration with mock HTTP servers
    • Namespace isolation verification
    • CLI integration with existing support-bundle system
  • Performance Tests (benchmarked)

    • Large cluster discovery performance (1000+ resources)
    • Image metadata collection at scale with concurrent processing
    • Memory usage during auto-discovery with caching
    • CLI flag parsing and configuration loading performance

Step-by-Step Implementation

Step 1: Set up Auto-Discovery Foundation

  1. Create package structure: pkg/collect/autodiscovery/
  2. Define AutoCollector interface with dual-path methods in interfaces.go
  3. Implement FoundationalDiscoverer struct in discoverer.go
  4. Define foundational collectors list (pods, deployments, services, configmaps, secrets, events, logs)
  5. Add Kubernetes client initialization and configuration
  6. Create unit tests for basic discovery functionality

Step 2: Implement Foundational Collection (Path 1)

  1. Create foundational.go with predefined essential collector specs
  2. Implement namespace-scoped resource enumeration for foundational resources
  3. Add RBAC checking for each foundational collector type
  4. Create deterministic resource expansion (same cluster → same collectors)
  5. Add comprehensive unit tests for foundational collection

Step 3: Implement YAML Augmentation (Path 2)

  1. Create augmenter.go to merge YAML collectors with foundational collectors
  2. Implement deduplication logic (avoid collecting same resource twice)
  3. Add priority system (YAML specs override foundational specs when conflict)
  4. Create merger validation and conflict resolution
  5. Add comprehensive unit tests for augmentation logic
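
A minimal sketch of the deduplication rule from this step, assuming each CollectorSpec exposes a comparable key (the Key() method is hypothetical): YAML-specified collectors win whenever they target the same resource as a foundational collector.

// Merge vendor YAML collectors with foundational ones, letting YAML specs
// override foundational specs on conflict. Key() is an assumed identity method.
func mergeCollectors(yamlSpecs, foundational []CollectorSpec) []CollectorSpec {
    seen := make(map[string]bool, len(yamlSpecs))
    merged := append([]CollectorSpec{}, yamlSpecs...)
    for _, c := range yamlSpecs {
        seen[c.Key()] = true
    }
    for _, c := range foundational {
        if !seen[c.Key()] {
            merged = append(merged, c)
        }
    }
    return merged
}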

Step 4: Build RBAC Checking Engine

  1. Create rbac_checker.go with SelfSubjectAccessReview integration
  2. Add permission caching with TTL for performance
  3. Implement batch permission checking for efficiency
  4. Add fallback modes for clusters with limited RBAC visibility
  5. Create comprehensive RBAC test suite
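
The core of the RBAC check can be sketched with client-go's SelfSubjectAccessReview API, which is the mechanism named above; the helper below is illustrative rather than the project's rbac_checker.go, and caching/batching are omitted.

import (
    "context"

    authorizationv1 "k8s.io/api/authorization/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// canList reports whether the current user may list the given resource in the
// namespace by submitting a SelfSubjectAccessReview to the API server.
func canList(ctx context.Context, client kubernetes.Interface, namespace, resource string) (bool, error) {
    review := &authorizationv1.SelfSubjectAccessReview{
        Spec: authorizationv1.SelfSubjectAccessReviewSpec{
            ResourceAttributes: &authorizationv1.ResourceAttributes{
                Namespace: namespace,
                Verb:      "list",
                Resource:  resource, // e.g. "pods", "configmaps", "events"
            },
        },
    }
    resp, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, review, metav1.CreateOptions{})
    if err != nil {
        return false, err
    }
    return resp.Status.Allowed, nil
}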

Step 5: Add Image Metadata Collection

  1. Create pkg/collect/images/ package with registry client
  2. Implement manifest parsing for Docker v2 and OCI formats
  3. Add authentication support (Docker Hub, ECR, GCR, etc.)
  4. Create ImageFacts generation from manifest data
  5. Add error handling and retry logic for registry operations

Step 6: Integrate with Existing Collection Pipeline

  1. Modify existing pkg/collect/collect.go to support auto-discovery modes
  2. Add CLI integration for --auto flag (Path 1) and YAML+auto mode (Path 2)
  3. Create seamless integration with existing collector framework
  4. Add streaming integration with redaction pipeline
  5. Create facts.json output format and writer
  6. Implement progress reporting and user feedback
  7. Add configuration validation and error reporting

Component 2: Advanced Redaction with Tokenization

Objective

Enhance the existing redaction system (currently in pkg/redact/) with tokenization capabilities, optional local LLM assistance, and reversible redaction mapping for data owners.

Current State: The codebase has a functional redaction system with:

  • File-based redaction using regex patterns
  • Multiple redactor types (SingleLineRedactor, MultiLineRedactor, YamlRedactor, etc.)
  • Redaction tracking and reporting via RedactionList
  • Integration with collection pipeline

Requirements

  • Streaming redaction: Enhance existing system to work as streaming step during collection
  • Tokenization: Replace sensitive values with consistent tokens for traceability (new capability)
  • LLM assistance: Optional local LLM for intelligent redaction detection (new capability)
  • Reversible mapping: Generate redaction-map.json for token reversal by data owners (new capability)
  • Performance: Maintain/improve performance of existing system for large support bundles
  • Profiles: Extend existing redactor configuration with redaction profiles

Technical Specifications

2.1 Redaction Engine Architecture

Location: pkg/redact/

Core Components:

  • engine.go - Main redaction orchestrator
  • tokenizer.go - Token generation and mapping
  • processors/ - File type specific processors
  • llm/ - Local LLM integration (optional)
  • profiles/ - Pre-defined redaction profiles

API Contract:

type RedactionEngine interface {
    ProcessStream(ctx context.Context, input io.Reader, output io.Writer, opts RedactionOptions) (*RedactionMap, error)
    GenerateTokens(ctx context.Context, values []string) (map[string]string, error)
    LoadProfile(name string) (*RedactionProfile, error)
}

type RedactionOptions struct {
    Profile        string
    EnableLLM      bool
    TokenPrefix    string
    StreamMode     bool
    PreserveFormat bool
}

type RedactionMap struct {
    Tokens    map[string]string `json:"tokens"`    // token -> original value
    Stats     RedactionStats    `json:"stats"`     // redaction statistics
    Timestamp time.Time         `json:"timestamp"` // when redaction was performed
    Profile   string            `json:"profile"`   // profile used
}

2.2 Tokenization System

Location: pkg/redact/tokenizer.go

Features:

  • Consistent token generation for same values
  • Configurable token formats and prefixes
  • Token collision detection and resolution
  • Metadata preservation (type hints, length preservation)

Token Format:

***TOKEN_<TYPE>_<HASH>***
Examples:
- ***TOKEN_PASSWORD_A1B2C3***
- ***TOKEN_EMAIL_X7Y8Z9***
- ***TOKEN_IP_D4E5F6***
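
A minimal sketch of consistent, hash-based token generation matching the format above. The keyed HMAC-SHA256 and the 6-character hash length are assumptions chosen for illustration, not mandated algorithms.

// Deterministic token generation: the same (type, value) pair always yields
// the same token, so redacted values stay correlatable across files. The HMAC
// key keeps tokens unpredictable to readers of the bundle.
func makeToken(key []byte, valueType, value string) string {
    mac := hmac.New(sha256.New, key)
    mac.Write([]byte(valueType + ":" + value))
    hash := strings.ToUpper(hex.EncodeToString(mac.Sum(nil))[:6])
    return fmt.Sprintf("***TOKEN_%s_%s***", strings.ToUpper(valueType), hash)
}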

2.3 LLM Integration (Optional)

Location: pkg/redact/llm/

Supported Models:

  • Ollama integration for local models
  • OpenAI compatible APIs
  • Hugging Face transformers (via local API)

LLM Tasks:

  • Intelligent sensitive data detection
  • Context-aware redaction decisions
  • False positive reduction
  • Custom pattern learning

Implementation Checklist

Phase 1: Enhanced Redaction Engine (Week 1-2)

  • Core Engine Refactoring

    • Refactor existing pkg/redact to support streaming
    • Create new RedactionEngine interface
    • Implement streaming processor for different file types
    • Add configurable processing pipelines
  • Tokenization Implementation

    • Create Tokenizer with consistent hash-based token generation
    • Implement token mapping and reverse lookup
    • Add token format configuration and validation
    • Create collision detection and resolution
  • File Type Processors

    • Create specialized processors for JSON, YAML, logs, config files
    • Add context-aware redaction (e.g., preserve YAML structure)
    • Implement streaming processing for large files
    • Add error recovery and partial redaction support
  • Unit Testing

    • Test RedactionEngine with various input stream types and sizes
    • Test Tokenizer consistency - same input produces same tokens
    • Test token collision detection and resolution algorithms
    • Test file type processors with malformed/corrupted input files
    • Test streaming redaction performance with large files (GB scale)
    • Test error recovery and partial redaction scenarios
    • Test redaction map generation and serialization
    • Test token format validation and configuration options

Phase 2: Redaction Profiles (Week 3)

  • Profile System

    • Create RedactionProfile data structure and parser
    • Implement built-in profiles (minimal, standard, comprehensive, paranoid)
    • Add profile validation and testing
    • Create profile override and customization system
  • Profile Definitions

    • Minimal: Basic passwords, API keys, tokens
    • Standard: + IP addresses, URLs, email addresses
    • Comprehensive: + usernames, hostnames, file paths
    • Paranoid: + any alphanumeric strings > 8 chars, custom patterns
  • Configuration

    • Add profile selection to support bundle specs
    • Create profile inheritance and composition
    • Implement runtime profile switching
    • Add profile documentation and examples
  • Unit Testing

    • Test redaction profile parsing and validation
    • Test profile inheritance and composition logic
    • Test built-in profiles (minimal, standard, comprehensive, paranoid)
    • Test custom profile creation and validation
    • Test profile override and customization mechanisms
    • Test runtime profile switching without state corruption
    • Test profile configuration serialization/deserialization
    • Test profile pattern matching accuracy and coverage

Phase 3: LLM Integration (Week 4)

  • LLM Framework

    • Create LLMProvider interface for different backends
    • Implement Ollama integration for local models
    • Add OpenAI-compatible API client
    • Create fallback modes when LLM is unavailable
  • Intelligent Detection

    • Design prompts for sensitive data detection
    • Implement confidence scoring for LLM suggestions
    • Add human-readable explanation generation
    • Create feedback loop for improving detection
  • Privacy & Security

    • Ensure LLM processing respects data locality
    • Add data minimization for LLM requests
    • Implement secure prompt injection prevention
    • Create audit logging for LLM interactions
  • Unit Testing

    • Test LLMProvider interface implementations for different backends
    • Test LLM prompt generation and response parsing
    • Test confidence scoring algorithms for LLM suggestions
    • Test fallback mechanisms when LLM services are unavailable
    • Test prompt injection prevention with malicious inputs
    • Test data minimization - only necessary data sent to LLM
    • Test LLM response validation and sanitization
    • Test audit logging completeness and security

Phase 4: Integration & Artifacts (Week 5)

  • Collection Integration

    • Integrate redaction engine into collection pipeline
    • Add streaming redaction during data collection
    • Implement progress reporting for redaction operations
    • Add redaction statistics and reporting
  • Artifact Generation

    • Implement redaction-map.json generation and format
    • Add redaction statistics to support bundle metadata
    • Create redaction audit trail and logging
    • Implement secure token storage and encryption options
  • Unit Testing

    • Test redaction integration with existing collection pipeline
    • Test streaming redaction performance during data collection
    • Test progress reporting accuracy and timing
    • Test redaction-map.json format compliance and validation
    • Test redaction statistics calculation and accuracy
    • Test redaction audit trail completeness
    • Test secure token storage encryption/decryption
    • Test error handling during redaction pipeline failures

Testing Strategy

  • Unit Tests

    • Token generation and collision handling
    • File type processor accuracy
    • Profile loading and validation
    • LLM integration mocking
  • Integration Tests

    • End-to-end redaction with real support bundles
    • LLM provider integration testing
    • Performance testing with large files
    • Streaming redaction pipeline validation
  • Security Tests

    • Token uniqueness and unpredictability
    • Redaction completeness verification
    • Information leakage prevention
    • LLM prompt injection resistance

Step-by-Step Implementation

Step 1: Streaming Redaction Foundation

  1. Analyze existing redaction code in pkg/redact
  2. Design streaming architecture with io.Reader/Writer interfaces
  3. Create RedactionEngine interface and base implementation
  4. Implement file type detection and routing
  5. Add comprehensive unit tests for streaming operations
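
As a sketch of the streaming approach (not the real processor), a line-oriented pass over an io.Reader/io.Writer pair could look like the following; the regex set and token callback are placeholders, and real processors would be file-type aware.

// Streaming, line-oriented redaction pass. Patterns and the token function
// are supplied by the caller; output is written as lines are processed.
func redactStream(in io.Reader, out io.Writer, patterns []*regexp.Regexp, token func(string) string) error {
    scanner := bufio.NewScanner(in)
    scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
    w := bufio.NewWriter(out)
    for scanner.Scan() {
        line := scanner.Text()
        for _, re := range patterns {
            line = re.ReplaceAllStringFunc(line, token)
        }
        if _, err := w.WriteString(line + "\n"); err != nil {
            return err
        }
    }
    if err := scanner.Err(); err != nil {
        return err
    }
    return w.Flush()
}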

Step 2: Tokenization System

  1. Create Tokenizer with hash-based consistent token generation
  2. Implement token mapping data structures and serialization
  3. Add token format configuration and validation
  4. Create collision detection and resolution algorithms
  5. Add comprehensive testing for token consistency and security

Step 3: File Type Processors

  1. Create processor interface and registry system
  2. Implement JSON processor with path-aware redaction
  3. Add YAML processor with structure preservation
  4. Create log file processor with context awareness
  5. Add configuration file processors for common formats

Step 4: Redaction Profiles

  1. Design profile schema and configuration format
  2. Implement built-in profile definitions
  3. Create profile loading, validation, and inheritance system
  4. Add profile documentation and examples
  5. Create comprehensive profile testing suite
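
The profile schema designed in step 1 could take a shape like the following sketch; every field name here is an assumption, not a committed format.

// Illustrative redaction profile shape supporting inheritance and LLM opt-in.
type RedactionProfile struct {
    Name      string   `json:"name"`                // minimal, standard, comprehensive, paranoid
    Inherits  string   `json:"inherits,omitempty"`  // optional parent profile
    Patterns  []string `json:"patterns"`            // regexes applied by the engine
    EnableLLM bool     `json:"enableLLM"`           // opt-in LLM-assisted pass
}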

Step 5: LLM Integration (Optional)

  1. Create LLM provider interface and abstraction layer
  2. Implement Ollama integration for local models
  3. Design prompts for sensitive data detection
  4. Add confidence scoring and human-readable explanations
  5. Create comprehensive privacy and security safeguards
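
For the Ollama integration, a minimal non-streaming call to the local /api/generate endpoint might look like the sketch below; the model name and prompt wording are assumptions, and confidence scoring is omitted.

// Sketch of a local Ollama call for sensitive-data detection. The request and
// response shapes follow Ollama's /api/generate API; the prompt is illustrative.
type ollamaRequest struct {
    Model  string `json:"model"`
    Prompt string `json:"prompt"`
    Stream bool   `json:"stream"`
}

type ollamaResponse struct {
    Response string `json:"response"`
}

func detectSensitive(ctx context.Context, snippet string) (string, error) {
    body, err := json.Marshal(ollamaRequest{
        Model:  "llama3", // assumed local model
        Prompt: "List any secrets, credentials, or PII in the following text:\n" + snippet,
        Stream: false,
    })
    if err != nil {
        return "", err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://localhost:11434/api/generate", bytes.NewReader(body))
    if err != nil {
        return "", err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    var out ollamaResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return "", err
    }
    return out.Response, nil
}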

Step 6: Integration and Artifacts

  1. Integrate redaction engine into support bundle collection
  2. Implement redaction-map.json generation and format
  3. Add CLI flags for redaction options and profiles
  4. Create comprehensive documentation and examples
  5. Add performance monitoring and optimization

Component 3: Agent-Based Analysis

Objective

Enhance the existing analysis system (currently in pkg/analyze/) with agent-based capabilities and analyzer generation from requirements. This addresses the overview requirement for "Analyzer via agents (local/hosted) and 'generate analyzers from requirements'".

Current State: The codebase has a comprehensive analysis system with:

  • 60+ built-in analyzers for various Kubernetes resources and conditions
  • Host analyzers for system-level checks
  • Structured analyzer results (AnalyzeResult type)
  • Analysis download and local bundle processing
  • Integration with support bundle collection
  • JSON/YAML output formatting

Requirements

  • Agent abstraction: Wrap existing analyzers and support local, hosted, and future agent types
  • Analyzer generation: Create analyzers from requirement specifications (new capability)
  • Analysis artifacts: Enhance existing results to generate structured analysis.json with remediation
  • Offline capability: Maintain current local analysis capabilities
  • Extensibility: Add plugin architecture for custom analysis engines while preserving existing analyzers

Technical Specifications

3.1 Analysis Engine Architecture

Location: pkg/analyze/

Core Components:

  • engine.go - Analysis orchestrator
  • agents/ - Agent implementations (local, hosted, custom)
  • generators/ - Analyzer generation from requirements
  • artifacts/ - Analysis result formatting and serialization

API Contract:

type AnalysisEngine interface {
    Analyze(ctx context.Context, bundle *SupportBundle, opts AnalysisOptions) (*AnalysisResult, error)
    GenerateAnalyzers(ctx context.Context, requirements *RequirementSpec) ([]AnalyzerSpec, error)
    RegisterAgent(name string, agent Agent) error
}

type Agent interface {
    Name() string
    Analyze(ctx context.Context, data []byte, analyzers []AnalyzerSpec) (*AgentResult, error)
    HealthCheck(ctx context.Context) error
    Capabilities() []string
}

type AnalysisResult struct {
    Results     []AnalyzerResult  `json:"results"`
    Remediation []RemediationStep `json:"remediation"`
    Summary     AnalysisSummary   `json:"summary"`
    Metadata    AnalysisMetadata  `json:"metadata"`
}
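
An illustrative caller for the contract above; NewEngine, NewLocalAgent, and the Agent field on AnalysisOptions are assumed names rather than existing APIs.

// Hypothetical wiring of the engine with a local agent.
func runAnalysis(ctx context.Context, bundle *SupportBundle) (*AnalysisResult, error) {
    engine := NewEngine()
    if err := engine.RegisterAgent("local", NewLocalAgent()); err != nil {
        return nil, err
    }
    return engine.Analyze(ctx, bundle, AnalysisOptions{Agent: "local"})
}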

3.2 Agent Types

3.2.1 Local Agent

Location: pkg/analyze/agents/local/

Features:

  • Built-in analyzer implementations
  • No external dependencies
  • Fast execution and offline capability
  • Extensible through plugins

3.2.2 Hosted Agent

Location: pkg/analyze/agents/hosted/

Features:

  • REST API integration with hosted analysis services
  • Advanced ML/AI capabilities
  • Cloud-scale processing
  • Authentication and rate limiting

3.2.3 LLM Agent (Optional)

Location: pkg/analyze/agents/llm/

Features:

  • Local or cloud LLM integration
  • Natural language analysis descriptions
  • Context-aware remediation suggestions
  • Multi-modal analysis (text, logs, configs)

3.3 Analyzer Generation

Location: pkg/analyze/generators/

Requirements-to-Analyzers Mapping:

type RequirementSpec struct {
    APIVersion string                 `json:"apiVersion"`
    Kind       string                 `json:"kind"`
    Metadata   RequirementMetadata    `json:"metadata"`
    Spec       RequirementSpecDetails `json:"spec"`
}

type RequirementSpecDetails struct {
    Kubernetes KubernetesRequirements `json:"kubernetes"`
    Resources  ResourceRequirements   `json:"resources"`
    Storage    StorageRequirements    `json:"storage"`
    Network    NetworkRequirements    `json:"network"`
    Custom     []CustomRequirement    `json:"custom"`
}
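
Rule-based generation can be sketched as a mapping from requirement fields to analyzer specs, loosely mirroring the existing clusterVersion analyzer; the MinVersion field and the AnalyzerSpec/Outcome shapes below are assumptions for illustration.

// Generate a cluster-version check from a minimum Kubernetes version requirement.
func generateKubernetesAnalyzers(req KubernetesRequirements) []AnalyzerSpec {
    var specs []AnalyzerSpec
    if req.MinVersion != "" {
        specs = append(specs, AnalyzerSpec{
            Type: "clusterVersion",
            Outcomes: []Outcome{
                {When: "< " + req.MinVersion, Result: "fail", Message: "Kubernetes version is below the required minimum"},
                {When: ">= " + req.MinVersion, Result: "pass", Message: "Kubernetes version meets the requirement"},
            },
        })
    }
    return specs
}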

Implementation Checklist

Phase 1: Analysis Engine Foundation (Week 1-2)

  • Engine Architecture

    • Create pkg/analyze/ package structure
    • Design and implement AnalysisEngine interface
    • Create agent registry and management system
    • Add analysis result formatting and serialization
  • Local Agent Implementation

    • Create LocalAgent with built-in analyzer implementations
    • Port existing analyzer logic to new agent framework
    • Add plugin loading system for custom analyzers
    • Implement performance optimization and caching
  • Analysis Artifacts

    • Design analysis.json schema and format
    • Implement result aggregation and summarization
    • Add analysis metadata and provenance tracking
    • Create structured error handling and reporting
  • Unit Testing

    • Test AnalysisEngine interface implementations
    • Test agent registry and management system functionality
    • Test LocalAgent with various built-in analyzers
    • Test analysis result formatting and serialization
    • Test result aggregation algorithms and accuracy
    • Test error handling for malformed analyzer inputs
    • Test analysis metadata and provenance tracking
    • Test plugin loading system with mock plugins

Phase 2: Hosted Agent Integration (Week 3)

  • Hosted Agent Framework

    • Create HostedAgent with REST API integration
    • Implement authentication and authorization
    • Add rate limiting and retry logic
    • Create configuration management for hosted endpoints
  • API Integration

    • Design hosted agent API specification
    • Implement request/response handling
    • Add data serialization and compression
    • Create secure credential management
  • Fallback Mechanisms

    • Implement graceful degradation when hosted agents unavailable
    • Add local fallback for critical analyzers
    • Create hybrid analysis modes
    • Add user notification for service limitations
  • Unit Testing

    • Test HostedAgent REST API integration with mock servers
    • Test authentication and authorization with various providers
    • Test rate limiting and retry logic with simulated failures
    • Test request/response handling and data serialization
    • Test fallback mechanisms when hosted agents are unavailable
    • Test hybrid analysis mode coordination and result merging
    • Test secure credential management and rotation
    • Test analysis quality assessment algorithms

Phase 3: Analyzer Generation (Week 4)

  • Requirements Parser

    • Create RequirementSpec parser and validator
    • Implement requirement categorization and mapping
    • Add support for vendor and Replicated requirement specs
    • Create requirement merging and conflict resolution
  • Generator Framework

    • Design analyzer generation templates
    • Implement rule-based analyzer creation
    • Add analyzer validation and testing
    • Create generated analyzer documentation
  • Integration

    • Integrate generator with analysis engine
    • Add CLI flags for analyzer generation
    • Create generated analyzer debugging and validation
    • Add generator configuration and customization
  • Unit Testing

    • Test requirement specification parsing with various input formats
    • Test analyzer generation from requirement specifications
    • Test requirement-to-analyzer mapping algorithms
    • Test custom analyzer template generation and validation
    • Test analyzer code generation quality and correctness
    • Test generated analyzer testing and validation frameworks
    • Test requirement specification validation and error reporting
    • Test analyzer generation performance and scalability

Phase 4: Remediation & Advanced Features (Week 5)

  • Remediation System

    • Design RemediationStep data structure
    • Implement remediation suggestion generation
    • Add remediation prioritization and categorization
    • Create remediation execution framework (future)
  • Advanced Analysis

    • Add cross-analyzer correlation and insights
    • Implement trend analysis and historical comparison
    • Create analysis confidence scoring
    • Add analysis explanation and reasoning
  • Unit Testing

    • Test RemediationStep data structure and serialization
    • Test remediation suggestion generation algorithms
    • Test remediation prioritization and categorization logic
    • Test cross-analyzer correlation algorithms
    • Test trend analysis and historical comparison accuracy
    • Test analysis confidence scoring calculations
    • Test analysis explanation and reasoning generation
    • Test remediation framework extensibility and plugin system

Testing Strategy

  • Unit Tests

    • Agent interface compliance
    • Analysis result serialization
    • Analyzer generation logic
    • Remediation suggestion accuracy
  • Integration Tests

    • End-to-end analysis with real support bundles
    • Hosted agent API integration
    • Analyzer generation from real requirements
    • Multi-agent analysis coordination
  • Performance Tests

    • Large support bundle analysis performance
    • Concurrent agent execution
    • Memory usage during analysis
    • Hosted agent latency and throughput

Step-by-Step Implementation

Step 1: Analysis Engine Foundation

  1. Create package structure: pkg/analyze/
  2. Define AnalysisEngine and Agent interfaces
  3. Implement basic analysis orchestration
  4. Create agent registry and management
  5. Add comprehensive unit tests

Step 2: Local Agent Implementation

  1. Create LocalAgent struct and implementation
  2. Port existing analyzer logic to agent framework
  3. Add plugin system for custom analyzers
  4. Implement result caching and optimization
  5. Create comprehensive test suite

Step 3: Analysis Artifacts

  1. Design analysis.json schema and validation
  2. Implement result serialization and formatting
  3. Add analysis metadata and provenance
  4. Create structured error handling
  5. Add comprehensive format validation

Step 4: Hosted Agent Integration

  1. Create HostedAgent with REST API client
  2. Implement authentication and rate limiting
  3. Add fallback and error handling
  4. Create configuration management
  5. Add integration testing with mock services

Step 5: Analyzer Generation

  1. Create RequirementSpec parser and validator
  2. Implement analyzer generation templates
  3. Add rule-based analyzer creation logic
  4. Create analyzer validation and testing
  5. Add comprehensive generation testing

Step 6: Remediation System

  1. Design remediation data structures
  2. Implement suggestion generation algorithms
  3. Add remediation prioritization and categorization
  4. Create comprehensive documentation
  5. Add remediation testing and validation
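
One possible shape for the remediation data structure designed in step 1; every field name here is an assumption.

// Illustrative remediation step surfaced alongside analyzer results.
type RemediationStep struct {
    Description string   `json:"description"`          // what to change and why
    Command     string   `json:"command,omitempty"`    // optional kubectl/helm command to run
    Priority    int      `json:"priority"`             // lower value = more urgent
    Category    string   `json:"category"`             // e.g. configuration, capacity, security
    References  []string `json:"references,omitempty"` // docs or runbook links
}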

Component 4: Support Bundle Differencing

Objective

Implement comprehensive support bundle comparison and differencing capabilities to track changes over time and identify issues through comparison. This is a completely NEW capability not present in the current codebase.

Current State: The codebase has support bundle parsing utilities in pkg/supportbundle/parse.go that can extract and read bundle contents, but no comparison or differencing capabilities.

Requirements

  • Bundle comparison: Compare two support bundles with detailed diff output (completely new)
  • Change categorization: Categorize changes by type and impact (new)
  • Diff artifacts: Generate structured diff.json for programmatic consumption (new)
  • Visualization: Human-readable diff reports (new)
  • Performance: Handle large bundles efficiently using existing parsing utilities

Technical Specifications

4.1 Diff Engine Architecture

Location: pkg/supportbundle/diff/

Core Components:

  • engine.go - Main diff orchestrator
  • comparators/ - Type-specific comparison logic
  • formatters/ - Output formatting (JSON, HTML, text)
  • filters/ - Diff filtering and noise reduction

API Contract:

type DiffEngine interface {
    Compare(ctx context.Context, oldBundle, newBundle *SupportBundle, opts DiffOptions) (*BundleDiff, error)
    GenerateReport(ctx context.Context, diff *BundleDiff, format string) (io.Reader, error)
}

type BundleDiff struct {
    Summary      DiffSummary         `json:"summary"`
    Changes      []Change            `json:"changes"`
    Metadata     DiffMetadata        `json:"metadata"`
    Significance SignificanceReport  `json:"significance"`
}

type Change struct {
    Type        ChangeType         `json:"type"`        // added, removed, modified
    Category    string             `json:"category"`    // resource, log, config, etc.
    Path        string             `json:"path"`        // file path or resource path
    Impact      ImpactLevel        `json:"impact"`      // high, medium, low, none
    Details     map[string]any     `json:"details"`     // change-specific details
    Remediation *RemediationStep   `json:"remediation,omitempty"`
}
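
The ChangeType, ImpactLevel, and RemediationStep types referenced above are not spelled out in this contract. One plausible shape, offered as an assumption rather than the final definition, is:

// Supporting types for the Change struct above; values are illustrative assumptions.
type ChangeType string

const (
    ChangeTypeAdded    ChangeType = "added"
    ChangeTypeRemoved  ChangeType = "removed"
    ChangeTypeModified ChangeType = "modified"
)

type ImpactLevel string

const (
    ImpactHigh   ImpactLevel = "high"
    ImpactMedium ImpactLevel = "medium"
    ImpactLow    ImpactLevel = "low"
    ImpactNone   ImpactLevel = "none"
)

type RemediationStep struct {
    Description string `json:"description"`       // what to change and why
    Command     string `json:"command,omitempty"` // optional command to run
    DocsURL     string `json:"docsUrl,omitempty"` // link to relevant documentation
}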

4.2 Comparison Types

4.2.1 Resource Comparisons
  • Kubernetes resource specifications
  • Resource status and health changes
  • Configuration drift detection
  • RBAC and security policy changes
4.2.2 Log Comparisons
  • Error pattern analysis
  • Log volume and frequency changes
  • New error types and patterns
  • Performance metric changes
4.2.3 Configuration Comparisons
  • Configuration file changes
  • Environment variable differences
  • Secret and ConfigMap modifications
  • Application configuration drift

Implementation Checklist

Phase 1: Diff Engine Foundation (Week 1-2)

  • Core Engine

    • Create pkg/supportbundle/diff/ package structure
    • Implement DiffEngine interface and base implementation
    • Create bundle loading and parsing utilities
    • Add diff metadata and tracking
  • Change Detection

    • Implement file-level change detection
    • Create content comparison utilities
    • Add change categorization and classification
    • Implement impact assessment algorithms
  • Data Structures

    • Define BundleDiff and related data structures
    • Create change serialization and deserialization
    • Add diff statistics and summary generation
    • Implement diff validation and consistency checks
  • Unit Testing

    • Test DiffEngine with various support bundle pairs
    • Test bundle loading and parsing utilities with different formats
    • Test file-level change detection algorithms
    • Test content comparison utilities with binary and text files
    • Test change categorization and classification accuracy
    • Test BundleDiff data structure serialization/deserialization
    • Test diff statistics calculation and accuracy
    • Test diff validation and consistency check algorithms

Phase 2: Specialized Comparators (Week 3)

  • Resource Comparator

    • Create Kubernetes resource diff logic
    • Add YAML/JSON structural comparison
    • Implement semantic resource analysis
    • Add resource health status comparison
  • Log Comparator

    • Create log file comparison utilities
    • Add error pattern extraction and comparison
    • Implement log volume analysis
    • Create performance metric comparison
  • Configuration Comparator

    • Add configuration file diff logic
    • Create environment variable comparison
    • Implement secret and sensitive data handling
    • Add configuration drift detection
  • Unit Testing

    • Test Kubernetes resource diff logic with various resource types
    • Test YAML/JSON structural comparison algorithms
    • Test semantic resource analysis and health status comparison
    • Test log file comparison utilities with different log formats
    • Test error pattern extraction and comparison accuracy
    • Test log volume analysis algorithms
    • Test configuration file diff logic with various config formats
    • Test sensitive data handling in configuration comparisons

Phase 3: Output and Visualization (Week 4)

  • Diff Artifacts

    • Implement diff.json generation and format
    • Add diff metadata and provenance
    • Create diff validation and schema
    • Add diff compression and storage
  • Report Generation

    • Create HTML diff reports with visualization
    • Add interactive diff navigation and filtering
    • Implement diff report customization and theming
    • Create diff report export and sharing capabilities
    • Add text-based diff output
    • Implement diff filtering and noise reduction
    • Create diff summary and executive reports
  • Unit Testing

    • Test diff.json generation and format validation
    • Test diff metadata and provenance tracking
    • Test diff compression and storage mechanisms
    • Test HTML diff report generation with various diff types
    • Test interactive diff navigation functionality
    • Test diff report customization and theming options
    • Test diff visualization accuracy and clarity
    • Test diff report export formats and compatibility

Phase 4: CLI Integration (Week 5)

  • Command Implementation

    • Add support-bundle diff command
    • Implement command-line argument parsing
    • Add progress reporting and user feedback
    • Create diff command validation and error handling
  • Configuration

    • Add diff configuration and profiles
    • Create diff ignore patterns and filters
    • Implement diff output customization
    • Add diff performance optimization options

Step-by-Step Implementation

Step 1: Diff Engine Foundation

  1. Create package structure: pkg/supportbundle/diff/
  2. Design DiffEngine interface and core data structures
  3. Implement basic bundle loading and parsing
  4. Create change detection algorithms
  5. Add comprehensive unit tests

Step 2: Change Detection and Classification

  1. Implement file-level change detection
  2. Create content comparison utilities with different strategies
  3. Add change categorization and impact assessment
  4. Create change significance scoring
  5. Add comprehensive classification testing
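
As one way to approach the significance scoring in item 4, a naive weighting sketch is shown below. The weights and category names are assumptions and would need tuning against real bundles.

// Naive significance score: weight a change by its impact and category.
// Weights are placeholder assumptions, not tuned values.
func significanceScore(c Change) float64 {
    impactWeights := map[ImpactLevel]float64{
        "high":   1.0,
        "medium": 0.6,
        "low":    0.3,
        "none":   0.0,
    }
    categoryWeights := map[string]float64{
        "resource": 1.0, // spec/status changes tend to matter most
        "config":   0.8,
        "log":      0.5,
    }
    cw, ok := categoryWeights[c.Category]
    if !ok {
        cw = 0.5 // unknown categories get a neutral weight
    }
    return impactWeights[c.Impact] * cw
}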

Step 3: Specialized Comparators

  1. Create comparator interface and registry
  2. Implement resource comparator with semantic analysis
  3. Add log comparator with pattern analysis
  4. Create configuration comparator with drift detection
  5. Add comprehensive comparator testing
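
A minimal sketch of the comparator interface and registry from item 1. The interface shape is an assumption for illustration and should follow the diff engine design above.

// Comparator is implemented by resource, log, and configuration comparators.
type Comparator interface {
    Name() string
    CanCompare(path string) bool // does this comparator handle the given bundle file?
    Compare(ctx context.Context, oldData, newData []byte) ([]Change, error)
}

// comparatorRegistry holds comparators in registration order so the first match wins.
type comparatorRegistry struct {
    comparators []Comparator
}

func (r *comparatorRegistry) Register(c Comparator) {
    r.comparators = append(r.comparators, c)
}

func (r *comparatorRegistry) For(path string) (Comparator, bool) {
    for _, c := range r.comparators {
        if c.CanCompare(path) {
            return c, true
        }
    }
    return nil, false
}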

Step 4: Output Generation

  1. Implement diff.json schema and serialization
  2. Create HTML report generation with visualization
  3. Add text-based diff formatting
  4. Create diff filtering and noise reduction
  5. Add comprehensive output validation
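
To tie the output steps together, a hedged usage sketch of the DiffEngine contract defined earlier follows. The "json"/"text" format names and the writeFile helper are assumptions for illustration.

// Illustrative wiring of diff generation and report output.
func writeDiffArtifacts(ctx context.Context, engine DiffEngine, oldBundle, newBundle *SupportBundle) error {
    diff, err := engine.Compare(ctx, oldBundle, newBundle, DiffOptions{})
    if err != nil {
        return err
    }

    // diff.json is the machine-readable artifact named in the IO flow contract.
    jsonReport, err := engine.GenerateReport(ctx, diff, "json")
    if err != nil {
        return err
    }
    if err := writeFile("diff.json", jsonReport); err != nil { // writeFile is a hypothetical helper
        return err
    }

    // A text rendering for terminals and CI logs.
    textReport, err := engine.GenerateReport(ctx, diff, "text")
    if err != nil {
        return err
    }
    return writeFile("diff.txt", textReport)
}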

Step 5: CLI Integration

  1. Add diff command to support-bundle CLI
  2. Implement argument parsing and validation
  3. Add progress reporting and user-facing feedback
  4. Create comprehensive CLI testing
  5. Add documentation and examples
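
A minimal cobra wiring sketch for the new support-bundle diff subcommand (item 1 above). Flag names and the runDiff helper are assumptions and must not conflict with Person 1's CLI contract.

// Hypothetical cobra command; actual flags and wiring belong to the CLI contract.
func newDiffCommand() *cobra.Command {
    var outputFormat string

    cmd := &cobra.Command{
        Use:   "diff <old-bundle> <new-bundle>",
        Short: "Compare two support bundles and report the differences",
        Args:  cobra.ExactArgs(2),
        RunE: func(cmd *cobra.Command, args []string) error {
            // runDiff is a hypothetical helper that loads both bundles,
            // runs the DiffEngine, and writes diff.json plus the chosen report format.
            return runDiff(cmd.Context(), args[0], args[1], outputFormat)
        },
    }

    cmd.Flags().StringVarP(&outputFormat, "output", "o", "text", "report format: text, json, or html")
    return cmd
}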

Integration & Testing Strategy

Integration Contracts (Critical Constraints)

Person 2 is a CONSUMER of Person 1's work and must NOT alter schema definitions or CLI contracts.

Schema Contract (Owned by Person 1)

CRITICAL UPDATE: Based on current codebase analysis:

  • Current API Group: troubleshoot.replicated.com (NOT troubleshoot.sh)
  • Current Versions: v1beta1 and v1beta2 are available (NO v1beta3 exists yet)
  • Use ONLY troubleshoot.replicated.com/v1beta2 CRDs/YAML spec definitions until Person 1 provides a schema migration plan
  • Follow EXACTLY agreed-upon artifact filenames (analysis.json, diff.json, redaction-map.json, facts.json)
  • NO modifications to schema definitions, types, or API contracts
  • All schemas act as the cross-team contract with clear compatibility rules

CLI Contract (Owned by Person 1)

CRITICAL UPDATE: Based on current CLI structure analysis:

  • Current Structure: support-bundle (root/collect), support-bundle analyze, support-bundle redact
  • Existing Flags: --namespace, --redact, --collect-without-permissions, etc. already available
  • NEW Commands to Add: support-bundle diff (completely new)
  • NEW Flags to Add: --auto, --include-images, --rbac-check, --agent
  • NO changes to existing CLI surface area, help text, or command structure
  • Must integrate new capabilities into existing command structure

IO Flow Contract (Owned by Person 2)

  • Collect/analyze/diff operations read and write ONLY via defined schemas and filenames
  • Redaction runs as a streaming step during collection (no intermediate files)
  • All input/output must conform to Person 1's schema specifications

Golden Samples Contract

  • Use checked-in example specs and artifacts for contract testing
  • Ensure changes don't break consumers or violate schema contracts
  • Maintain backward compatibility with existing artifact formats

Cross-Component Integration

Collection → Redaction Pipeline

// Example integration flow (illustrative; surrounding types and helpers are assumed)
func CollectWithRedaction(ctx context.Context, opts CollectionOptions) (*SupportBundle, error) {
    // 1. Auto-discover collectors for the target namespace(s)
    collectors, err := autoCollector.Discover(ctx, opts.DiscoveryOptions)
    if err != nil {
        return nil, err
    }

    // 2. Collect with streaming redaction (no unredacted intermediate files)
    bundle := &SupportBundle{}
    for _, collector := range collectors {
        data, err := collector.Collect(ctx)
        if err != nil {
            // Best-effort collection: skip failed collectors and continue
            continue
        }

        // Redact the collected stream and capture the token map
        redactedData, redactionMap, err := redactionEngine.ProcessStream(ctx, data, opts.RedactionOptions)
        if err != nil {
            return nil, err
        }

        bundle.AddFile(collector.OutputPath(), redactedData)
        bundle.AddRedactionMap(redactionMap)
    }

    return bundle, nil
}
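
In this flow, collector errors are deliberately skipped so collection stays best-effort, while redaction errors abort the run: per the IO flow contract, redaction is a streaming step and no unredacted data may be written into the bundle.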

Analysis → Remediation Integration

// Example analysis to remediation flow (illustrative; surrounding types are assumed)
func AnalyzeWithRemediation(ctx context.Context, bundle *SupportBundle, opts AnalysisOptions) (*AnalysisResult, error) {
    // 1. Run analysis with the configured agent
    result, err := analysisEngine.Analyze(ctx, bundle, opts)
    if err != nil {
        return nil, err
    }

    // 2. Generate remediation suggestions for failed analyzers only
    for i, analyzerResult := range result.Results {
        if analyzerResult.IsFail() {
            remediation, err := generateRemediation(ctx, analyzerResult)
            if err == nil {
                // Remediation is best-effort: a generation failure leaves the result unchanged
                result.Results[i].Remediation = remediation
            }
        }
    }

    return result, nil
}

Comprehensive Testing Strategy

Unit Testing Requirements

  • Coverage Target: >80% code coverage for all components
  • Mock Dependencies: Mock all external dependencies (K8s API, registries, LLM APIs)
  • Error Scenarios: Test all error paths and edge cases
  • Performance: Unit benchmarks for critical paths

Integration Testing Requirements

  • End-to-End Flows: Complete collection → redaction → analysis → diff workflows
  • Real Cluster Testing: Integration with actual Kubernetes clusters
  • Large Bundle Testing: Performance with multi-GB support bundles
  • Network Conditions: Testing with limited/intermittent connectivity

Performance Testing Requirements

  • Memory Usage: Monitor memory consumption during large operations
  • CPU Utilization: Profile CPU usage for optimization opportunities
  • I/O Performance: Test with large files and slow storage
  • Concurrency: Test multi-threaded operations and race conditions

Security Testing Requirements

  • Redaction Completeness: Verify no sensitive data leakage
  • Token Security: Ensure token unpredictability and uniqueness
  • Access Control: Verify RBAC enforcement
  • Input Validation: Test against malicious inputs

Golden Sample Testing

  • Reference Bundles: Create standard test support bundles
  • Expected Outputs: Define expected analysis, diff, and redaction outputs
  • Regression Testing: Automated comparison against golden outputs
  • Schema Validation: Ensure all outputs conform to schemas
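
A small Go test sketch of the golden-sample approach: run the component against a checked-in reference bundle and compare the generated artifact with the stored golden output. The testdata paths and the runAnalysisOnBundle helper are assumptions.

// Hypothetical golden-sample regression test; paths and helper are assumptions.
func TestAnalysisMatchesGoldenOutput(t *testing.T) {
    // runAnalysisOnBundle is a hypothetical helper returning analysis.json bytes
    got, err := runAnalysisOnBundle("testdata/golden/reference-bundle.tgz")
    if err != nil {
        t.Fatalf("analysis failed: %v", err)
    }

    want, err := os.ReadFile("testdata/golden/analysis.json")
    if err != nil {
        t.Fatalf("reading golden file: %v", err)
    }

    if !bytes.Equal(got, want) {
        t.Errorf("analysis.json drifted from golden sample; update the golden file only if the change is intentional")
    }
}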

Documentation Requirements

User Documentation

  • Collection Guide: How to use auto-collectors and namespace scoping
  • Redaction Guide: Redaction profiles, tokenization, and LLM integration
  • Analysis Guide: Agent configuration and remediation interpretation
  • Diff Guide: Bundle comparison workflows and interpretation

Developer Documentation

  • API Documentation: Go doc comments for all public APIs
  • Architecture Guide: Component interaction and data flow
  • Extension Guide: How to add custom agents, analyzers, and processors
  • Performance Guide: Optimization techniques and benchmarks

Configuration Documentation

  • Schema Reference: Complete reference for all configuration options
  • Profile Examples: Example redaction and analysis profiles
  • Integration Examples: Sample integrations with CI/CD and monitoring

Timeline & Milestones

Month 1: Foundation

  • Week 1-2: Auto-collectors and RBAC integration
  • Week 3-4: Advanced redaction with tokenization

Month 2: Advanced Features

  • Week 5-6: Agent-based analysis system
  • Week 7-8: Support bundle differencing

Month 3: Integration & Polish

  • Week 9-10: Cross-component integration and testing
  • Week 11-12: Documentation, optimization, and release preparation

Key Milestones

  • M1: Auto-discovery working with RBAC (Week 2)
  • M2: Streaming redaction with tokenization (Week 4)
  • M3: Local and hosted agents functional (Week 6)
  • M4: Bundle diffing and remediation (Week 8)
  • M5: Full integration and testing complete (Week 10)
  • M6: Documentation and release ready (Week 12)

Success Criteria

Functional Requirements

  • support-bundle collect --namespace ns --auto produces complete bundles
  • Redaction with tokenization works with streaming pipeline
  • Analysis generates structured results with remediation
  • Bundle diffing produces actionable comparison reports

Performance Requirements

  • Auto-discovery completes in <30 seconds for typical clusters
  • Redaction processes 1GB+ bundles without memory issues
  • Analysis completes in <2 minutes for standard bundles
  • Diff generation completes in <1 minute for bundle pairs

Quality Requirements

  • >80% code coverage with comprehensive tests
  • Zero critical security vulnerabilities
  • Complete API documentation and user guides
  • Successful integration with Person 1's schema and CLI contracts

Final Integration Testing Phase

After all components are implemented and unit tested, conduct comprehensive integration testing to verify the complete system works together:

End-to-End Integration Testing

1. Complete Workflow Testing

  • Test full support-bundle collect --namespace ns --auto workflow
  • Test auto-discovery → collection → redaction → analysis → diff pipeline
  • Test CLI integration with real Kubernetes clusters
  • Test support bundle generation with all auto-discovered collectors
  • Test complete artifact generation (bundle.tgz, facts.json, redaction-map.json, analysis.json)

2. Cross-Component Integration

  • Test auto-discovery integration with image metadata collection
  • Test streaming redaction integration with collection pipeline
  • Test analysis engine integration with auto-discovered collectors and redacted data
  • Test support bundle diff functionality with complete bundles
  • Test remediation suggestions integration with analysis results

3. Real-World Scenario Testing

  • Test against real Kubernetes clusters with various configurations
  • Test with different RBAC permission levels and restrictions
  • Test with various application types (web apps, databases, microservices)
  • Test with large clusters (1000+ pods, 100+ namespaces)
  • Test with different container registries (Docker Hub, ECR, GCR, Harbor)

4. Performance and Reliability Integration

  • Test end-to-end performance with large, complex clusters
  • Test system reliability with network failures and API errors
  • Test memory usage and resource consumption across all components
  • Test concurrent operations and thread safety
  • Test scalability limits and graceful degradation under load

5. Security and Privacy Integration

  • Test RBAC enforcement across the entire pipeline
  • Test redaction effectiveness with real sensitive data
  • Test token reversibility and data owner access to redaction maps
  • Test LLM integration security and data locality compliance
  • Test audit trail completeness across all operations

6. User Experience Integration

  • Test CLI usability and help documentation
  • Test configuration file examples and documentation
  • Test error messages and user feedback across all components
  • Test progress reporting and operation status visibility
  • Test troubleshoot.sh ecosystem integration and compatibility

7. Artifact and Output Integration

  • Test support bundle format compliance and compatibility
  • Test analysis.json schema validation and tool compatibility
  • Test diff.json format and visualization integration
  • Test redaction-map.json usability and token reversal
  • Test facts.json integration with analysis and visualization tools

MAJOR CHANGES FROM ORIGINAL PRD

This section documents all critical changes made to align the PRD with the actual troubleshoot codebase:

1. API Schema Reality Check

  • CHANGED: API group from troubleshoot.sh/v1beta3 → troubleshoot.replicated.com/v1beta2
  • REASON: Current codebase only has v1beta1 and v1beta2, using troubleshoot.replicated.com group

2. Implementation Strategy Shift

  • CHANGED: From "build from scratch" → "extend existing systems"
  • REASON: Discovered mature, production-ready systems already exist
  • IMPACT: Faster implementation, better integration, lower risk

3. CLI Structure Alignment

  • CHANGED: Command structure from new support-bundle collect/analyze/diff commands → enhancements to the existing support-bundle root command and subcommands
  • REASON: Current structure already has support-bundle (collect), support-bundle analyze, support-bundle redact
  • NEW: Only support-bundle diff is completely new

4. Binary Architecture Reality

  • DISCOVERED: Multiple binaries already exist (preflight, support-bundle, collect, analyze)
  • IMPACT: Two-binary approach already partially implemented
  • FOCUS: Enhance existing support-bundle binary capabilities

5. Existing System Capabilities

  • Collection: 15+ collector types, RBAC integration, progress reporting
  • Redaction: Regex-based, multiple redactor types, tracking/reporting
  • Analysis: 60+ analyzers, host+cluster analysis, structured results
  • Support Bundle: Complete archiving, parsing, metadata system

6. Removed All Completion Markers

  • CHANGED: All completion markers (checked boxes and "done" annotations) → [ ] (pending)
  • REASON: Starting implementation from scratch despite existing foundation

7. Technical Approach Updates

  • Auto-collectors: NEW package extending existing collection framework with dual-path approach
  • Redaction: ENHANCE existing system with tokenization and streaming
  • Analysis: WRAP existing analyzers with agent abstraction layer
  • Diff: COMPLETELY NEW capability using existing bundle parsing

8. Auto-Collectors Foundational Data Definition

What "Foundational Data" Includes:

  • Pods: All pods in target namespace(s) with full spec and status
  • Deployments/ReplicaSets: All deployment resources and their managed replica sets
  • Services: All service definitions and endpoints
  • ConfigMaps: All configuration data (with redaction)
  • Secrets: All secret metadata (values redacted by default)
  • Events: Recent cluster events for troubleshooting context
  • Pod Logs: Container logs from all pods (with retention limits)
  • Image Facts: Container image metadata (digests, tags, registry info)
  • Network Policies: Any network policies affecting the namespace
  • RBAC: Relevant roles, role bindings, service accounts

This foundational collection ensures that even without vendor-specific YAML specs, support bundles contain the essential data needed for troubleshooting most Kubernetes issues.
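
For illustration only, the foundational set above could be expressed as a default list of resource types that discovery always enumerates; the exact representation is an implementation detail of the auto-collectors package, and pod logs and image facts are gathered by dedicated collectors rather than resource listing.

// Hypothetical default resource list for foundational auto-collection.
// Group/version/resource entries mirror the bullets above.
var foundationalResources = []schema.GroupVersionResource{
    {Group: "", Version: "v1", Resource: "pods"},
    {Group: "apps", Version: "v1", Resource: "deployments"},
    {Group: "apps", Version: "v1", Resource: "replicasets"},
    {Group: "", Version: "v1", Resource: "services"},
    {Group: "", Version: "v1", Resource: "configmaps"},
    {Group: "", Version: "v1", Resource: "secrets"},
    {Group: "", Version: "v1", Resource: "events"},
    {Group: "networking.k8s.io", Version: "v1", Resource: "networkpolicies"},
    {Group: "rbac.authorization.k8s.io", Version: "v1", Resource: "roles"},
    {Group: "rbac.authorization.k8s.io", Version: "v1", Resource: "rolebindings"},
    {Group: "", Version: "v1", Resource: "serviceaccounts"},
}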

This updated PRD provides a realistic, implementable roadmap that leverages existing production-ready code while adding the new capabilities specified in the original requirements. The implementation risk is significantly reduced, and the timeline is more achievable.