mirror of
https://github.com/replicatedhq/troubleshoot.git
synced 2026-02-14 18:29:53 +00:00
# Person 2 PRD: Collectors, Redaction, Analysis, Diff, Remediation

## CRITICAL CODEBASE ANALYSIS UPDATE

**This PRD has been updated based on comprehensive analysis of the current troubleshoot codebase. Key findings:**

### Current State Analysis
- **API Schema**: Current API group is `troubleshoot.replicated.com` (not `troubleshoot.sh`), with `v1beta1` and `v1beta2` available
- **Binary Structure**: Multiple binaries already exist (`preflight`, `support-bundle`, `collect`, `analyze`)
- **CLI Structure**: `support-bundle` root command exists with `analyze` and `redact` subcommands
- **Collection System**: Comprehensive collection framework in `pkg/collect/` with 15+ collector types
- **Redaction System**: Functional redaction system in `pkg/redact/` with multiple redactor types
- **Analysis System**: Mature analysis system in `pkg/analyze/` with 60+ built-in analyzers
- **Support Bundle**: Complete support bundle system in `pkg/supportbundle/` with archiving and processing

### Implementation Strategy
This PRD now focuses on **EXTENDING** existing systems rather than building from scratch:
- **Auto-collectors**: NEW package `pkg/collect/autodiscovery/` extending existing collection
- **Redaction tokenization**: ENHANCE existing `pkg/redact/` system
- **Agent-based analysis**: WRAP existing `pkg/analyze/` system with agent abstraction
- **Bundle differencing**: COMPLETELY NEW `pkg/supportbundle/diff/` capability
## Overview

Person 2 is responsible for the core data collection, processing, and analysis capabilities of the troubleshoot project. This involves implementing auto-collectors, advanced redaction with tokenization, agent-based analysis, support bundle differencing, and remediation suggestions.

## Scope & Responsibilities

- **Auto-collectors** (namespace-scoped, RBAC-aware), include image digests & tags
- **Redaction** with tokenization (optional local LLM-assisted pass), emit `redaction-map.json`
- **Analyzer** via agents (local/hosted) and "generate analyzers from requirements"
- **Support bundle diffs** and remediation suggestions

### Primary Code Areas
- `pkg/collect` - Collection engine and auto-collectors (extending existing collection system)
- `pkg/redact` - Redaction engine with tokenization (enhancing existing redaction system)
- `pkg/analyze` - Analysis engine and agent integration (extending existing analysis system)
- `pkg/supportbundle` - Bundle readers/writers and artifact management (extending existing support bundle system)
- `examples/*` - Reference implementations and test cases

**Critical API Contract**: All implementations must use ONLY the current API group `troubleshoot.replicated.com/v1beta2` types and be prepared for future migration to Person 1's planned schema updates. No schema modifications allowed.
## Deliverables

### Core Deliverables (Based on Current CLI Structure)
1. **`support-bundle --namespace ns --auto`** - enhance existing root command with auto-discovery capabilities
2. **Redaction/tokenization profiles** - streaming integration in collection path, emit `redaction-map.json`
3. **`support-bundle analyze --agent claude|local --bundle bundle.tgz`** - enhance existing analyze subcommand with agent support
4. **`support-bundle diff old.tgz new.tgz`** - NEW subcommand with structured `diff.json` output
5. **"Generate analyzers from requirements"** - create analyzers from requirement specifications
6. **Remediation blocks** - surfaced in analysis outputs with actionable suggestions

**Note**: The current CLI structure has `support-bundle` as the root collection command, with `analyze` and `redact` as subcommands. The `diff` subcommand will be newly added.

### Critical Implementation Constraints
- **NO schema alterations**: Person 2 consumes but never modifies schemas/types from Person 1
- **Streaming redaction**: Must run as streaming step during collection (per IO flow contract)
- **Exact CLI compliance**: Implement commands exactly as specified in CLI contracts
- **Artifact format compliance**: Follow exact naming conventions for all output files

---
## Component 1: Auto-Collectors

### Objective
Implement intelligent, namespace-scoped auto-collectors that enhance the current YAML-driven collection system with automatic foundational data discovery. This creates a dual-path collection strategy that ensures comprehensive troubleshooting data is always gathered.

### Dual-Path Collection Strategy

**Current System (YAML-only)**:
- Collects only what vendors specify in YAML collector specs
- Limited to predefined collector configurations
- May miss critical cluster state information

**New Auto-Collectors System**:
- **Path 1 - No YAML**: Automatically discover and collect foundational cluster data (logs, deployments, services, configmaps, secrets, events, etc.)
- **Path 2 - With YAML**: Collect vendor-specified YAML collectors PLUS automatically collect foundational data as well
- Always ensures comprehensive baseline data collection for effective troubleshooting

### Requirements
- **Foundational collection**: Always collect essential cluster resources (pods, deployments, services, configmaps, events, logs)
- **Namespace-scoped collection**: Respect namespace boundaries and permissions
- **RBAC-aware**: Only collect data the user has permission to access
- **Image metadata**: Include digests, tags, and repository information for discovered containers
- **Deterministic expansion**: Same cluster state should produce consistent foundational collection
- **YAML augmentation**: When YAML specs provided, add foundational collection to vendor-specified collectors
- **Streaming integration**: Work with redaction pipeline during collection

### Technical Specifications
#### 1.1 Auto-Discovery Engine
**Location**: `pkg/collect/autodiscovery/`

**Components**:
- `discoverer.go` - Main discovery orchestrator
- `rbac_checker.go` - Permission validation
- `namespace_scanner.go` - Namespace-aware resource enumeration
- `resource_expander.go` - Convert discovered resources to collector specs

**API Contract**:
```go
type AutoCollector interface {
	// Discover foundational collectors based on cluster state
	DiscoverFoundational(ctx context.Context, opts DiscoveryOptions) ([]CollectorSpec, error)
	// Augment existing YAML collectors with foundational collectors
	AugmentWithFoundational(ctx context.Context, yamlCollectors []CollectorSpec, opts DiscoveryOptions) ([]CollectorSpec, error)
	// Validate permissions for discovered resources
	ValidatePermissions(ctx context.Context, resources []Resource) ([]Resource, error)
}

type DiscoveryOptions struct {
	Namespaces       []string
	IncludeImages    bool
	RBACCheck        bool
	MaxDepth         int
	FoundationalOnly bool // Path 1: Only collect foundational data
	AugmentMode      bool // Path 2: Add foundational to existing YAML specs
}

type FoundationalCollectors struct {
	// Core Kubernetes resources always collected
	Pods        []PodCollector
	Deployments []DeploymentCollector
	Services    []ServiceCollector
	ConfigMaps  []ConfigMapCollector
	Secrets     []SecretCollector
	Events      []EventCollector
	Logs        []LogCollector
	// Container image metadata
	ImageFacts []ImageFactsCollector
}
```
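The augmentation path (Path 2) can be pictured as a merge with deduplication. The following is only a minimal sketch under stated assumptions: `CollectorSpec` is a simplified, hypothetical stand-in (its `Type`, `Namespace`, `Name`, and `Source` fields are not defined in this PRD), and it assumes YAML-specified collectors take precedence on conflict, as the "YAML overrides foundational" test item in the checklist suggests.

```go
package main

import "fmt"

// CollectorSpec is a simplified, hypothetical stand-in for the real spec type.
type CollectorSpec struct {
	Type      string
	Namespace string
	Name      string
	Source    string // "yaml" or "foundational" (illustrative only)
}

// key identifies collectors that would gather the same data.
func key(c CollectorSpec) string {
	return c.Type + "/" + c.Namespace + "/" + c.Name
}

// augment returns the YAML collectors followed by every foundational
// collector that does not duplicate one already in the result.
func augment(yamlCollectors, foundational []CollectorSpec) []CollectorSpec {
	seen := make(map[string]bool, len(yamlCollectors)+len(foundational))
	merged := make([]CollectorSpec, 0, len(yamlCollectors)+len(foundational))
	for _, group := range [][]CollectorSpec{yamlCollectors, foundational} {
		for _, c := range group {
			if seen[key(c)] {
				continue // an earlier (higher-priority) collector already covers this
			}
			seen[key(c)] = true
			merged = append(merged, c)
		}
	}
	return merged
}

func main() {
	yamlSpecs := []CollectorSpec{
		{Type: "logs", Namespace: "myapp", Name: "api", Source: "yaml"},
	}
	base := []CollectorSpec{
		{Type: "logs", Namespace: "myapp", Name: "api", Source: "foundational"},
		{Type: "events", Namespace: "myapp", Source: "foundational"},
	}
	for _, c := range augment(yamlSpecs, base) {
		fmt.Printf("%s %s/%s (%s)\n", c.Type, c.Namespace, c.Name, c.Source)
	}
}
```

Processing the YAML group first is what gives it priority; a real implementation would likely need a richer identity key than the three fields used here.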
#### 1.2 Image Metadata Collection
**Location**: `pkg/collect/images/`

**Components**:
- `registry_client.go` - Registry API integration
- `digest_resolver.go` - Convert tags to digests
- `manifest_parser.go` - Parse image manifests
- `facts_builder.go` - Build structured image facts

**Data Structure**:
```go
type ImageFacts struct {
	Repository string            `json:"repository"`
	Tag        string            `json:"tag"`
	Digest     string            `json:"digest"`
	Registry   string            `json:"registry"`
	Size       int64             `json:"size"`
	Created    time.Time         `json:"created"`
	Labels     map[string]string `json:"labels"`
	Platform   Platform          `json:"platform"`
}

type Platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
	Variant      string `json:"variant,omitempty"`
}
```
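Before any registry call, the `Registry`, `Repository`, `Tag`, and `Digest` fields have to be derived from the image string found on a container. The sketch below shows one simplified way to split such a reference. `ImageRef` and `parseImageRef` are hypothetical names, and the parser deliberately applies no defaults (no implied `docker.io`, `library/`, or `latest`), so it is an illustration rather than a complete reference parser.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageRef holds the parts of an image reference (hypothetical helper type).
type ImageRef struct {
	Registry   string
	Repository string
	Tag        string
	Digest     string
}

// parseImageRef splits "registry/repo:tag@digest" into its parts.
// Missing parts are left empty; no Docker Hub defaults are applied.
func parseImageRef(ref string) ImageRef {
	var out ImageRef
	// A digest, if present, follows the "@".
	if i := strings.Index(ref, "@"); i >= 0 {
		out.Digest = ref[i+1:]
		ref = ref[:i]
	}
	// A tag is a ":" that appears after the last "/" (otherwise it is a port).
	lastSlash := strings.LastIndex(ref, "/")
	if i := strings.LastIndex(ref, ":"); i > lastSlash {
		out.Tag = ref[i+1:]
		ref = ref[:i]
	}
	// The first path component is a registry host if it looks like one.
	if i := strings.Index(ref, "/"); i >= 0 {
		first := ref[:i]
		if strings.ContainsAny(first, ".:") || first == "localhost" {
			out.Registry = first
			ref = ref[i+1:]
		}
	}
	out.Repository = ref
	return out
}

func main() {
	for _, s := range []string{
		"registry.company.com/kotsadm/kotsadm:v1.0.0",
		"localhost:5000/app@sha256:abc",
		"nginx:1.25",
	} {
		fmt.Printf("%+v\n", parseImageRef(s))
	}
}
```

Checking for the tag only after the last `/` keeps a registry port such as `localhost:5000` from being mistaken for a tag.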
### Implementation Checklist

#### Phase 1: Core Auto-Discovery (Week 1-2)
- [ ] **Discovery Engine Setup**
  - [ ] Create `pkg/collect/autodiscovery/` package structure
  - [ ] Implement `Discoverer` interface and base implementation
  - [ ] Add Kubernetes client integration for resource enumeration
  - [ ] Create namespace filtering logic
  - [ ] Add discovery configuration parsing

- [ ] **RBAC Integration**
  - [ ] Implement `RBACChecker` for permission validation
  - [ ] Add `SelfSubjectAccessReview` integration
  - [ ] Create permission caching layer for performance (5min TTL)
  - [ ] Add fallback strategies for limited permissions

- [ ] **Resource Expansion**
  - [ ] Implement resource-to-collector mapping via `ResourceExpander`
  - [ ] Add standard resource patterns (pods, deployments, services, configmaps, secrets, events)
  - [ ] Create expansion rules configuration with priority system
  - [ ] Add dependency graph resolution and deduplication

- [ ] **Unit Testing** **ALL TESTS PASSING**
  - [ ] Test `Discoverer.DiscoverFoundational()` with mock Kubernetes clients
  - [ ] Test `RBACChecker.FilterByPermissions()` with various permission scenarios
  - [ ] Test namespace enumeration and filtering with different configurations
  - [ ] Test `ResourceExpander` with all foundational resource types
  - [ ] Test collector deduplication and conflict resolution (YAML overrides foundational)
  - [ ] Test error handling and graceful degradation scenarios
  - [ ] Test permission caching and RBAC integration
  - [ ] Test collector priority sorting and dual-path logic
#### Phase 2: Image Metadata Collection (Week 3)
- [ ] **Registry Integration**
  - [ ] Create `pkg/collect/images/` package
  - [ ] Implement registry client with authentication support (Docker Hub, ECR, GCR, Harbor, etc.)
  - [ ] Add manifest parsing for Docker v2 and OCI formats
  - [ ] Create digest resolution from tags

- [ ] **Facts Generation**
  - [ ] Implement `ImageFacts` data structure with comprehensive metadata
  - [ ] Add image scanning and metadata extraction (platform, layers, config)
  - [ ] Create facts serialization to JSON with `FactsBundle` format
  - [ ] Add error handling and fallback modes with `ContinueOnError`

- [ ] **Integration**
  - [ ] Integrate image collection into auto-discovery system
  - [ ] Add image facts to foundational collectors
  - [ ] Create `facts.json` output specification with summary statistics
  - [ ] Add Kubernetes image extraction from pods, deployments, daemonsets, statefulsets

- [ ] **Unit Testing** **ALL TESTS PASSING**
  - [ ] Test registry client authentication and factory patterns for different registry types
  - [ ] Test manifest parsing for Docker v2, OCI, and legacy v1 image formats
  - [ ] Test digest resolution and validation with various formats
  - [ ] Test `ImageFacts` data structure serialization/deserialization
  - [ ] Test image metadata extraction with comprehensive validation
  - [ ] Test error handling for network failures and authentication
  - [ ] Test concurrent collection with rate limiting and semaphores
  - [ ] Test image facts caching and deduplication logic with LRU cleanup
#### Phase 3: CLI Integration (Week 4)
**Note**: Current CLI structure has `--namespace` already available. Successfully added `--auto` flag and related options.

### CLI Usage Patterns for Dual-Path Approach

**Path 1 - Foundational Only (No YAML)**:
```bash
# Collect foundational data for default namespace
support-bundle --auto

# Collect foundational data for specific namespace(s)
support-bundle --auto --namespace myapp

# Include container image metadata
support-bundle --auto --namespace myapp --include-images

# Use comprehensive discovery profile
support-bundle --auto --discovery-profile comprehensive --include-images
```

**Path 2 - YAML + Foundational (Augmented)**:
```bash
# Collect vendor YAML specs + foundational data
support-bundle vendor-spec.yaml --auto

# Multiple YAML specs + foundational data
support-bundle spec1.yaml spec2.yaml --auto --namespace myapp

# Exclude system namespaces from foundational collection
support-bundle vendor-spec.yaml --auto --exclude-namespaces "kube-*,cattle-*"
```
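The comma-separated glob list passed to `--exclude-namespaces` above could be applied along the following lines. This is a sketch, not the project's implementation: `filterNamespaces` is a hypothetical helper, and it assumes shell-style globs as implemented by Go's standard `path.Match`.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// filterNamespaces drops every namespace that matches one of the
// comma-separated glob patterns in excludeFlag.
func filterNamespaces(namespaces []string, excludeFlag string) ([]string, error) {
	var patterns []string
	for _, p := range strings.Split(excludeFlag, ",") {
		if p = strings.TrimSpace(p); p != "" {
			patterns = append(patterns, p)
		}
	}
	var kept []string
	for _, ns := range namespaces {
		excluded := false
		for _, p := range patterns {
			matched, err := path.Match(p, ns)
			if err != nil {
				return nil, fmt.Errorf("invalid exclude pattern %q: %w", p, err)
			}
			if matched {
				excluded = true
				break
			}
		}
		if !excluded {
			kept = append(kept, ns)
		}
	}
	return kept, nil
}

func main() {
	all := []string{"default", "kube-system", "kube-public", "cattle-fleet", "myapp"}
	kept, err := filterNamespaces(all, "kube-*,cattle-*")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(kept)
}
```

Returning an error for a malformed pattern, rather than silently ignoring it, matches the checklist's emphasis on validating flag input.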
**Current Behavior (Preserved)**:
```bash
# Only collect what's in YAML (no foundational data added)
support-bundle vendor-spec.yaml
```

**New Diff Command**:
```bash
# Compare two support bundles
support-bundle diff old-bundle.tgz new-bundle.tgz

# Output to JSON file
support-bundle diff old.tgz new.tgz --output json -f diff-report.json

# Generate HTML report with remediation
support-bundle diff old.tgz new.tgz --output html --include-remediation
```
- [ ] **Command Enhancement**
  - [ ] Add `--auto` flag to `support-bundle` root command
  - [ ] Implement dual-path logic: no args + `--auto` = foundational only
  - [ ] Implement augmentation logic: YAML args + `--auto` = YAML + foundational
  - [ ] Integrate with existing `--namespace` filtering
  - [ ] Add `--include-images` option for container image metadata collection
  - [ ] Create `--rbac-check` validation mode (enabled by default)
  - [ ] Add `support-bundle diff` subcommand with full flag set

- [ ] **Configuration**
  - [ ] Add discovery profiles (minimal, standard, comprehensive, paranoid)
  - [ ] Add namespace exclusion/inclusion patterns with glob support
  - [ ] Implement dry-run mode integration for auto-discovery
  - [ ] Create discovery configuration file support with JSON format
  - [ ] Add profile-based timeout and collection behavior configuration

- [ ] **Unit Testing** **ALL TESTS PASSING**
  - [ ] Test CLI flag parsing and validation for all auto-discovery options
  - [ ] Test discovery profile loading and validation logic
  - [ ] Test dry-run mode integration and output
  - [ ] Test namespace filtering with glob patterns
  - [ ] Test command help text and flag descriptions
  - [ ] Test error handling for invalid CLI flag combinations
  - [ ] Test configuration file loading, validation, and fallbacks
  - [ ] Test dual-path mode detection and routing logic

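The namespace exclusion patterns in the checklist above (for example `--exclude-namespaces "kube-*,cattle-*"`) could be evaluated roughly as follows. This is a sketch that uses the standard library's `path.Match` for globbing; the function names are hypothetical and the actual implementation may use a different matcher.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ParsePatterns splits a comma-separated flag value such as "kube-*,cattle-*".
func ParsePatterns(flagValue string) []string {
	var out []string
	for _, p := range strings.Split(flagValue, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// Excluded reports whether ns matches any of the glob patterns.
func Excluded(ns string, patterns []string) (bool, error) {
	for _, p := range patterns {
		ok, err := path.Match(p, ns)
		if err != nil {
			return false, fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

// FilterNamespaces returns the namespaces that no pattern excludes,
// preserving input order.
func FilterNamespaces(all, patterns []string) ([]string, error) {
	var kept []string
	for _, ns := range all {
		skip, err := Excluded(ns, patterns)
		if err != nil {
			return nil, err
		}
		if !skip {
			kept = append(kept, ns)
		}
	}
	return kept, nil
}

func main() {
	patterns := ParsePatterns("kube-*,cattle-*")
	kept, err := FilterNamespaces(
		[]string{"default", "kube-system", "cattle-fleet", "app"}, patterns)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(kept)
}
```
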
### Testing Strategy
- [ ] **Unit Tests** **ALL PASSING**
  - [ ] RBAC checker with mock Kubernetes API
  - [ ] Resource expansion logic and deduplication
  - [ ] Image metadata parsing and registry integration
  - [ ] Discovery configuration validation and pattern matching
  - [ ] CLI flag validation and profile loading
  - [ ] Bundle diff validation and output formatting

- [ ] **Integration Tests** **IMPLEMENTED**
  - [ ] End-to-end auto-discovery workflow testing
  - [ ] Permission boundary validation with mock RBAC
  - [ ] Image registry integration with mock HTTP servers
  - [ ] Namespace isolation verification
  - [ ] CLI integration with existing support-bundle system

- [ ] **Performance Tests** **BENCHMARKED**
  - [ ] Large cluster discovery performance (1000+ resources)
  - [ ] Image metadata collection at scale with concurrent processing
  - [ ] Memory usage during auto-discovery with caching
  - [ ] CLI flag parsing and configuration loading performance

### Step-by-Step Implementation

#### Step 1: Set up Auto-Discovery Foundation
1. Create package structure: `pkg/collect/autodiscovery/`
2. Define `AutoCollector` interface with dual-path methods in `interfaces.go`
3. Implement `FoundationalDiscoverer` struct in `discoverer.go`
4. Define foundational collectors list (pods, deployments, services, configmaps, secrets, events, logs)
5. Add Kubernetes client initialization and configuration
6. Create unit tests for basic discovery functionality

#### Step 2: Implement Foundational Collection (Path 1)
1. Create `foundational.go` with predefined essential collector specs
2. Implement namespace-scoped resource enumeration for foundational resources
3. Add RBAC checking for each foundational collector type
4. Create deterministic resource expansion (same cluster → same collectors)
5. Add comprehensive unit tests for foundational collection

#### Step 3: Implement YAML Augmentation (Path 2)
1. Create `augmenter.go` to merge YAML collectors with foundational collectors
2. Implement deduplication logic (avoid collecting the same resource twice)
3. Add priority system (YAML specs override foundational specs when they conflict)
4. Create merger validation and conflict resolution
5. Add comprehensive unit tests for augmentation logic

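The merge, deduplication, and priority rules in Step 3 can be sketched as below. `CollectorSpec` here is a simplified, hypothetical stand-in for the real collector types, and identifying duplicates by kind, namespace, and name is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// CollectorSpec is a simplified stand-in for a collector definition.
type CollectorSpec struct {
	Kind      string // e.g. "logs", "configmap"
	Namespace string
	Name      string
	Source    string // "yaml" or "foundational"
}

// key identifies the resource a collector targets; two specs with the
// same key would collect the same data.
func (c CollectorSpec) key() string {
	return c.Kind + "/" + c.Namespace + "/" + c.Name
}

// Merge combines YAML-defined and foundational collectors. When both
// target the same resource, the YAML spec wins. Output is sorted by key
// so the same inputs always produce the same collector list.
func Merge(yamlSpecs, foundational []CollectorSpec) []CollectorSpec {
	byKey := make(map[string]CollectorSpec, len(yamlSpecs)+len(foundational))
	for _, c := range foundational {
		byKey[c.key()] = c
	}
	for _, c := range yamlSpecs {
		byKey[c.key()] = c // YAML overrides foundational on conflict
	}
	out := make([]CollectorSpec, 0, len(byKey))
	for _, c := range byKey {
		out = append(out, c)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key() < out[j].key() })
	return out
}

func main() {
	yamlSpecs := []CollectorSpec{
		{Kind: "logs", Namespace: "app", Name: "api", Source: "yaml"},
	}
	foundational := []CollectorSpec{
		{Kind: "logs", Namespace: "app", Name: "api", Source: "foundational"},
		{Kind: "events", Namespace: "app", Name: "all", Source: "foundational"},
	}
	for _, c := range Merge(yamlSpecs, foundational) {
		fmt.Printf("%s (%s)\n", c.key(), c.Source)
	}
}
```
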
#### Step 4: Build RBAC Checking Engine
1. Create `rbac_checker.go` with `SelfSubjectAccessReview` integration
2. Add permission caching with TTL for performance
3. Implement batch permission checking for efficiency
4. Add fallback modes for clusters with limited RBAC visibility
5. Create comprehensive RBAC test suite

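Permission caching with a TTL, as called for in Step 4, might look like the following sketch. The actual access check is injected as a function so the example stays self-contained; in the real checker that function would presumably submit a `SelfSubjectAccessReview`. All type and function names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// AccessRequest describes one permission question.
type AccessRequest struct {
	Verb, Resource, Namespace string
}

// CheckFunc performs the real (uncached) permission check.
type CheckFunc func(AccessRequest) (bool, error)

type cacheEntry struct {
	allowed bool
	expires time.Time
}

// CachedChecker memoizes permission results for a fixed TTL.
type CachedChecker struct {
	mu    sync.Mutex
	ttl   time.Duration
	check CheckFunc
	cache map[AccessRequest]cacheEntry
}

func NewCachedChecker(check CheckFunc, ttl time.Duration) *CachedChecker {
	return &CachedChecker{ttl: ttl, check: check, cache: map[AccessRequest]cacheEntry{}}
}

// Allowed returns a cached answer while it is fresh and otherwise calls
// the underlying check. Errors are returned to the caller and not cached.
// The lock is held during the check, which serializes lookups; a real
// implementation may want finer-grained locking.
func (c *CachedChecker) Allowed(req AccessRequest) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.cache[req]; ok && time.Now().Before(e.expires) {
		return e.allowed, nil
	}
	allowed, err := c.check(req)
	if err != nil {
		return false, err
	}
	c.cache[req] = cacheEntry{allowed: allowed, expires: time.Now().Add(c.ttl)}
	return allowed, nil
}

func main() {
	calls := 0
	checker := NewCachedChecker(func(r AccessRequest) (bool, error) {
		calls++
		return r.Resource != "secrets", nil // pretend secrets are forbidden
	}, time.Minute)

	req := AccessRequest{Verb: "list", Resource: "pods", Namespace: "default"}
	first, _ := checker.Allowed(req)
	second, _ := checker.Allowed(req)
	fmt.Println(first, second, "underlying checks:", calls)
}
```
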
#### Step 5: Add Image Metadata Collection
1. Create `pkg/collect/images/` package with registry client
2. Implement manifest parsing for Docker v2 and OCI formats
3. Add authentication support (Docker Hub, ECR, GCR, etc.)
4. Create `ImageFacts` generation from manifest data
5. Add error handling and retry logic for registry operations

#### Step 6: Integrate with Existing Collection Pipeline
1. Modify existing `pkg/collect/collect.go` to support auto-discovery modes
2. Add CLI integration for `--auto` flag (Path 1) and YAML+auto mode (Path 2)
3. Create seamless integration with existing collector framework
4. Add streaming integration with redaction pipeline
5. Create `facts.json` output format and writer
6. Implement progress reporting and user feedback
7. Add configuration validation and error reporting

---

## Component 2: Advanced Redaction with Tokenization

### Objective
Enhance the existing redaction system (currently in `pkg/redact/`) with tokenization capabilities, optional local LLM assistance, and reversible redaction mapping for data owners.

**Current State**: The codebase has a functional redaction system with:
- File-based redaction using regex patterns
- Multiple redactor types (`SingleLineRedactor`, `MultiLineRedactor`, `YamlRedactor`, etc.)
- Redaction tracking and reporting via `RedactionList`
- Integration with collection pipeline

### Requirements
- **Streaming redaction**: Enhance existing system to work as streaming step during collection
- **Tokenization**: Replace sensitive values with consistent tokens for traceability (new capability)
- **LLM assistance**: Optional local LLM for intelligent redaction detection (new capability)
- **Reversible mapping**: Generate `redaction-map.json` for token reversal by data owners (new capability)
- **Performance**: Maintain/improve performance of existing system for large support bundles
- **Profiles**: Extend existing redactor configuration with redaction profiles

### Technical Specifications

#### 2.1 Redaction Engine Architecture
**Location**: `pkg/redact/`

**Core Components**:
- `engine.go` - Main redaction orchestrator
- `tokenizer.go` - Token generation and mapping
- `processors/` - File type specific processors
- `llm/` - Local LLM integration (optional)
- `profiles/` - Pre-defined redaction profiles

**API Contract**:
```go
type RedactionEngine interface {
	ProcessStream(ctx context.Context, input io.Reader, output io.Writer, opts RedactionOptions) (*RedactionMap, error)
	GenerateTokens(ctx context.Context, values []string) (map[string]string, error)
	LoadProfile(name string) (*RedactionProfile, error)
}

type RedactionOptions struct {
	Profile        string
	EnableLLM      bool
	TokenPrefix    string
	StreamMode     bool
	PreserveFormat bool
}

type RedactionMap struct {
	Tokens    map[string]string `json:"tokens"`    // token -> original value
	Stats     RedactionStats    `json:"stats"`     // redaction statistics
	Timestamp time.Time         `json:"timestamp"` // when redaction was performed
	Profile   string            `json:"profile"`   // profile used
}
```

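To illustrate the streaming shape of `ProcessStream` in the contract above, here is a reduced sketch that redacts line by line between an `io.Reader` and an `io.Writer`. It omits profiles, tokens, and the `RedactionMap`. The regular expression and the mask text are illustrative assumptions, not the project's built-in redactors.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// secretPattern is an illustrative pattern for key=value secrets.
var secretPattern = regexp.MustCompile(`(?i)(password|token)=(\S+)`)

// RedactStream copies input to output one line at a time, masking
// matched values, and returns the number of redactions made. Only one
// line is held in memory at a time; a single very long line is still
// buffered in full.
func RedactStream(input io.Reader, output io.Writer) (int, error) {
	r := bufio.NewReader(input)
	w := bufio.NewWriter(output)
	redactions := 0
	for {
		line, err := r.ReadString('\n')
		if len(line) > 0 {
			redacted := secretPattern.ReplaceAllStringFunc(line, func(m string) string {
				redactions++
				key := m[:strings.Index(m, "=")]
				return key + "=***HIDDEN***"
			})
			if _, werr := w.WriteString(redacted); werr != nil {
				return redactions, werr
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return redactions, err
		}
	}
	return redactions, w.Flush()
}

// RedactString is a convenience wrapper for small inputs and tests.
func RedactString(s string) (string, int, error) {
	var buf bytes.Buffer
	n, err := RedactStream(strings.NewReader(s), &buf)
	return buf.String(), n, err
}

func main() {
	out, n, err := RedactString("user=admin\npassword=hunter2\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d redaction(s):\n%s", n, out)
}
```
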
#### 2.2 Tokenization System
**Location**: `pkg/redact/tokenizer.go`

**Features**:
- Consistent token generation for same values
- Configurable token formats and prefixes
- Token collision detection and resolution
- Metadata preservation (type hints, length preservation)

**Token Format**:
```
***TOKEN_<TYPE>_<HASH>***
Examples:
- ***TOKEN_PASSWORD_A1B2C3***
- ***TOKEN_EMAIL_X7Y8Z9***
- ***TOKEN_IP_D4E5F6***
```

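One way to get consistent tokens in the `***TOKEN_<TYPE>_<HASH>***` format is to derive the hash from a keyed HMAC of the value, as sketched below. The use of HMAC-SHA256, the per-bundle key, the six-character starting length, and the "lengthen on collision" rule are all assumptions for illustration; the document does not specify the hashing scheme.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Tokenizer maps sensitive values to stable tokens and remembers the
// reverse mapping (token -> original value).
type Tokenizer struct {
	key    []byte
	tokens map[string]string
}

// NewTokenizer takes a secret key so that tokens cannot be recomputed
// from guessed values without it (assumption: one key per bundle).
func NewTokenizer(key []byte) *Tokenizer {
	return &Tokenizer{key: key, tokens: map[string]string{}}
}

// Token returns the same token every time for the same kind and value.
// If a shortened hash is already taken by a different value, a longer
// prefix of the same digest is used.
func (t *Tokenizer) Token(kind, value string) (string, error) {
	mac := hmac.New(sha256.New, t.key)
	mac.Write([]byte(kind + "\x00" + value))
	digest := strings.ToUpper(hex.EncodeToString(mac.Sum(nil)))
	for n := 6; n <= len(digest); n += 2 {
		token := fmt.Sprintf("***TOKEN_%s_%s***", strings.ToUpper(kind), digest[:n])
		owner, taken := t.tokens[token]
		if !taken || owner == value {
			t.tokens[token] = value
			return token, nil
		}
	}
	return "", errors.New("could not resolve token collision")
}

// Reverse looks up the original value for a token.
func (t *Tokenizer) Reverse(token string) (string, bool) {
	v, ok := t.tokens[token]
	return v, ok
}

func main() {
	tk := NewTokenizer([]byte("per-bundle-secret"))
	a, _ := tk.Token("password", "hunter2")
	b, _ := tk.Token("password", "hunter2")
	c, _ := tk.Token("email", "ops@example.com")
	fmt.Println(a == b, a, c)
}
```

The `tokens` map is the data that would be serialized into the `tokens` field of `redaction-map.json`.
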
#### 2.3 LLM Integration (Optional)
**Location**: `pkg/redact/llm/`

**Supported Models**:
- Ollama integration for local models
- OpenAI compatible APIs
- Hugging Face transformers (via local API)

**LLM Tasks**:
- Intelligent sensitive data detection
- Context-aware redaction decisions
- False positive reduction
- Custom pattern learning

### Implementation Checklist

#### Phase 1: Enhanced Redaction Engine (Week 1-2)
- [ ] **Core Engine Refactoring**
  - [ ] Refactor existing `pkg/redact` to support streaming
  - [ ] Create new `RedactionEngine` interface
  - [ ] Implement streaming processor for different file types
  - [ ] Add configurable processing pipelines

- [ ] **Tokenization Implementation**
  - [ ] Create `Tokenizer` with consistent hash-based token generation
  - [ ] Implement token mapping and reverse lookup
  - [ ] Add token format configuration and validation
  - [ ] Create collision detection and resolution

- [ ] **File Type Processors**
  - [ ] Create specialized processors for JSON, YAML, logs, config files
  - [ ] Add context-aware redaction (e.g., preserve YAML structure)
  - [ ] Implement streaming processing for large files
  - [ ] Add error recovery and partial redaction support

- [ ] **Unit Testing**
  - [ ] Test `RedactionEngine` with various input stream types and sizes
  - [ ] Test `Tokenizer` consistency - same input produces same tokens
  - [ ] Test token collision detection and resolution algorithms
  - [ ] Test file type processors with malformed/corrupted input files
  - [ ] Test streaming redaction performance with large files (GB scale)
  - [ ] Test error recovery and partial redaction scenarios
  - [ ] Test redaction map generation and serialization
  - [ ] Test token format validation and configuration options

#### Phase 2: Redaction Profiles (Week 3)
- [ ] **Profile System**
  - [ ] Create `RedactionProfile` data structure and parser
  - [ ] Implement built-in profiles (minimal, standard, comprehensive, paranoid)
  - [ ] Add profile validation and testing
  - [ ] Create profile override and customization system

- [ ] **Profile Definitions**
  - [ ] **Minimal**: Basic passwords, API keys, tokens
  - [ ] **Standard**: + IP addresses, URLs, email addresses
  - [ ] **Comprehensive**: + usernames, hostnames, file paths
  - [ ] **Paranoid**: + any alphanumeric strings > 8 chars, custom patterns

- [ ] **Configuration**
  - [ ] Add profile selection to support bundle specs
  - [ ] Create profile inheritance and composition
  - [ ] Implement runtime profile switching
  - [ ] Add profile documentation and examples

- [ ] **Unit Testing**
  - [ ] Test redaction profile parsing and validation
  - [ ] Test profile inheritance and composition logic
  - [ ] Test built-in profiles (minimal, standard, comprehensive, paranoid)
  - [ ] Test custom profile creation and validation
  - [ ] Test profile override and customization mechanisms
  - [ ] Test runtime profile switching without state corruption
  - [ ] Test profile configuration serialization/deserialization
  - [ ] Test profile pattern matching accuracy and coverage

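The profile definitions above read as a chain in which each level adds to the previous one (the leading "+"). A sketch of resolving that inheritance follows. Modeling it with an `Extends` field, and the category names used, are assumptions; the real `RedactionProfile` schema is not defined in this document.

```go
package main

import "fmt"

// Profile lists the categories a profile adds on top of its parent.
type Profile struct {
	Extends string   // parent profile name, empty for none
	Detects []string // categories this profile adds
}

// builtin mirrors the four profiles described in the checklist.
var builtin = map[string]Profile{
	"minimal":       {Detects: []string{"passwords", "api-keys", "tokens"}},
	"standard":      {Extends: "minimal", Detects: []string{"ip-addresses", "urls", "emails"}},
	"comprehensive": {Extends: "standard", Detects: []string{"usernames", "hostnames", "file-paths"}},
	"paranoid":      {Extends: "comprehensive", Detects: []string{"long-alphanumeric-strings", "custom-patterns"}},
}

// Resolve returns every category active for a profile, parents first.
// It rejects unknown profiles and inheritance cycles.
func Resolve(name string, profiles map[string]Profile) ([]string, error) {
	var chain []Profile
	seen := map[string]bool{}
	for cur := name; cur != ""; {
		if seen[cur] {
			return nil, fmt.Errorf("profile inheritance cycle at %q", cur)
		}
		seen[cur] = true
		p, ok := profiles[cur]
		if !ok {
			return nil, fmt.Errorf("unknown profile %q", cur)
		}
		chain = append(chain, p)
		cur = p.Extends
	}
	var out []string
	for i := len(chain) - 1; i >= 0; i-- {
		out = append(out, chain[i].Detects...)
	}
	return out, nil
}

func main() {
	cats, err := Resolve("standard", builtin)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cats)
}
```
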
#### Phase 3: LLM Integration (Week 4)
- [ ] **LLM Framework**
  - [ ] Create `LLMProvider` interface for different backends
  - [ ] Implement Ollama integration for local models
  - [ ] Add OpenAI-compatible API client
  - [ ] Create fallback modes when LLM is unavailable

- [ ] **Intelligent Detection**
  - [ ] Design prompts for sensitive data detection
  - [ ] Implement confidence scoring for LLM suggestions
  - [ ] Add human-readable explanation generation
  - [ ] Create feedback loop for improving detection

- [ ] **Privacy & Security**
  - [ ] Ensure LLM processing respects data locality
  - [ ] Add data minimization for LLM requests
  - [ ] Implement secure prompt injection prevention
  - [ ] Create audit logging for LLM interactions

- [ ] **Unit Testing**
  - [ ] Test `LLMProvider` interface implementations for different backends
  - [ ] Test LLM prompt generation and response parsing
  - [ ] Test confidence scoring algorithms for LLM suggestions
  - [ ] Test fallback mechanisms when LLM services are unavailable
  - [ ] Test prompt injection prevention with malicious inputs
  - [ ] Test data minimization - only necessary data sent to LLM
  - [ ] Test LLM response validation and sanitization
  - [ ] Test audit logging completeness and security

#### Phase 4: Integration & Artifacts (Week 5)
- [ ] **Collection Integration**
  - [ ] Integrate redaction engine into collection pipeline
  - [ ] Add streaming redaction during data collection
  - [ ] Implement progress reporting for redaction operations
  - [ ] Add redaction statistics and reporting

- [ ] **Artifact Generation**
  - [ ] Implement `redaction-map.json` generation and format
  - [ ] Add redaction statistics to support bundle metadata
  - [ ] Create redaction audit trail and logging
  - [ ] Implement secure token storage and encryption options

- [ ] **Unit Testing**
  - [ ] Test redaction integration with existing collection pipeline
  - [ ] Test streaming redaction performance during data collection
  - [ ] Test progress reporting accuracy and timing
  - [ ] Test `redaction-map.json` format compliance and validation
  - [ ] Test redaction statistics calculation and accuracy
  - [ ] Test redaction audit trail completeness
  - [ ] Test secure token storage encryption/decryption
  - [ ] Test error handling during redaction pipeline failures

### Testing Strategy
- [ ] **Unit Tests**
  - [ ] Token generation and collision handling
  - [ ] File type processor accuracy
  - [ ] Profile loading and validation
  - [ ] LLM integration mocking

- [ ] **Integration Tests**
  - [ ] End-to-end redaction with real support bundles
  - [ ] LLM provider integration testing
  - [ ] Performance testing with large files
  - [ ] Streaming redaction pipeline validation

- [ ] **Security Tests**
  - [ ] Token uniqueness and unpredictability
  - [ ] Redaction completeness verification
  - [ ] Information leakage prevention
  - [ ] LLM prompt injection resistance

### Step-by-Step Implementation

#### Step 1: Streaming Redaction Foundation
1. Analyze existing redaction code in `pkg/redact`
2. Design streaming architecture with `io.Reader`/`io.Writer` interfaces
3. Create `RedactionEngine` interface and base implementation
4. Implement file type detection and routing
5. Add comprehensive unit tests for streaming operations

#### Step 2: Tokenization System
1. Create `Tokenizer` with hash-based consistent token generation
2. Implement token mapping data structures and serialization
3. Add token format configuration and validation
4. Create collision detection and resolution algorithms
5. Add comprehensive testing for token consistency and security

#### Step 3: File Type Processors
1. Create processor interface and registry system
2. Implement JSON processor with path-aware redaction
3. Add YAML processor with structure preservation
4. Create log file processor with context awareness
5. Add configuration file processors for common formats

#### Step 4: Redaction Profiles
1. Design profile schema and configuration format
2. Implement built-in profile definitions
3. Create profile loading, validation, and inheritance system
4. Add profile documentation and examples
5. Create comprehensive profile testing suite

#### Step 5: LLM Integration (Optional)
1. Create LLM provider interface and abstraction layer
2. Implement Ollama integration for local models
3. Design prompts for sensitive data detection
4. Add confidence scoring and human-readable explanations
5. Create comprehensive privacy and security safeguards

#### Step 6: Integration and Artifacts
1. Integrate redaction engine into support bundle collection
2. Implement `redaction-map.json` generation and format
3. Add CLI flags for redaction options and profiles
4. Create comprehensive documentation and examples
5. Add performance monitoring and optimization

---

## Component 3: Agent-Based Analysis

### Objective
Enhance the existing analysis system (currently in `pkg/analyze/`) with agent-based capabilities and analyzer generation from requirements. This addresses the overview requirement for "Analyzer via agents (local/hosted) and 'generate analyzers from requirements'".

**Current State**: The codebase has a comprehensive analysis system with:
- 60+ built-in analyzers for various Kubernetes resources and conditions
- Host analyzers for system-level checks
- Structured analyzer results (`AnalyzeResult` type)
- Analysis download and local bundle processing
- Integration with support bundle collection
- JSON/YAML output formatting

### Requirements
- **Agent abstraction**: Wrap existing analyzers and support local, hosted, and future agent types
- **Analyzer generation**: Create analyzers from requirement specifications (new capability)
- **Analysis artifacts**: Enhance existing results to generate structured `analysis.json` with remediation
- **Offline capability**: Maintain current local analysis capabilities
- **Extensibility**: Add plugin architecture for custom analysis engines while preserving existing analyzers

### Technical Specifications

#### 3.1 Analysis Engine Architecture
**Location**: `pkg/analyze/`

**Core Components**:
- `engine.go` - Analysis orchestrator
- `agents/` - Agent implementations (local, hosted, custom)
- `generators/` - Analyzer generation from requirements
- `artifacts/` - Analysis result formatting and serialization

**API Contract**:
```go
type AnalysisEngine interface {
	Analyze(ctx context.Context, bundle *SupportBundle, opts AnalysisOptions) (*AnalysisResult, error)
	GenerateAnalyzers(ctx context.Context, requirements *RequirementSpec) ([]AnalyzerSpec, error)
	RegisterAgent(name string, agent Agent) error
}

type Agent interface {
	Name() string
	Analyze(ctx context.Context, data []byte, analyzers []AnalyzerSpec) (*AgentResult, error)
	HealthCheck(ctx context.Context) error
	Capabilities() []string
}

type AnalysisResult struct {
	Results     []AnalyzerResult  `json:"results"`
	Remediation []RemediationStep `json:"remediation"`
	Summary     AnalysisSummary   `json:"summary"`
	Metadata    AnalysisMetadata  `json:"metadata"`
}
```

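A registry behind `RegisterAgent`, combined with `HealthCheck`, gives a simple way to prefer one agent and fall back to another. The sketch below uses a reduced `Agent` interface (only `Name` and `HealthCheck`) so it stays self-contained. `Registry`, `FirstHealthy`, `pickAgent`, and `staticAgent` are hypothetical names, not part of the contract above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Agent is a reduced version of the contract's Agent interface.
type Agent interface {
	Name() string
	HealthCheck(ctx context.Context) error
}

// Registry stores agents by name.
type Registry struct {
	agents map[string]Agent
}

func NewRegistry() *Registry {
	return &Registry{agents: map[string]Agent{}}
}

// RegisterAgent adds an agent and rejects empty or duplicate names.
func (r *Registry) RegisterAgent(name string, agent Agent) error {
	if name == "" || agent == nil {
		return errors.New("agent name and implementation are required")
	}
	if _, exists := r.agents[name]; exists {
		return fmt.Errorf("agent %q is already registered", name)
	}
	r.agents[name] = agent
	return nil
}

// FirstHealthy returns the first agent, in preference order, whose
// health check passes. Unregistered names are skipped.
func (r *Registry) FirstHealthy(ctx context.Context, preferred ...string) (Agent, error) {
	for _, name := range preferred {
		agent, ok := r.agents[name]
		if !ok {
			continue
		}
		if err := agent.HealthCheck(ctx); err == nil {
			return agent, nil
		}
	}
	return nil, errors.New("no healthy agent available")
}

// staticAgent is a test double with a fixed health state.
type staticAgent struct {
	name    string
	healthy bool
}

func (a staticAgent) Name() string { return a.name }

func (a staticAgent) HealthCheck(context.Context) error {
	if a.healthy {
		return nil
	}
	return fmt.Errorf("agent %q is unreachable", a.name)
}

// pickAgent resolves the preferred agent name using a background context.
func pickAgent(r *Registry, preferred ...string) (string, error) {
	agent, err := r.FirstHealthy(context.Background(), preferred...)
	if err != nil {
		return "", err
	}
	return agent.Name(), nil
}

func main() {
	reg := NewRegistry()
	_ = reg.RegisterAgent("hosted", staticAgent{name: "hosted", healthy: false})
	_ = reg.RegisterAgent("local", staticAgent{name: "local", healthy: true})
	fmt.Println(pickAgent(reg, "hosted", "local"))
}
```
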
#### 3.2 Agent Types

##### 3.2.1 Local Agent
**Location**: `pkg/analyze/agents/local/`

**Features**:
- Built-in analyzer implementations
- No external dependencies
- Fast execution and offline capability
- Extensible through plugins

##### 3.2.2 Hosted Agent
**Location**: `pkg/analyze/agents/hosted/`

**Features**:
- REST API integration with hosted analysis services
- Advanced ML/AI capabilities
- Cloud-scale processing
- Authentication and rate limiting

##### 3.2.3 LLM Agent (Optional)
**Location**: `pkg/analyze/agents/llm/`

**Features**:
- Local or cloud LLM integration
- Natural language analysis descriptions
- Context-aware remediation suggestions
- Multi-modal analysis (text, logs, configs)

#### 3.3 Analyzer Generation
**Location**: `pkg/analyze/generators/`

**Requirements-to-Analyzers Mapping**:
```go
type RequirementSpec struct {
	APIVersion string                 `json:"apiVersion"`
	Kind       string                 `json:"kind"`
	Metadata   RequirementMetadata    `json:"metadata"`
	Spec       RequirementSpecDetails `json:"spec"`
}

type RequirementSpecDetails struct {
	Kubernetes KubernetesRequirements `json:"kubernetes"`
	Resources  ResourceRequirements   `json:"resources"`
	Storage    StorageRequirements    `json:"storage"`
	Network    NetworkRequirements    `json:"network"`
	Custom     []CustomRequirement    `json:"custom"`
}
```

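Rule-based generation from a requirement spec can be as simple as emitting one analyzer per populated requirement field. The sketch below assumes hypothetical fields (`MinVersion`, `MinNodes`, `MinMemoryGiB`) and a hypothetical flat `AnalyzerSpec`; the `FailWhen` expressions are modeled loosely on the existing `clusterVersion` and `nodeResources` analyzers and should be treated as illustrative rather than as the generator's actual output.

```go
package main

import "fmt"

// The requirement types below are reduced, hypothetical versions of the
// structures named in the mapping above.
type KubernetesRequirements struct {
	MinVersion string // e.g. "1.26.0"
}

type ResourceRequirements struct {
	MinNodes     int
	MinMemoryGiB int
}

type RequirementSpecDetails struct {
	Kubernetes KubernetesRequirements
	Resources  ResourceRequirements
}

// AnalyzerSpec is a hypothetical flat description of a generated analyzer.
type AnalyzerSpec struct {
	Analyzer string // analyzer type to run
	FailWhen string // condition under which the check fails
	Message  string
}

// GenerateAnalyzers emits one analyzer per requirement that is set.
// Zero values mean "no requirement" and produce nothing.
func GenerateAnalyzers(req RequirementSpecDetails) []AnalyzerSpec {
	var out []AnalyzerSpec
	if v := req.Kubernetes.MinVersion; v != "" {
		out = append(out, AnalyzerSpec{
			Analyzer: "clusterVersion",
			FailWhen: "< " + v,
			Message:  "Kubernetes " + v + " or later is required",
		})
	}
	if n := req.Resources.MinNodes; n > 0 {
		out = append(out, AnalyzerSpec{
			Analyzer: "nodeResources",
			FailWhen: fmt.Sprintf("count() < %d", n),
			Message:  fmt.Sprintf("at least %d nodes are required", n),
		})
	}
	if m := req.Resources.MinMemoryGiB; m > 0 {
		out = append(out, AnalyzerSpec{
			Analyzer: "nodeResources",
			FailWhen: fmt.Sprintf("min(memoryCapacity) < %dGi", m),
			Message:  fmt.Sprintf("every node needs at least %dGi of memory", m),
		})
	}
	return out
}

func main() {
	specs := GenerateAnalyzers(RequirementSpecDetails{
		Kubernetes: KubernetesRequirements{MinVersion: "1.26.0"},
		Resources:  ResourceRequirements{MinNodes: 3, MinMemoryGiB: 8},
	})
	for _, s := range specs {
		fmt.Printf("%s fails when %q: %s\n", s.Analyzer, s.FailWhen, s.Message)
	}
}
```
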
### Implementation Checklist

#### Phase 1: Analysis Engine Foundation (Week 1-2)
- [ ] **Engine Architecture**
  - [ ] Create `pkg/analyze/` package structure
  - [ ] Design and implement `AnalysisEngine` interface
  - [ ] Create agent registry and management system
  - [ ] Add analysis result formatting and serialization

- [ ] **Local Agent Implementation**
  - [ ] Create `LocalAgent` with built-in analyzer implementations
  - [ ] Port existing analyzer logic to new agent framework
  - [ ] Add plugin loading system for custom analyzers
  - [ ] Implement performance optimization and caching

- [ ] **Analysis Artifacts**
  - [ ] Design `analysis.json` schema and format
  - [ ] Implement result aggregation and summarization
  - [ ] Add analysis metadata and provenance tracking
  - [ ] Create structured error handling and reporting

- [ ] **Unit Testing**
  - [ ] Test `AnalysisEngine` interface implementations
  - [ ] Test agent registry and management system functionality
  - [ ] Test `LocalAgent` with various built-in analyzers
  - [ ] Test analysis result formatting and serialization
  - [ ] Test result aggregation algorithms and accuracy
  - [ ] Test error handling for malformed analyzer inputs
  - [ ] Test analysis metadata and provenance tracking
  - [ ] Test plugin loading system with mock plugins

#### Phase 2: Hosted Agent Integration (Week 3)
- [ ] **Hosted Agent Framework**
  - [ ] Create `HostedAgent` with REST API integration
  - [ ] Implement authentication and authorization
  - [ ] Add rate limiting and retry logic
  - [ ] Create configuration management for hosted endpoints

- [ ] **API Integration**
  - [ ] Design hosted agent API specification
  - [ ] Implement request/response handling
  - [ ] Add data serialization and compression
  - [ ] Create secure credential management

- [ ] **Fallback Mechanisms**
  - [ ] Implement graceful degradation when hosted agents are unavailable
  - [ ] Add local fallback for critical analyzers
  - [ ] Create hybrid analysis modes
  - [ ] Add user notification for service limitations

- [ ] **Unit Testing**
  - [ ] Test `HostedAgent` REST API integration with mock servers
  - [ ] Test authentication and authorization with various providers
  - [ ] Test rate limiting and retry logic with simulated failures
  - [ ] Test request/response handling and data serialization
  - [ ] Test fallback mechanisms when hosted agents are unavailable
  - [ ] Test hybrid analysis mode coordination and result merging
  - [ ] Test secure credential management and rotation
  - [ ] Test analysis quality assessment algorithms

#### Phase 3: Analyzer Generation (Week 4)
- [ ] **Requirements Parser**
  - [ ] Create `RequirementSpec` parser and validator
  - [ ] Implement requirement categorization and mapping
  - [ ] Add support for vendor and Replicated requirement specs
  - [ ] Create requirement merging and conflict resolution

- [ ] **Generator Framework**
  - [ ] Design analyzer generation templates
  - [ ] Implement rule-based analyzer creation
  - [ ] Add analyzer validation and testing
  - [ ] Create generated analyzer documentation

- [ ] **Integration**
  - [ ] Integrate generator with analysis engine
  - [ ] Add CLI flags for analyzer generation
  - [ ] Create generated analyzer debugging and validation
  - [ ] Add generator configuration and customization

- [ ] **Unit Testing**
  - [ ] Test requirement specification parsing with various input formats
  - [ ] Test analyzer generation from requirement specifications
  - [ ] Test requirement-to-analyzer mapping algorithms
  - [ ] Test custom analyzer template generation and validation
  - [ ] Test analyzer code generation quality and correctness
  - [ ] Test generated analyzer testing and validation frameworks
  - [ ] Test requirement specification validation and error reporting
  - [ ] Test analyzer generation performance and scalability

#### Phase 4: Remediation & Advanced Features (Week 5)
- [ ] **Remediation System**
  - [ ] Design `RemediationStep` data structure
  - [ ] Implement remediation suggestion generation
  - [ ] Add remediation prioritization and categorization
  - [ ] Create remediation execution framework (future)

- [ ] **Advanced Analysis**
  - [ ] Add cross-analyzer correlation and insights
  - [ ] Implement trend analysis and historical comparison
  - [ ] Create analysis confidence scoring
  - [ ] Add analysis explanation and reasoning

- [ ] **Unit Testing**
  - [ ] Test `RemediationStep` data structure and serialization
  - [ ] Test remediation suggestion generation algorithms
  - [ ] Test remediation prioritization and categorization logic
  - [ ] Test cross-analyzer correlation algorithms
  - [ ] Test trend analysis and historical comparison accuracy
  - [ ] Test analysis confidence scoring calculations
  - [ ] Test analysis explanation and reasoning generation
  - [ ] Test remediation framework extensibility and plugin system

### Testing Strategy
- [ ] **Unit Tests**
  - [ ] Agent interface compliance
  - [ ] Analysis result serialization
  - [ ] Analyzer generation logic
  - [ ] Remediation suggestion accuracy

- [ ] **Integration Tests**
  - [ ] End-to-end analysis with real support bundles
  - [ ] Hosted agent API integration
  - [ ] Analyzer generation from real requirements
  - [ ] Multi-agent analysis coordination

- [ ] **Performance Tests**
  - [ ] Large support bundle analysis performance
  - [ ] Concurrent agent execution
  - [ ] Memory usage during analysis
  - [ ] Hosted agent latency and throughput

### Step-by-Step Implementation

#### Step 1: Analysis Engine Foundation
1. Create package structure: `pkg/analyze/`
2. Define `AnalysisEngine` and `Agent` interfaces
3. Implement basic analysis orchestration
4. Create agent registry and management
5. Add comprehensive unit tests

#### Step 2: Local Agent Implementation
1. Create `LocalAgent` struct and implementation
2. Port existing analyzer logic to agent framework
3. Add plugin system for custom analyzers
4. Implement result caching and optimization
5. Create comprehensive test suite

#### Step 3: Analysis Artifacts
1. Design `analysis.json` schema and validation
2. Implement result serialization and formatting
3. Add analysis metadata and provenance
4. Create structured error handling
5. Add comprehensive format validation

#### Step 4: Hosted Agent Integration
1. Create `HostedAgent` with REST API client
2. Implement authentication and rate limiting
3. Add fallback and error handling
4. Create configuration management
5. Add integration testing with mock services

#### Step 5: Analyzer Generation
1. Create `RequirementSpec` parser and validator
2. Implement analyzer generation templates
3. Add rule-based analyzer creation logic
4. Create analyzer validation and testing
5. Add comprehensive generation testing

#### Step 6: Remediation System
1. Design remediation data structures
2. Implement suggestion generation algorithms
3. Add remediation prioritization and categorization
4. Create comprehensive documentation
5. Add remediation testing and validation

---

## Component 4: Support Bundle Differencing

### Objective
Implement support bundle comparison and differencing to track changes over time and identify issues by comparing bundles. This is a new capability that is not present in the current codebase.

**Current State**: The codebase has support bundle parsing utilities in `pkg/supportbundle/parse.go` that can extract and read bundle contents, but no comparison or differencing capabilities.

### Requirements
- **Bundle comparison**: Compare two support bundles with detailed diff output (new)
- **Change categorization**: Categorize changes by type and impact (new)
- **Diff artifacts**: Generate structured `diff.json` for programmatic consumption (new)
- **Visualization**: Human-readable diff reports (new)
- **Performance**: Handle large bundles efficiently using existing parsing utilities

### Technical Specifications

#### 4.1 Diff Engine Architecture
**Location**: `pkg/supportbundle/diff/`

**Core Components**:
- `engine.go` - Main diff orchestrator
- `comparators/` - Type-specific comparison logic
- `formatters/` - Output formatting (JSON, HTML, text)
- `filters/` - Diff filtering and noise reduction

**API Contract**:
```go
type DiffEngine interface {
	Compare(ctx context.Context, oldBundle, newBundle *SupportBundle, opts DiffOptions) (*BundleDiff, error)
	GenerateReport(ctx context.Context, diff *BundleDiff, format string) (io.Reader, error)
}

type BundleDiff struct {
	Summary      DiffSummary        `json:"summary"`
	Changes      []Change           `json:"changes"`
	Metadata     DiffMetadata       `json:"metadata"`
	Significance SignificanceReport `json:"significance"`
}

type Change struct {
	Type        ChangeType       `json:"type"`     // added, removed, modified
	Category    string           `json:"category"` // resource, log, config, etc.
	Path        string           `json:"path"`     // file path or resource path
	Impact      ImpactLevel      `json:"impact"`   // high, medium, low, none
	Details     map[string]any   `json:"details"`  // change-specific details
	Remediation *RemediationStep `json:"remediation,omitempty"`
}
```

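The simplest layer of `Compare` is file-level change detection, producing `added`, `removed`, and `modified` entries keyed by path. The sketch below represents each bundle as an in-memory map from path to content, which is an assumption for brevity; a real implementation would more likely stream files from the archive and compare hashes. `Change` here is reduced to two fields.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// ChangeType mirrors the values named in the contract's Change.Type comment.
type ChangeType string

const (
	Added    ChangeType = "added"
	Removed  ChangeType = "removed"
	Modified ChangeType = "modified"
)

// Change is a reduced version of the contract's Change struct.
type Change struct {
	Type ChangeType
	Path string
}

// DiffFiles compares two bundles represented as path -> content maps and
// returns the changes sorted by path so output is stable across runs.
func DiffFiles(oldFiles, newFiles map[string][]byte) []Change {
	var changes []Change
	for p, oldData := range oldFiles {
		newData, ok := newFiles[p]
		switch {
		case !ok:
			changes = append(changes, Change{Type: Removed, Path: p})
		case !bytes.Equal(oldData, newData):
			changes = append(changes, Change{Type: Modified, Path: p})
		}
	}
	for p := range newFiles {
		if _, ok := oldFiles[p]; !ok {
			changes = append(changes, Change{Type: Added, Path: p})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Path < changes[j].Path })
	return changes
}

func main() {
	oldFiles := map[string][]byte{
		"cluster-info/version.json": []byte(`{"minor":"27"}`),
		"logs/app.log":              []byte("ok"),
	}
	newFiles := map[string][]byte{
		"cluster-info/version.json": []byte(`{"minor":"28"}`),
		"logs/app.log":              []byte("ok"),
		"logs/worker.log":           []byte("started"),
	}
	for _, c := range DiffFiles(oldFiles, newFiles) {
		fmt.Println(c.Type, c.Path)
	}
}
```
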
#### 4.2 Comparison Types

##### 4.2.1 Resource Comparisons
- Kubernetes resource specifications
- Resource status and health changes
- Configuration drift detection
- RBAC and security policy changes

##### 4.2.2 Log Comparisons
- Error pattern analysis
- Log volume and frequency changes
- New error types and patterns
- Performance metric changes

##### 4.2.3 Configuration Comparisons
- Configuration file changes
- Environment variable differences
- Secret and ConfigMap modifications
- Application configuration drift

### Implementation Checklist

#### Phase 1: Diff Engine Foundation (Week 1-2)
- [ ] **Core Engine**
  - [ ] Create `pkg/supportbundle/diff/` package structure
  - [ ] Implement `DiffEngine` interface and base implementation
  - [ ] Create bundle loading and parsing utilities
  - [ ] Add diff metadata and tracking

- [ ] **Change Detection**
  - [ ] Implement file-level change detection
  - [ ] Create content comparison utilities
  - [ ] Add change categorization and classification
  - [ ] Implement impact assessment algorithms

- [ ] **Data Structures**
  - [ ] Define `BundleDiff` and related data structures
  - [ ] Create change serialization and deserialization
  - [ ] Add diff statistics and summary generation
  - [ ] Implement diff validation and consistency checks

- [ ] **Unit Testing**
  - [ ] Test `DiffEngine` with various support bundle pairs
  - [ ] Test bundle loading and parsing utilities with different formats
  - [ ] Test file-level change detection algorithms
  - [ ] Test content comparison utilities with binary and text files
  - [ ] Test change categorization and classification accuracy
  - [ ] Test `BundleDiff` data structure serialization/deserialization
  - [ ] Test diff statistics calculation and accuracy
  - [ ] Test diff validation and consistency check algorithms

#### Phase 2: Specialized Comparators (Week 3)
- [ ] **Resource Comparator**
  - [ ] Create Kubernetes resource diff logic
  - [ ] Add YAML/JSON structural comparison
  - [ ] Implement semantic resource analysis
  - [ ] Add resource health status comparison

- [ ] **Log Comparator**
  - [ ] Create log file comparison utilities
  - [ ] Add error pattern extraction and comparison
  - [ ] Implement log volume analysis
  - [ ] Create performance metric comparison

- [ ] **Configuration Comparator**
  - [ ] Add configuration file diff logic
  - [ ] Create environment variable comparison
  - [ ] Implement secret and sensitive data handling
  - [ ] Add configuration drift detection

- [ ] **Unit Testing**
  - [ ] Test Kubernetes resource diff logic with various resource types
  - [ ] Test YAML/JSON structural comparison algorithms
  - [ ] Test semantic resource analysis and health status comparison
  - [ ] Test log file comparison utilities with different log formats
  - [ ] Test error pattern extraction and comparison accuracy
  - [ ] Test log volume analysis algorithms
  - [ ] Test configuration file diff logic with various config formats
  - [ ] Test sensitive data handling in configuration comparisons

#### Phase 3: Output and Visualization (Week 4)
|
|
- [ ] **Diff Artifacts**
|
|
- [ ] Implement `diff.json` generation and format
|
|
- [ ] Add diff metadata and provenance
|
|
- [ ] Create diff validation and schema
|
|
- [ ] Add diff compression and storage
|
|
|
|
- [ ] **Report Generation**
|
|
- [ ] Create HTML diff reports with visualization
|
|
- [ ] Add interactive diff navigation and filtering
|
|
- [ ] Implement diff report customization and theming
|
|
- [ ] Create diff report export and sharing capabilities
|
|
|
|
- [ ] **Unit Testing**
|
|
- [ ] Test `diff.json` generation and format validation
|
|
- [ ] Test diff metadata and provenance tracking
|
|
- [ ] Test diff compression and storage mechanisms
|
|
- [ ] Test HTML diff report generation with various diff types
|
|
- [ ] Test interactive diff navigation functionality
|
|
- [ ] Test diff report customization and theming options
|
|
- [ ] Test diff visualization accuracy and clarity
|
|
- [ ] Test diff report export formats and compatibility
|
|
- [ ] Add text-based diff output
|
|
- [ ] Implement diff filtering and noise reduction
|
|
- [ ] Create diff summary and executive reports
|
|
|
|
#### Phase 4: CLI Integration (Week 5)
|
|
- [ ] **Command Implementation**
|
|
- [ ] Add `support-bundle diff` command
|
|
- [ ] Implement command-line argument parsing
|
|
- [ ] Add progress reporting and user feedback
|
|
- [ ] Create diff command validation and error handling
|
|
|
|
- [ ] **Configuration**
|
|
- [ ] Add diff configuration and profiles
|
|
- [ ] Create diff ignore patterns and filters
|
|
- [ ] Implement diff output customization
|
|
- [ ] Add diff performance optimization options
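
The ignore patterns item could be prototyped with the standard library's `path.Match`, as in the sketch below. The matching rule (patterns without a slash apply to the file name only, patterns with a slash apply to the full bundle path) is an assumption for this example and not an agreed CLI contract; `ShouldIgnore` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ShouldIgnore reports whether bundlePath matches any ignore pattern.
// Patterns use path.Match syntax, where "*" does not cross "/" boundaries.
// A pattern without "/" is matched against the base name only, so "*.log"
// ignores log files at any depth.
func ShouldIgnore(bundlePath string, patterns []string) (bool, error) {
	for _, pattern := range patterns {
		target := bundlePath
		if !strings.Contains(pattern, "/") {
			target = path.Base(bundlePath)
		}
		matched, err := path.Match(pattern, target)
		if err != nil {
			return false, fmt.Errorf("invalid ignore pattern %q: %w", pattern, err)
		}
		if matched {
			return true, nil
		}
	}
	return false, nil
}
```

For example, `ShouldIgnore("pods/logs/app.log", []string{"*.log"})` reports a match, while the same pattern leaves `pods/default.json` in the diff.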
|
|
|
|
### Step-by-Step Implementation
|
|
|
|
#### Step 1: Diff Engine Foundation
|
|
1. Create package structure: `pkg/supportbundle/diff/`
|
|
2. Design `DiffEngine` interface and core data structures
|
|
3. Implement basic bundle loading and parsing
|
|
4. Create change detection algorithms
|
|
5. Add comprehensive unit tests
|
|
|
|
#### Step 2: Change Detection and Classification
|
|
1. Implement file-level change detection
|
|
2. Create content comparison utilities with different strategies
|
|
3. Add change categorization and impact assessment
|
|
4. Create change significance scoring
|
|
5. Add comprehensive classification testing
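
File-level change detection (item 1 above) can be sketched by comparing content digests per path. The example assumes each bundle has already been loaded into a map of path to file contents; `FileChange`, `Digests`, and `DetectFileChanges` are illustrative names, and categorization and significance scoring are left out.

```go
package main

import (
	"crypto/sha256"
	"sort"
)

// Digest is a SHA-256 content digest; arrays are comparable with ==.
type Digest = [sha256.Size]byte

// FileChange records one file-level difference between two bundles.
type FileChange struct {
	Path string
	Kind string // "added", "removed", or "modified"
}

// Digests hashes every file so bundles can be compared without
// keeping both copies of the contents in memory afterwards.
func Digests(files map[string][]byte) map[string]Digest {
	out := make(map[string]Digest, len(files))
	for p, content := range files {
		out[p] = sha256.Sum256(content)
	}
	return out
}

// DetectFileChanges compares two digest maps and returns the changes
// sorted by path. Unchanged files are omitted.
func DetectFileChanges(oldFiles, newFiles map[string]Digest) []FileChange {
	var changes []FileChange
	for p, oldSum := range oldFiles {
		newSum, ok := newFiles[p]
		switch {
		case !ok:
			changes = append(changes, FileChange{Path: p, Kind: "removed"})
		case oldSum != newSum:
			changes = append(changes, FileChange{Path: p, Kind: "modified"})
		}
	}
	for p := range newFiles {
		if _, ok := oldFiles[p]; !ok {
			changes = append(changes, FileChange{Path: p, Kind: "added"})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Path < changes[j].Path })
	return changes
}
```

Only files reported as modified then need the more expensive content comparison strategies from item 2.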
|
|
|
|
#### Step 3: Specialized Comparators
|
|
1. Create comparator interface and registry
|
|
2. Implement resource comparator with semantic analysis
|
|
3. Add log comparator with pattern analysis
|
|
4. Create configuration comparator with drift detection
|
|
5. Add comprehensive comparator testing
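
One possible shape for the comparator interface and registry in item 1 is sketched below. The `Comparator` method set, the suffix-based matching, and the byte-level fallback are assumptions for illustration; the real comparators would carry the resource, log, and configuration logic described above.

```go
package main

import (
	"bytes"
	"strings"
)

// Comparator is a hypothetical interface for specialized comparators.
type Comparator interface {
	Name() string
	Matches(path string) bool
	// Changed reports whether the two versions differ for this comparator.
	Changed(oldData, newData []byte) (bool, error)
}

// suffixComparator selects files by suffix and compares raw bytes.
// It stands in for a real resource or log comparator.
type suffixComparator struct {
	name   string
	suffix string
}

func (c suffixComparator) Name() string          { return c.name }
func (c suffixComparator) Matches(p string) bool { return strings.HasSuffix(p, c.suffix) }
func (c suffixComparator) Changed(oldData, newData []byte) (bool, error) {
	return !bytes.Equal(oldData, newData), nil
}

// Registry holds comparators in registration order plus a fallback.
type Registry struct {
	comparators []Comparator
	fallback    Comparator
}

// NewRegistry creates a registry whose fallback compares raw bytes.
func NewRegistry() *Registry {
	return &Registry{fallback: suffixComparator{name: "bytes", suffix: ""}}
}

// Register adds a comparator; earlier registrations take precedence.
func (r *Registry) Register(c Comparator) {
	r.comparators = append(r.comparators, c)
}

// For returns the first comparator that matches the path, or the fallback.
func (r *Registry) For(path string) Comparator {
	for _, c := range r.comparators {
		if c.Matches(path) {
			return c
		}
	}
	return r.fallback
}
```

Keeping selection in the registry means the diff engine only asks `For(path)` and does not need to know which comparators exist.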
|
|
|
|
#### Step 4: Output Generation
|
|
1. Implement `diff.json` schema and serialization
|
|
2. Create HTML report generation with visualization
|
|
3. Add text-based diff formatting
|
|
4. Create diff filtering and noise reduction
|
|
5. Add comprehensive output validation
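
To make the `diff.json` serialization and validation steps concrete, here is a minimal sketch. The field names and JSON layout are placeholders only: the actual schema is owned by Person 1 under the schema contract, so `BundleDiff`, `Change`, and `Summary` as written here are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Change and Summary use hypothetical field names for illustration.
type Change struct {
	Path string `json:"path"`
	Kind string `json:"kind"` // "added", "removed", or "modified"
}

type Summary struct {
	Added    int `json:"added"`
	Removed  int `json:"removed"`
	Modified int `json:"modified"`
}

type BundleDiff struct {
	OldBundle string   `json:"oldBundle"`
	NewBundle string   `json:"newBundle"`
	Changes   []Change `json:"changes"`
	Summary   Summary  `json:"summary"`
}

// Summarize derives the statistics from the change list.
func Summarize(changes []Change) Summary {
	var s Summary
	for _, c := range changes {
		switch c.Kind {
		case "added":
			s.Added++
		case "removed":
			s.Removed++
		case "modified":
			s.Modified++
		}
	}
	return s
}

// Validate checks that every kind is known and that the stored
// summary is consistent with the change list.
func (d BundleDiff) Validate() error {
	for _, c := range d.Changes {
		if c.Kind != "added" && c.Kind != "removed" && c.Kind != "modified" {
			return fmt.Errorf("change %q has unknown kind %q", c.Path, c.Kind)
		}
	}
	if d.Summary != Summarize(d.Changes) {
		return fmt.Errorf("summary does not match change list")
	}
	return nil
}

// Encode validates the diff and renders it as indented JSON.
func Encode(d BundleDiff) ([]byte, error) {
	if err := d.Validate(); err != nil {
		return nil, err
	}
	return json.MarshalIndent(d, "", "  ")
}

// Decode parses diff JSON, rejecting unknown fields and inconsistent data.
func Decode(data []byte) (BundleDiff, error) {
	var d BundleDiff
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&d); err != nil {
		return BundleDiff{}, fmt.Errorf("decode diff: %w", err)
	}
	if err := d.Validate(); err != nil {
		return BundleDiff{}, err
	}
	return d, nil
}
```

Validating on both encode and decode keeps a producer from writing, and a consumer from accepting, a diff whose statistics disagree with its contents.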
|
|
|
|
#### Step 5: CLI Integration
|
|
1. Add `diff` command to support-bundle CLI
|
|
2. Implement argument parsing and validation
|
|
3. Add progress reporting and user experience
|
|
4. Create comprehensive CLI testing
|
|
5. Add documentation and examples
|
|
|
|
---
|
|
|
|
## Integration & Testing Strategy
|
|
|
|
### Integration Contracts (Critical Constraints)
|
|
|
|
**Person 2 is a CONSUMER of Person 1's work and must NOT alter schema definitions or CLI contracts.**
|
|
|
|
#### Schema Contract (Owned by Person 1)
|
|
**CRITICAL UPDATE**: Based on current codebase analysis:
|
|
- **Current API Group**: `troubleshoot.replicated.com` (NOT `troubleshoot.sh`)
|
|
- **Current Versions**: `v1beta1` and `v1beta2` are available (NO `v1beta3` exists yet)
|
|
- **Use ONLY** `troubleshoot.replicated.com/v1beta2` CRDs/YAML spec definitions until Person 1 provides schema migration plan
|
|
- **Follow EXACTLY** agreed-upon artifact filenames (`analysis.json`, `diff.json`, `redaction-map.json`, `facts.json`)
|
|
- **NO modifications** to schema definitions, types, or API contracts
|
|
- All schemas act as the cross-team contract with clear compatibility rules
|
|
|
|
#### CLI Contract (Owned by Person 1)
|
|
**CRITICAL UPDATE**: Based on current CLI structure analysis:
|
|
- **Current Structure**: `support-bundle` (root/collect), `support-bundle analyze`, `support-bundle redact`
|
|
- **Existing Flags**: `--namespace`, `--redact`, `--collect-without-permissions`, etc. already available
|
|
- **NEW Commands to Add**: `support-bundle diff` (completely new)
|
|
- **NEW Flags to Add**: `--auto`, `--include-images`, `--rbac-check`, `--agent`
|
|
- **NO changes** to existing CLI surface area, help text, or command structure
|
|
- Must integrate new capabilities into existing command structure
|
|
|
|
#### IO Flow Contract (Owned by Person 2)
|
|
- **Collect/analyze/diff operations** read and write ONLY via defined schemas and filenames
|
|
- **Redaction runs as streaming step** during collection (no intermediate files)
|
|
- All input/output must conform to Person 1's schema specifications
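
The streaming redaction requirement above (no intermediate files) might look roughly like this sketch: data is read line by line from an `io.Reader`, matches are replaced with tokens, and the result is written straight to an `io.Writer`. The token format, the keyed HMAC derivation, and the function names are assumptions for illustration and not the existing redaction engine.

```go
package main

import (
	"bufio"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"regexp"
	"strings"
)

// RedactStream copies src to dst line by line, replacing every match of
// pattern with a token derived from an HMAC of the matched text.
// The same secret always maps to the same token for a given key, and the
// returned map (token -> original) plays the role of a redaction map.
// Every output line ends with "\n", even if the input's last line did not.
func RedactStream(dst io.Writer, src io.Reader, pattern *regexp.Regexp, key []byte) (map[string]string, error) {
	redactionMap := make(map[string]string)
	scanner := bufio.NewScanner(src)
	w := bufio.NewWriter(dst)
	for scanner.Scan() {
		line := pattern.ReplaceAllStringFunc(scanner.Text(), func(match string) string {
			mac := hmac.New(sha256.New, key)
			mac.Write([]byte(match))
			// Truncated to 8 bytes for readability; a real implementation
			// would need to consider collisions.
			token := "***TOKEN_" + hex.EncodeToString(mac.Sum(nil)[:8]) + "***"
			redactionMap[token] = match
			return token
		})
		if _, err := w.WriteString(line + "\n"); err != nil {
			return nil, err
		}
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return redactionMap, w.Flush()
}

// RedactString is a convenience wrapper for small inputs and tests.
func RedactString(input, expr string, key []byte) (string, map[string]string, error) {
	pattern, err := regexp.Compile(expr)
	if err != nil {
		return "", nil, err
	}
	var out strings.Builder
	redactionMap, err := RedactStream(&out, strings.NewReader(input), pattern, key)
	return out.String(), redactionMap, err
}
```

Deriving tokens from a per-bundle key keeps them stable within a bundle while making them hard to guess without the key, which is the property the security testing section asks to verify.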
|
|
|
|
#### Golden Samples Contract
|
|
- Use checked-in example specs and artifacts for contract testing
|
|
- Ensure changes don't break consumers or violate schema contracts
|
|
- Maintain backward compatibility with existing artifact formats
|
|
|
|
### Cross-Component Integration
|
|
|
|
#### Collection → Redaction Pipeline
|
|
```go
// Example integration flow. autoCollector, redactionEngine, and the option
// types are assumed to be provided by the surrounding packages.
func CollectWithRedaction(ctx context.Context, opts CollectionOptions) (*SupportBundle, error) {
	// 1. Auto-discover collectors
	collectors, err := autoCollector.Discover(ctx, opts.DiscoveryOptions)
	if err != nil {
		return nil, err
	}

	// 2. Collect with streaming redaction
	bundle := &SupportBundle{}
	for _, collector := range collectors {
		data, err := collector.Collect(ctx)
		if err != nil {
			// Skip a failing collector so the rest of the bundle is still produced.
			continue
		}

		redactedData, redactionMap, err := redactionEngine.ProcessStream(ctx, data, opts.RedactionOptions)
		if err != nil {
			return nil, err
		}

		bundle.AddFile(collector.OutputPath(), redactedData)
		bundle.AddRedactionMap(redactionMap)
	}

	return bundle, nil
}
```
|
|
|
|
#### Analysis → Remediation Integration
|
|
```go
// Example analysis to remediation flow. AnalysisOptions is a placeholder
// name for the analysis configuration type.
func AnalyzeWithRemediation(ctx context.Context, bundle *SupportBundle, opts AnalysisOptions) (*AnalysisResult, error) {
	// 1. Run analysis
	result, err := analysisEngine.Analyze(ctx, bundle, opts)
	if err != nil {
		return nil, err
	}

	// 2. Generate remediation suggestions for failed analyzers
	for i, analyzerResult := range result.Results {
		if analyzerResult.IsFail() {
			remediation, err := generateRemediation(ctx, analyzerResult)
			if err == nil {
				// Remediation is best effort: a generation failure leaves the result unchanged.
				result.Results[i].Remediation = remediation
			}
		}
	}

	return result, nil
}
```
|
|
|
|
### Comprehensive Testing Strategy
|
|
|
|
#### Unit Testing Requirements
|
|
- [ ] **Coverage Target**: >80% code coverage for all components
|
|
- [ ] **Mock Dependencies**: Mock all external dependencies (K8s API, registries, LLM APIs)
|
|
- [ ] **Error Scenarios**: Test all error paths and edge cases
|
|
- [ ] **Performance**: Unit benchmarks for critical paths
|
|
|
|
#### Integration Testing Requirements
|
|
- [ ] **End-to-End Flows**: Complete collection → redaction → analysis → diff workflows
|
|
- [ ] **Real Cluster Testing**: Integration with actual Kubernetes clusters
|
|
- [ ] **Large Bundle Testing**: Performance with multi-GB support bundles
|
|
- [ ] **Network Conditions**: Testing with limited/intermittent connectivity
|
|
|
|
#### Performance Testing Requirements
|
|
- [ ] **Memory Usage**: Monitor memory consumption during large operations
|
|
- [ ] **CPU Utilization**: Profile CPU usage for optimization opportunities
|
|
- [ ] **I/O Performance**: Test with large files and slow storage
|
|
- [ ] **Concurrency**: Test multi-threaded operations and race conditions
|
|
|
|
#### Security Testing Requirements
|
|
- [ ] **Redaction Completeness**: Verify no sensitive data leakage
|
|
- [ ] **Token Security**: Ensure token unpredictability and uniqueness
|
|
- [ ] **Access Control**: Verify RBAC enforcement
|
|
- [ ] **Input Validation**: Test against malicious inputs
|
|
|
|
### Golden Sample Testing
|
|
- [ ] **Reference Bundles**: Create standard test support bundles
|
|
- [ ] **Expected Outputs**: Define expected analysis, diff, and redaction outputs
|
|
- [ ] **Regression Testing**: Automated comparison against golden outputs
|
|
- [ ] **Schema Validation**: Ensure all outputs conform to schemas
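
Regression testing against golden outputs could use a semantic JSON comparison like the sketch below, so that key order and whitespace do not cause false failures. Treating a few top-level keys (for example a generation timestamp) as volatile is an assumption of this example; `MatchesGolden` and the key names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// MatchesGolden compares produced JSON with a golden sample after decoding,
// ignoring the listed top-level keys whose values are expected to vary
// between runs. It returns nil when the documents are equivalent.
func MatchesGolden(got, golden []byte, volatileKeys ...string) error {
	var gotVal, goldenVal any
	if err := json.Unmarshal(got, &gotVal); err != nil {
		return fmt.Errorf("decode produced output: %w", err)
	}
	if err := json.Unmarshal(golden, &goldenVal); err != nil {
		return fmt.Errorf("decode golden sample: %w", err)
	}
	dropKeys(gotVal, volatileKeys)
	dropKeys(goldenVal, volatileKeys)
	if !reflect.DeepEqual(gotVal, goldenVal) {
		return fmt.Errorf("output does not match golden sample")
	}
	return nil
}

// dropKeys removes top-level keys when the document is a JSON object.
func dropKeys(v any, keys []string) {
	obj, ok := v.(map[string]any)
	if !ok {
		return
	}
	for _, k := range keys {
		delete(obj, k)
	}
}
```

In a test, the golden bytes would typically come from a checked-in file and the produced bytes from the component under test.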
|
|
|
|
---
|
|
|
|
## Documentation Requirements
|
|
|
|
### User Documentation
|
|
- [ ] **Collection Guide**: How to use auto-collectors and namespace scoping
|
|
- [ ] **Redaction Guide**: Redaction profiles, tokenization, and LLM integration
|
|
- [ ] **Analysis Guide**: Agent configuration and remediation interpretation
|
|
- [ ] **Diff Guide**: Bundle comparison workflows and interpretation
|
|
|
|
### Developer Documentation
|
|
- [ ] **API Documentation**: Go doc comments for all public APIs
|
|
- [ ] **Architecture Guide**: Component interaction and data flow
|
|
- [ ] **Extension Guide**: How to add custom agents, analyzers, and processors
|
|
- [ ] **Performance Guide**: Optimization techniques and benchmarks
|
|
|
|
### Configuration Documentation
|
|
- [ ] **Schema Reference**: Complete reference for all configuration options
|
|
- [ ] **Profile Examples**: Example redaction and analysis profiles
|
|
- [ ] **Integration Examples**: Sample integrations with CI/CD and monitoring
|
|
|
|
---
|
|
|
|
## Timeline & Milestones
|
|
|
|
### Month 1: Foundation
|
|
- **Week 1-2**: Auto-collectors and RBAC integration
|
|
- **Week 3-4**: Advanced redaction with tokenization
|
|
|
|
### Month 2: Advanced Features
|
|
- **Week 5-6**: Agent-based analysis system
|
|
- **Week 7-8**: Support bundle differencing
|
|
|
|
### Month 3: Integration & Polish
|
|
- **Week 9-10**: Cross-component integration and testing
|
|
- **Week 11-12**: Documentation, optimization, and release preparation
|
|
|
|
### Key Milestones
|
|
- [ ] **M1**: Auto-discovery working with RBAC (Week 2)
|
|
- [ ] **M2**: Streaming redaction with tokenization (Week 4)
|
|
- [ ] **M3**: Local and hosted agents functional (Week 6)
|
|
- [ ] **M4**: Bundle diffing and remediation (Week 8)
|
|
- [ ] **M5**: Full integration and testing complete (Week 10)
|
|
- [ ] **M6**: Documentation and release ready (Week 12)
|
|
|
|
---
|
|
|
|
## Success Criteria
|
|
|
|
### Functional Requirements
|
|
- [ ] `support-bundle --namespace ns --auto` (collection runs from the root command) produces complete bundles
|
|
- [ ] Redaction with tokenization works with streaming pipeline
|
|
- [ ] Analysis generates structured results with remediation
|
|
- [ ] Bundle diffing produces actionable comparison reports
|
|
|
|
### Performance Requirements
|
|
- [ ] Auto-discovery completes in <30 seconds for typical clusters
|
|
- [ ] Redaction processes 1GB+ bundles without memory issues
|
|
- [ ] Analysis completes in <2 minutes for standard bundles
|
|
- [ ] Diff generation completes in <1 minute for bundle pairs
|
|
|
|
### Quality Requirements
|
|
- [ ] >80% code coverage with comprehensive tests
|
|
- [ ] Zero critical security vulnerabilities
|
|
- [ ] Complete API documentation and user guides
|
|
- [ ] Successful integration with Person 1's schema and CLI contracts
|
|
|
|
---
|
|
|
|
## Final Integration Testing Phase
|
|
|
|
After all components are implemented and unit tested, conduct comprehensive integration testing to verify the complete system works together:
|
|
|
|
### **End-to-End Integration Testing**
|
|
|
|
#### **1. Complete Workflow Testing**
|
|
- [ ] Test full `support-bundle --namespace ns --auto` workflow (collection runs from the root command)
|
|
- [ ] Test auto-discovery → collection → redaction → analysis → diff pipeline
|
|
- [ ] Test CLI integration with real Kubernetes clusters
|
|
- [ ] Test support bundle generation with all auto-discovered collectors
|
|
- [ ] Test complete artifact generation (bundle.tgz, facts.json, redaction-map.json, analysis.json)
|
|
|
|
#### **2. Cross-Component Integration**
|
|
- [ ] Test auto-discovery integration with image metadata collection
|
|
- [ ] Test streaming redaction integration with collection pipeline
|
|
- [ ] Test analysis engine integration with auto-discovered collectors and redacted data
|
|
- [ ] Test support bundle diff functionality with complete bundles
|
|
- [ ] Test remediation suggestions integration with analysis results
|
|
|
|
#### **3. Real-World Scenario Testing**
|
|
- [ ] Test against real Kubernetes clusters with various configurations
|
|
- [ ] Test with different RBAC permission levels and restrictions
|
|
- [ ] Test with various application types (web apps, databases, microservices)
|
|
- [ ] Test with large clusters (1000+ pods, 100+ namespaces)
|
|
- [ ] Test with different container registries (Docker Hub, ECR, GCR, Harbor)
|
|
|
|
#### **4. Performance and Reliability Integration**
|
|
- [ ] Test end-to-end performance with large, complex clusters
|
|
- [ ] Test system reliability with network failures and API errors
|
|
- [ ] Test memory usage and resource consumption across all components
|
|
- [ ] Test concurrent operations and thread safety
|
|
- [ ] Test scalability limits and graceful degradation under load
|
|
|
|
#### **5. Security and Privacy Integration**
|
|
- [ ] Test RBAC enforcement across the entire pipeline
|
|
- [ ] Test redaction effectiveness with real sensitive data
|
|
- [ ] Test token reversibility and data owner access to redaction maps
|
|
- [ ] Test LLM integration security and data locality compliance
|
|
- [ ] Test audit trail completeness across all operations
|
|
|
|
#### **6. User Experience Integration**
|
|
- [ ] Test CLI usability and help documentation
|
|
- [ ] Test configuration file examples and documentation
|
|
- [ ] Test error messages and user feedback across all components
|
|
- [ ] Test progress reporting and operation status visibility
|
|
- [ ] Test troubleshoot.sh ecosystem integration and compatibility
|
|
|
|
#### **7. Artifact and Output Integration**
|
|
- [ ] Test support bundle format compliance and compatibility
|
|
- [ ] Test analysis.json schema validation and tool compatibility
|
|
- [ ] Test diff.json format and visualization integration
|
|
- [ ] Test redaction-map.json usability and token reversal
|
|
- [ ] Test facts.json integration with analysis and visualization tools
|
|
|
|
---
|
|
|
|
## MAJOR CHANGES FROM ORIGINAL PRD
|
|
|
|
This section documents all critical changes made to align the PRD with the actual troubleshoot codebase:
|
|
|
|
### 1. API Schema Reality Check
|
|
- **CHANGED**: API group from `troubleshoot.sh/v1beta3` → `troubleshoot.replicated.com/v1beta2`
|
|
- **REASON**: Current codebase only has v1beta1 and v1beta2, using `troubleshoot.replicated.com` group
|
|
|
|
### 2. Implementation Strategy Shift
|
|
- **CHANGED**: From "build from scratch" → "extend existing systems"
|
|
- **REASON**: Discovered mature, production-ready systems already exist
|
|
- **IMPACT**: Faster implementation, better integration, lower risk
|
|
|
|
### 3. CLI Structure Alignment
|
|
- **CHANGED**: Command structure from separate `support-bundle collect/analyze/diff` subcommands → the existing `support-bundle` root command plus its subcommands, enhanced in place
|
|
- **REASON**: Current structure already has `support-bundle` (collect), `support-bundle analyze`, `support-bundle redact`
|
|
- **NEW**: Only `support-bundle diff` is completely new
|
|
|
|
### 4. Binary Architecture Reality
|
|
- **DISCOVERED**: Multiple binaries already exist (`preflight`, `support-bundle`, `collect`, `analyze`)
|
|
- **IMPACT**: Two-binary approach already partially implemented
|
|
- **FOCUS**: Enhance existing `support-bundle` binary capabilities
|
|
|
|
### 5. Existing System Capabilities
|
|
- **Collection**: 15+ collector types, RBAC integration, progress reporting
|
|
- **Redaction**: Regex-based, multiple redactor types, tracking/reporting
|
|
- **Analysis**: 60+ analyzers, host+cluster analysis, structured results
|
|
- **Support Bundle**: Complete archiving, parsing, metadata system
|
|
|
|
### 6. Removed All Completion Markers
|
|
- **CHANGED**: All completion markers (checked boxes and status labels) → `[ ]` (pending)
|
|
- **REASON**: Work on the new capabilities starts from zero, even though it builds on the existing foundation
|
|
|
|
### 7. Technical Approach Updates
|
|
- **Auto-collectors**: NEW package extending existing collection framework with dual-path approach
|
|
- **Redaction**: ENHANCE existing system with tokenization and streaming
|
|
- **Analysis**: WRAP existing analyzers with agent abstraction layer
|
|
- **Diff**: COMPLETELY NEW capability using existing bundle parsing
|
|
|
|
### 8. Auto-Collectors Foundational Data Definition
|
|
|
|
**What "Foundational Data" Includes**:
|
|
- **Pods**: All pods in target namespace(s) with full spec and status
|
|
- **Deployments/ReplicaSets**: All deployment resources and their managed replica sets
|
|
- **Services**: All service definitions and endpoints
|
|
- **ConfigMaps**: All configuration data (with redaction)
|
|
- **Secrets**: All secret metadata (values redacted by default)
|
|
- **Events**: Recent cluster events for troubleshooting context
|
|
- **Pod Logs**: Container logs from all pods (with retention limits)
|
|
- **Image Facts**: Container image metadata (digests, tags, registry info)
|
|
- **Network Policies**: Any network policies affecting the namespace
|
|
- **RBAC**: Relevant roles, role bindings, service accounts
|
|
|
|
This foundational collection ensures that even without vendor-specific YAML specs, support bundles contain the essential data needed for troubleshooting most Kubernetes issues.
|
|
|
|
This updated PRD provides a realistic, implementable roadmap that leverages existing production-ready code while adding the new capabilities specified in the original requirements. The implementation risk is significantly reduced, and the timeline is more achievable.
|