Files
troubleshoot/roadmap.md
Marc Campbell 35759c47af V1beta3 (#1873)
* Change workflow branch from 'main' to 'v1beta3'

* Auto updater (#1849)

* added auto updater

* updated docs

* commit to trigger actions

* Auto-collectors: foundational discovery, image metadata, CLI integrat… (#1845)

* Auto-collectors: foundational discovery, image metadata, CLI integration; reset PRD markers

* Address PR review feedback

- Implement missing namespace exclude patterns functionality
- Fix image facts collector to use empty Data field instead of static string
- Correct APIVersion to use troubleshoot.sh/v1beta2 consistently

* Fix bug bot issues: API parsing, EOF error, and API group corrections

- Fix RBAC API parsing errors in rbac_checker.go (getAPIGroup/getAPIVersion functions)
- Fix FakeReader EOF error to use standard io.EOF instead of custom error
- Fix incorrect API group from troubleshoot.sh to troubleshoot.replicated.com in run.go

These changes address the issues identified by the bug bot and ensure proper
interface compliance and consistent API group usage.

* Fix multiple bug bot issues

- Fix RBAC API parsing errors in rbac_checker.go (getAPIGroup/getAPIVersion functions)
- Fix FakeReader EOF error to use standard io.EOF instead of custom error
- Fix incorrect API group from troubleshoot.sh to troubleshoot.replicated.com in run.go
- Fix image facts collector Data field to contain structured JSON instead of static strings

These changes address all issues identified by the bug bot and ensure proper
interface compliance, consistent API usage, and meaningful data fields.

* Update auto_discovery.go

* Fix TODO comments in Auto-collector section

Fixed 3 of 4 TODOs as requested in PR review:

1. pkg/collect/images/registry_client.go (line 46):
   - Implement custom CA certificate loading
   - Add x509 import and certificate parsing logic
   - Enables image collection from private registries with custom CAs

2. cmd/troubleshoot/cli/diff.go (line 209):
   - Implement bundle file count functionality
   - Add tar/gzip imports and getFileCountFromBundle() function
   - Properly counts files in support bundle archives (.gz/.tgz)

3. cmd/troubleshoot/cli/run.go (line 338):
   - Replace TODO with clarifying comment about RemoteCollectors usage
   - Confirmed RemoteCollectors are still actively used in preflights

The 4th TODO (diff.go line 196) is left as-is since it's explicitly marked
as Phase 4 future work (Support Bundle Differencing implementation).

Addresses PR review feedback about unimplemented TODO comments.

---------

Co-authored-by: Benjamin Yang <benjaminyang@Benjamins-MacBook-Pro.local>

* resetting make targets and github workflows to support v1beta3 releas… (#1853)

* resetting make targets and github workflows to support v1beta3 release later

* removing generate

* remove

* removing

* removing

* Support bundle diff (#1855)

implemented support bundle diff command

* Preflight docs and template subcommands (#1847)

* Added docs and template subcommands with test files

* uses helm templating preflight yaml files

* merge doc requirements for multiple inputs

* Helm aware rendering and markdown output

* v1beta3 yaml structure better mirrors beta2

* Update sample-preflight-templated.yaml

* Added docs and template subcommands with test files

* uses helm templating preflight yaml files

* merge doc requirements for multiple inputs

* Helm aware rendering and markdown output

* v1beta3 yaml structure better mirrors beta2

* Update sample-preflight-templated.yaml

* Added/updated documentation on subcommands

* Update docs.go

* commit to trigger actions

* Updated yaml spec (#1851)

* v1beta3 spec can be read by preflight

* added test files for ease of testing

* updated v1beta3 guide doc and added tests

* fixed not removing tmp files from v1beta3 processing

* created v1beta2 to v1beta3 converter

* Updated yaml spec (#1863)

* v1beta3 spec can be read by preflight

* added test files for ease of testing

* v1beta3 renderer fixes

* fixed gitignore issue

* Auto support bundle upload (#1860)

* basic auto uploading support bundles

* added upload command

* added default vendor endpoint

* added auth system from replicated cli

* fixed case sensitivity issue in YAML parsing

* support bundle uploads for end customers

* app slug flag and detection without licenseID

* moved v1beta3 examples to proper directory

* does not auto update for package managers (#1850)

* V1beta3 cleanup (#1869)

* moving some files around

* more cleanup

* removing more unused

* update ci for v1beta3 (#1870)

* fmt:

* removing unused examples

* add a v1beta3 fixture

* removing coverage reporting

* adding brew (#1872)

* Fixing testing errors (#1871)

fix: resolve failing unit tests and diff consistency in v1beta3

- Fix readLinesFromReader to return lines WITH newlines (like difflib.SplitLines)
- Update test expectations to match correct function behavior with newlines
- This ensures consistency between streaming and non-streaming diff paths
- Fix timeout test by changing from 10ms to 500ms to eliminate flaky failures

Fixes TestReadLinesFromReader and Test_loadSupportBundleSpecsFromURIs_TimeoutError
Resolves diff output inconsistency between code paths

* Fix/exec textanalyze path clean (#1865)

* created roadmap and yaml claude agent

* Update roadmap.md

* Fix textAnalyze analyzer to auto-match exec collector nested paths

- Auto-detect exec output files (*-stdout.txt, *-stderr.txt, *-errors.json)
- Convert simple filenames to wildcard patterns automatically
- Preserve existing wildcard patterns
- Fixes 'No matching file' errors for exec + textAnalyze workflows

---------

Co-authored-by: Noah Campbell <noah.edward.campbell@gmail.com>

* bump goreleaser to v2

* remove collect binary and risc binary

* remove this check

* add debug logging

* larger runner for release

* dropping goreleaser

* fix syntax

* fix syntax

* goreleaser

* larger

* prerelease auto and more

* publish to directory:

* some more goreleaser/homebrew stuffs

* removing risc

* bump example

* Advanced analysis clean (#1868)

* created roadmap and yaml claude agent

* Update roadmap.md

* feat: Clean advanced analysis implementation - core agents, engine, artifacts

* Remove unrelated files - keep only advanced analysis implementation

* fix: Fix goroutine leak in hosted agent rate limiter

- Added stop channel and stopped flag to RateLimiter struct
- Modified replenishTokens to listen for stop signal and exit cleanly
- Added Stop() method to gracefully shutdown rate limiter
- Added Stop() method to HostedAgent to cleanup rate limiter on shutdown

Fixes cursor bot issue: Rate Limiter Goroutine Leak

* fix: Fix analyzer config and model validation bugs

Bug 1: Analyzer Config Missing File Path
- Added filePath to DeploymentStatus analyzer config in convertAnalyzerToSpec
- Sets namespace-specific path (cluster-resources/deployments/{namespace}.json)
- Falls back to generic path (cluster-resources/deployments.json) if no namespace
- Fixes LocalAgent.analyzeDeploymentStatus backward compatibility

Bug 2: HealthCheck Fails Model Validation
- Changed Ollama model validation from prefix match to exact match
- Prevents false positives where llama2:13b would match request for llama2:7b
- Ensures agent only reports healthy when exact model is available

Both fixes address cursor bot reported issues and maintain backward compatibility.

* fixing lint errors

* fixing lint errors

* adding CLI flags

* fix: resolve linting errors for CI

- Remove unnecessary nil check in host_kernel_configs.go (len() for nil slices is zero)
- Remove unnecessary fmt.Sprintf() calls in ceph.go for static strings
- Apply go fmt formatting fixes

Fixes failing lint CI check

* fix: resolve CI failures in build-test workflow and Ollama tests

1. Fix GitHub Actions workflow logic error:
   - Replace problematic contains() expression with explicit job result checks
   - Properly handle failure and cancelled states for each job
   - Prevents false positive failures in success summary job

2. Fix Ollama agent parseLLMResponse panics:
   - Add proper error handling for malformed JSON in LLM responses
   - Return error when JSON is found but invalid (instead of silent fallback)
   - Add error when no meaningful content can be parsed from response
   - Prevents nil pointer dereference in test assertions

Fixes failing build-test/success and build-test/test CI checks

* fix: resolve all CI failures and cursor bot issues

1. Fix disable-ollama flag logic bug:
   - Remove disable-ollama from advanced analysis trigger condition
   - Prevents unintended advanced analysis mode when no agents registered
   - Allows proper fallback to legacy analysis

2. Fix diff test consistency:
   - Update test expectations to match function behavior (lines with newlines)
   - Ensures consistency between streaming and non-streaming diff paths

3. Fix Ollama agent error handling:
   - Add proper error return for malformed JSON in LLM responses
   - Add meaningful content validation for markdown parsing
   - Prevents nil pointer panics in test assertions

4. Fix analysis engine mock agent:
   - Mock agent now processes and returns results for all provided analyzers
   - Fixes test expectation mismatch (expected 8 results, got 1)

Resolves all failing CI checks: lint, test, and success workflow logic

---------

Co-authored-by: Noah Campbell <noah.edward.campbell@gmail.com>

* Auto-Collect (#1867)

* Fix auto-collector missing files issue

- Add KOTS-aware detection for diagnostic files
- Replace silent RBAC filtering with user warnings
- Enhance error file collection for troubleshooting
- Achieve parity with traditional support bundles

Resolves issue where auto-collector was missing:
- KOTS diagnostic files (now 4 vs 3)
- ConfigMaps (now 6 vs 6)
- Maintains superior log collection (24 vs 0)

Final result: [SUCCESS] comprehensive collection achieved

* fixing bugbog

* fix: resolve production readiness issues in auto-collect branch

1. Fix diff test expectations (lines should have newlines for difflib consistency)
2. Fix preflight tests to use existing v1beta3 example file
3. Fix autodiscovery test context parameter (function signature update)

Resolves TestReadLinesFromReader and preflight v1beta3 test failures

* fix: resolve autodiscovery tests and cursor bot image matching issues

1. Fix cursor bot image matching bug in isKotsadmImage:
   - Replace flawed prefix matching with proper image component detection
   - Handle private registries correctly (registry.company.com/kotsadm/kotsadm:v1.0.0)
   - Prevent false positives with proper delimiter checking
   - Add helper functions: containsImageComponent, splitImagePath, removeTagAndDigest

2. Fix autodiscovery test failures:
   - Add TestMode flag to DiscoveryOptions to control KOTS diagnostic collection
   - Tests use TestMode=true to get only foundational collectors (no KOTS diagnostics)
   - Preserves production behavior while enabling clean testing

Resolves failing TestDiscoverer_DiscoverFoundational tests and cursor bot issues

* Cron job clean (#1862)

* created roadmap and yaml claude agent

* Update roadmap.md

* chore(deps): bump sigstore/cosign-installer from 3.9.2 to 3.10.0 (#1857)

Bumps [sigstore/cosign-installer](https://github.com/sigstore/cosign-installer) from 3.9.2 to 3.10.0.
- [Release notes](https://github.com/sigstore/cosign-installer/releases)
- [Commits](https://github.com/sigstore/cosign-installer/compare/v3.9.2...v3.10.0)

---
updated-dependencies:
- dependency-name: sigstore/cosign-installer
  dependency-version: 3.10.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the security group with 2 updates (#1858)

Bumps the security group with 2 updates: [github.com/vmware-tanzu/velero](https://github.com/vmware-tanzu/velero) and [helm.sh/helm/v3](https://github.com/helm/helm).


Updates `github.com/vmware-tanzu/velero` from 1.16.2 to 1.17.0
- [Release notes](https://github.com/vmware-tanzu/velero/releases)
- [Changelog](https://github.com/vmware-tanzu/velero/blob/main/CHANGELOG.md)
- [Commits](https://github.com/vmware-tanzu/velero/compare/v1.16.2...v1.17.0)

Updates `helm.sh/helm/v3` from 3.18.6 to 3.19.0
- [Release notes](https://github.com/helm/helm/releases)
- [Commits](https://github.com/helm/helm/compare/v3.18.6...v3.19.0)

---
updated-dependencies:
- dependency-name: github.com/vmware-tanzu/velero
  dependency-version: 1.17.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: security
- dependency-name: helm.sh/helm/v3
  dependency-version: 3.19.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: security
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump helm.sh/helm/v3 from 3.18.6 to 3.19.0 in /examples/sdk/helm-template in the security group (#1859)

chore(deps): bump helm.sh/helm/v3

Bumps the security group in /examples/sdk/helm-template with 1 update: [helm.sh/helm/v3](https://github.com/helm/helm).


Updates `helm.sh/helm/v3` from 3.18.6 to 3.19.0
- [Release notes](https://github.com/helm/helm/releases)
- [Commits](https://github.com/helm/helm/compare/v3.18.6...v3.19.0)

---
updated-dependencies:
- dependency-name: helm.sh/helm/v3
  dependency-version: 3.19.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: security
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add cron job support bundle scheduler

Complete implementation with K8s integration:
- pkg/schedule/job.go: Job management and persistence
- pkg/schedule/daemon.go: Real-time scheduler daemon
- pkg/schedule/cli.go: CLI commands (create, list, delete, daemon)
- pkg/schedule/schedule_test.go: Comprehensive unit tests
- cmd/troubleshoot/cli/root.go: CLI integration

* fixing bugbot

* Fix all bugbot errors: auto-update stability, job cooldown timing, and daemon execution

* Deleting Agent

* removed unused flags

* fixing auto-upload

* fixing markdown files

* namespace not required flag for auto collectors to work

* loosened cron job validation

* writes logs to logfile

* fix: resolve autoFromEnv variable scoping issue for CI

- Ensure autoFromEnv variable and its usage are in correct scope
- Fix build errors: declared and not used / undefined variable
- All functionality preserved and tested locally
- Force add to override gitignore

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Noah Campbell <noah.edward.campbell@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: clean tokenization system implementation (#1874)

Core tokenization functionality with minimal file changes:

 Core Features:
- Intelligent tokenization engine (tokenizer.go)
- Context-aware secret classification (PASSWORD, APIKEY, DATABASE, etc.)
- Cross-file correlation with deterministic HMAC-SHA256 tokens
- Optional encrypted mapping for token→original value resolution

 Integration:
- CLI flags: --tokenize, --redaction-map, --encrypt-redaction-map
- Updated all redactor types: literal, single-line, multi-line, YAML
- Support bundle integration with auto-upload compatibility
- Backward compatibility: preserves ***HIDDEN*** when disabled

 Production Ready:
- Only 11 essential files (vs 31 in original PR)
- No excessive test files or documentation
- Clean build, all functionality verified
- Maintains existing redaction behavior by default

Token format: ***TOKEN_<TYPE>_<HASH>*** (e.g., ***TOKEN_PASSWORD_A1B2C3***)

* Removes silent failing (#1877)

* preserves stdout and stderr from collectors

* Delete eliminate-silent-failures.md

* Update host_kernel_modules_test.go

* added error logs when a collector fails to start

* Update host_filesystem_performance_linux.go

* fixed error saving logic inconsistency

* Update collect.go

* Improved error handling for support bundles and redactors for windows (#1878)

* improved error handling and window locking

* Delete all-windows-collectors.yaml

* addressing bugbot concerns

* Update host_tcpportstatus.go

* Update redact.go

* Add regression test suite to github actions

* Update regression-test.yaml

* Update regression-test.yaml

* Update regression-test.yaml

* create test/output directory

* handle node-specific files and multiple report arguments

* simplify comparison to detect code regressions only

* handle empty structural_compare rules

* removed v1beta3 branch from github workflow

* Update Makefile

* removed outdated actions

* Update Makefile

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Noah Campbell <noah.edward.campbell@gmail.com>
Co-authored-by: Benjamin Yang <82779168+bennyyang11@users.noreply.github.com>
Co-authored-by: Benjamin Yang <benjaminyang@Benjamins-MacBook-Pro.local>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-08 10:22:11 -07:00

28 KiB
Raw Blame History

Phased execution plan (actionable)

  1. Foundation & policy (cross-cutting) • Goal: Establish non-negotiable engineering charters, error taxonomy, deterministic I/O, and output envelope. • Do: • Adopt items under “Cross-cutting engineering charters”. • Implement centralized error codes (see “1) Error codes (centralized)”). • Implement JSON output envelope (see “2) Output envelope (JSON mode)”). • Add idempotency key helper (see “3) Idempotency key”). • Ensure deterministic marshaling patterns (see “4) Deterministic marshaling”). • Define config precedence and env aliases (see section E) Config precedence & env aliases). • Add Make targets (see section F) Make targets). • Acceptance: • “Measurable add-on success criteria” items related to CLI output and determinism are satisfied.

  2. Distribution & updates (installers, signing, updater) • Goal: Stop krew; ship Homebrew and curl|bash installers; add secure update with rollback. • Do: • Remove/retire krew guidance; add Homebrew formulas and curl|bash script(s). • Implement “C) Update system (secure + rollback)” including channels, rollback, tamper defense, delta updates (optional later). • Implement “Reproducible, signed, attestable releases” (SBOM, cosign, SLSA, SOURCE_DATE_EPOCH). • Add minimal packaging matrix validation for brew and curl|bash; expand later (see D) Packaging matrix validation (CI)). • Acceptance: • Users can install preflight and support-bundle via brew and curl|bash. • Updater supports --channel, verify, rollback; signatures verified per roadmap details.

  3. API v1beta3 schemas and libraries • Goal: Define and own v1beta3 JSON Schemas and supporting defaulting/validation/conversion libraries within performance budgets. • Do: • Implement “API v1beta3 & schema work (deeper)” sections AD (JSON Schema strategy; defaulting; validation; performance budget). • Add converters and fuzzers per “C) Converters robustness”. • Benchmarks per “D) Performance budget”. • Acceptance: • Schemas published under schemas.troubleshoot.sh/v1beta3/* with $id, $schema, $defs. • Validation/defaulting return structured errors; fuzz and perf budgets pass.

  4. Preflight requirements disclosure command • Goal: Let customers preview requirements offline; render table/json/yaml/md; support templating values. • Do: • Implement “Preflight requirements disclosure (new command)” (preflight requirements), including flags and behaviors. • Implement templating from “Preflight CLI: Values and --set support (templating)”. • Acceptance: • Output validates against docs/preflight-requirements.schema.json and renders within width targets. • Unit and golden tests for table/json/md; fuzz tests for extractor stability.

  5. Docs generator and portal gate/override • Goal: Generate preflight docs with rationale and support portal gate/override flow. • Do: • Implement “Preflight docs & portal flow (hardening)” sections AD (merge engine, docs generator, portal client contract, E2E tests). • Ensure CLI prints requestId on error; implement backoff/idempotency per contract. • Acceptance: • E2E portal tests cover pass/fail/override/429/5xx with retries. • Docs generator emits MD/HTML with i18n hooks and template slots.

  6. Simplified spec model: intents, presets, imports • Goal: Reduce authoring burden via intents for collect/analyze, redaction profiles with tokenize, and preset/import model. • Do: • Implement “Simplified spec model: intents, presets, imports”: intents.collect.auto; intents.analyze.requirements; redact.profile + tokenize; import/extends; selectors/filters; compatibility flags --emit and --explain. • Provide examples and downgrade warnings for v1beta2 emit. • Acceptance: • Deterministic expansion demonstrated; explain output shows generated low-level spec; downgrade warnings reported where applicable.

  7. Public packages & ecosystem factoring • Goal: Establish stable package boundaries to support reuse and avoid logging in libs. • Do: • Create packages listed under “Public packages & ecosystem” (pkg/cli/contract, update, schema, specs/*, docs/render, portal/client). • Export minimal, stable APIs; return structured errors. • Acceptance: • api-diff green or change proposal attached.

  8. CI/CD reinforcement • Goal: End-to-end pipelines for verification, install matrix, benchmarks, supply-chain, and releases. • Do: • Implement pipeline stages listed under “CI/CD reinforcement → Pipelines 15”. • Add static checks (revive/golangci-lint, api-diff rules) per roadmap. • Acceptance: • Pipelines green; supply chain artifacts (SBOM, cosign, SLSA) produced; release flow notarizes and publishes.

  9. Testing strategy, determinism and performance harness, artifacts layout • Goal: Comprehensive unit/contract/fuzz/integration tests, deterministic outputs, and curated fixtures. • Do: • Implement “Testing strategy (Dev 1 scope)” (unit, contract/golden, fuzz/property, integration/matrix tests). • Implement “Determinism & performance” harness and budgets. • Organize artifacts per “Artifacts & layout” and add Make targets for test/fuzz/contracts/e2e/bench. • Acceptance: • Golden tests stable; determinism harness passes under SOURCE_DATE_EPOCH; benchmarks within budgets.

  10. Packaging matrix expansion (optional later) • Goal: Expand beyond brew/curl to scoop and deb/rpm when desired. • Do: • Extend “D) Packaging matrix validation (CI)” to include scoop and deb/rpm installers and tests across OSes. • Acceptance: • Installers validated on ubuntu/macos/windows with smoke commands; macOS notarization verified.

Notes • Each phase references detailed specifications below. Implement phases in order; parallelize sub-items where safe. • If scope for an initial milestone is narrower (e.g., brew/curl only), mark the remaining items as deferred but keep tests/docs ready to expand.

Cross-cutting engineering charters

  1. Contract/stability policy (one pager, checked into repo) • SemVer & windows: major.minor.patch; flags/commands stable for ≥2 minors; deprecations carry --explain-deprecations. • Breaking-change gate: PR must include contracts/CHANGE_PROPOSAL.md + updated goldens + migration notes. • Determinism: Same inputs ⇒ byte-identical outputs (normalized map ordering, sorted slices, stable timestamps with SOURCE_DATE_EPOCH).

  2. Observability & diagnostics • Structured logs (zerolog/zap): --log-format {text,json}, --log-level {info,debug,trace}. • Exit code taxonomy: 0 ok, 1 generic, 2 usage, 3 network, 4 schema, 5 incompatible-api, 6 update-failed, 7 permission, 8 partial-success. • OTel hooks (behind TROUBLESHOOT_OTEL_ENDPOINT): span “loadSpec”, “mergeSpec”, “runPreflight”, “uploadPortal”.

  3. Reproducible, signed, attestable releases • SBOM (cyclonedx/spdx) emitted by GoReleaser. • cosign: sign archives + checksums.txt; produce SLSA provenance attestation. • SOURCE_DATE_EPOCH set in CI to pin archive mtimes.

CLI contracts & packaging (more depth)

A) Machine-readable CLI spec • Generate docs/cli-contracts.json from Cobra tree (name, synopsis, flags, defaults, env aliases, deprecation). • Validate at runtime when TROUBLESHOOT_DEBUG_CONTRACT=1 to catch drift in dev builds. • Use that JSON to: • Autogenerate shell completions for bash/zsh/fish/pwsh. • Render the --help text (single source of truth).

B) UX hardening • TTY detection: progress bars only on TTY; --no-progress to force off. • Color policy: --color {auto,always,never} + NO_COLOR env respected. • Output mode: --output {human,json,yaml} for all read commands. For json, include a top-level "schemaVersion": "cli.v1".

C) Update system (secure + rollback) • Channel support: --channel {stable,rc,nightly} (maps to tags: vX.Y.Z, vX.Y.Z-rc.N, nightly-YYYYMMDD). • Rollback: keep N=2 previous binaries under ~/.troubleshoot/bin/versions/…; preflight update --rollback. • Tamper defense: verify cosign sig for checksums.txt; verify SHA256 of selected asset; fail closed with error code 6. • Delta updates (optional later): if asset .patch exists and base version matches, apply bsdiff; fallback to full.

D) Packaging matrix validation (CI) • Matrix test on ubuntu-latest, macos-latest, windows-latest: • Install via brew, scoop, deb/rpm, curl|bash; then run preflight --version and a sample command. • Gatekeeper: spctl -a -v on macOS; print notarization ticket.

E) Config precedence & env aliases • Per-binary config paths (defaults): • macOS/Linux: • preflight: ~/.config/preflight/config.yaml • support-bundle: ~/.config/support-bundle/config.yaml • Windows: • preflight: %APPDATA%\Troubleshoot\Preflight\config.yaml • support-bundle: %APPDATA%\Troubleshoot\SupportBundle\config.yaml • Optional global fallback (lower precedence): ~/.config/troubleshoot/config.yaml • Precedence: flag > binary env > global env > binary config > global config > default • --config overrides discovery; respects XDG_CONFIG_HOME (Unix) and APPDATA (Windows) • Env aliases: • Global: TROUBLESHOOT_PORTAL_URL, TROUBLESHOOT_API_TOKEN • Binary-scoped: PREFLIGHT_* and SUPPORT_BUNDLE_* (take precedence over TROUBLESHOOT_*)

F) Make targets

make contracts # regen CLI JSON + goldens make sbom # build SBOMs make release-dryrun # goreleaser --skip-publish make e2e-install # spins a container farm to test deb/rpm

API v1beta3 & schema work (deeper)

A) JSON Schema strategy • Give every schema an $id and $schema; publish at schemas.troubleshoot.sh/v1beta3/*.json. • Use $defs for shared primitives (Quantity, Duration, CPUSet, Selector). • Add x-kubernetes-validations parity constraints where applicable (even if not applying as CRD).

B) Defaulting & validation library • pkg/validation/validate.go: returns []FieldError with JSONPointer paths and machine codes. • pkg/defaults/defaults.go: idempotent defaulting; fuzz tests prove no oscillation (fuzz: in -> default -> default == default).

C) Converters robustness • Fuzzers (go1.20+): generate random v1beta1/2 structs, convert→internal→v1beta3→internal and assert invariants (lossless roundtrips where representable). • Report downgrade loss: if v1beta3→v1beta2 drops info, print warning list to stderr and annotate output with x-downgrade-warnings.

D) Performance budget • Load+validate 1MB spec ≤ 150ms p95, 10MB ≤ 800ms p95 on GOARCH=amd64 GitHub runner. • Benchmarks in pkg/apis/bench_test.go enforce budgets.

E) Simplified spec model: intents, presets, imports • Problem: vendors handwrite verbose collector/analyzer lists. Goal: smaller, intent-driven specs that expand deterministically. • Tenets: • Additive, backwards-compatible; loader can expand intents into concrete v1beta2-equivalent structures. • Deterministic expansion (same inputs ⇒ same expansion) with --explain to show the generated low-level spec. • Shorthand over raw lists: “what” not “how”. • Top-level additions (v1beta3): • intents.collect.auto: namespace, profiles, includeKinds, excludeKinds, selectors, size caps. • intents.analyze.requirements: high-level checks (k8sVersion, nodes.cpu/memory, podsReady, storageClass, CRDsPresent…). • redact.profile + tokenize: standard|strict; optional token map emission. • import: versioned presets (preset://k8s/basic@v1) with local vendoring. • extends: URL or preset to inherit from, with override blocks. • Selectors & filters: • labelSelector, fieldSelector, name/glob filters; include/exclude precedence clarified in schema docs. • Compatibility: • --emit v1beta2 to produce a concrete legacy spec; downgrade warnings if some intent cant fully map. • --explain prints the expanded collectors/analyzers to aid review and vendoring. • Example: Preflight with requirements + docs

apiVersion: troubleshoot.sh/v1beta3
kind: Preflight
metadata:
  name: example
requirements:
  - name: Baseline
    docString: "Core Kubernetes and cluster requirements."
    checks:
      - clusterVersion:
          checkName: Kubernetes version
          outcomes:
            - fail:
                when: "< 1.20.0"
                message: This application requires at least Kubernetes 1.20.0, and recommends 1.22.0.
                uri: https://kubernetes.io
            - warn:
                when: "< 1.22.0"
                message: Your cluster meets the minimum version of Kubernetes, but we recommend you update to 1.22.0 or later.
                uri: https://kubernetes.io
            - pass:
                when: ">= 1.22.0"
                message: Your cluster meets the recommended and required versions of Kubernetes.
      - customResourceDefinition:
          checkName: Ingress
          customResourceDefinitionName: ingressroutes.contour.heptio.com
          outcomes:
            - fail:
                message: Contour ingress not found!
            - pass:
                message: Contour ingress found!
      - containerRuntime:
          outcomes:
            - pass:
                when: "== containerd"
                message: containerd container runtime was found.
            - fail:
                message: Did not find containerd container runtime.
      - storageClass:
          checkName: Required storage classes
          storageClassName: "default"
          outcomes:
            - fail:
                message: Could not find a storage class called default.
            - pass:
                message: All good on storage classes
      - distribution:
          outcomes:
            - fail:
                when: "== docker-desktop"
                message: The application does not support Docker Desktop Clusters
            - fail:
                when: "== microk8s"
                message: The application does not support Microk8s Clusters
            - fail:
                when: "== minikube"
                message: The application does not support Minikube Clusters
            - pass:
                when: "== eks"
                message: EKS is a supported distribution
            - pass:
                when: "== gke"
                message: GKE is a supported distribution
            - pass:
                when: "== aks"
                message: AKS is a supported distribution
            - pass:
                when: "== kurl"
                message: KURL is a supported distribution
            - pass:
                when: "== digitalocean"
                message: DigitalOcean is a supported distribution
            - pass:
                when: "== rke2"
                message: RKE2 is a supported distribution
            - pass:
                when: "== k3s"
                message: K3S is a supported distribution
            - pass:
                when: "== oke"
                message: OKE is a supported distribution
            - pass:
                when: "== kind"
                message: Kind is a supported distribution
            - warn:
                message: Unable to determine the distribution of Kubernetes
      - nodeResources:
          checkName: Must have at least 3 nodes in the cluster, with 5 recommended
          outcomes:
            - fail:
                when: "count() < 3"
                message: This application requires at least 3 nodes.
                uri: https://kurl.sh/docs/install-with-kurl/adding-nodes
            - warn:
                when: "count() < 5"
                message: This application recommends at last 5 nodes.
                uri: https://kurl.sh/docs/install-with-kurl/adding-nodes
            - pass:
                message: This cluster has enough nodes.
      - nodeResources:
          checkName: Every node in the cluster must have at least 8 GB of memory, with 32 GB recommended
          outcomes:
            - fail:
                when: "min(memoryCapacity) < 8Gi"
                message: All nodes must have at least 8 GB of memory.
                uri: https://kurl.sh/docs/install-with-kurl/system-requirements
            - warn:
                when: "min(memoryCapacity) < 32Gi"
                message: All nodes are recommended to have at least 32 GB of memory.
                uri: https://kurl.sh/docs/install-with-kurl/system-requirements
            - pass:
                message: All nodes have at least 32 GB of memory.
      - nodeResources:
          checkName: Total CPU Cores in the cluster is 4 or greater
          outcomes:
            - fail:
                when: "sum(cpuCapacity) < 4"
                message: The cluster must contain at least 4 cores
                uri: https://kurl.sh/docs/install-with-kurl/system-requirements
            - pass:
                message: There are at least 4 cores in the cluster
      - nodeResources:
          checkName: Every node in the cluster must have at least 40 GB of ephemeral storage, with 100 GB recommended
          outcomes:
            - fail:
                when: "min(ephemeralStorageCapacity) < 40Gi"
                message: All nodes must have at least 40 GB of ephemeral storage.
                uri: https://kurl.sh/docs/install-with-kurl/system-requirements
            - warn:
                when: "min(ephemeralStorageCapacity) < 100Gi"
                message: All nodes are recommended to have at least 100 GB of ephemeral storage.
                uri: https://kurl.sh/docs/install-with-kurl/system-requirements
            - pass:
                message: All nodes have at least 100 GB of ephemeral storage.

{{- if eq .Values.postgres.enabled true }}
  - name: Postgres
    docString: "Postgres needs a storage class and sufficient memory."
    checks:
      - storageClass:
          checkName: Postgres storage class
          name: "{{ .Values.postgres.storageClassName | default \"default\" }}"
          required: true
      - nodeResources:
          checkName: Postgres memory guidance
          outcomes:
            - fail:
                when: "min(memoryCapacity) < 8Gi"
                message: All nodes must have at least 8 GB of memory for Postgres.
            - warn:
                when: "min(memoryCapacity) < 32Gi"
                message: Nodes are recommended to have at least 32 GB of memory for Postgres.
            - pass:
                message: Nodes have sufficient memory for Postgres.
{{- end }}

{{- if eq .Values.redis.enabled true }}
  - name: Redis
    docString: "Redis needs a storage class and adequate ephemeral storage."
    checks:
      - storageClass:
          checkName: Redis storage class
          name: "{{ .Values.redis.storageClassName | default \"default\" }}"
          required: true
      - nodeResources:
          checkName: Redis ephemeral storage
          outcomes:
            - fail:
                when: "min(ephemeralStorageCapacity) < 40Gi"
                message: All nodes must have at least 40 GB of ephemeral storage for Redis.
            - warn:
                when: "min(ephemeralStorageCapacity) < 100Gi"
                message: Nodes are recommended to have at least 100 GB of ephemeral storage for Redis.
            - pass:
                message: Nodes have sufficient ephemeral storage for Redis.
{{- end }}
•	Presets library:
	•	Versioned URIs (e.g., preset://k8s/basic@v1, preset://app/logs@v1) maintained in-repo and publishable.
	•	"troubleshoot vendor --import" downloads presets to ./vendor/troubleshoot/ for offline builds.

Preflight docs & portal flow (hardening)

A) Merge engine details • Stable key = GroupKind/Name[/Namespace] (e.g., NodeResource/CPU, FilePermission//etc/hosts). • Conflict detection emits a list with reasons: “same key, differing fields: thresholds.min, description”. • Provenance captured on each merged node: • troubleshoot.sh/provenance: vendor|replicated|merged • troubleshoot.sh/merge-conflict: "thresholds.min, description"

B) Docs generator upgrades • Template slots: why, riskLevel {low,med,high}, owner, runbookURL, estimatedTime. • i18n hooks: template lookup by locale --locale es-ES falls back to en-US. • Output MD + self-contained HTML (inline CSS) when --html. --toc adds a nav sidebar.

C) Portal client contract • Auth: Bearer ; optional mTLS later. • Idempotency: Idempotency-Key header derived from spec SHA256. • Backoff: exponential jitter (100ms → 3s, 6 tries) on 429/5xx; code 3 on exhaustion. • Response model:

{ "requestId": "r_abc123", "decision": "pass|override|fail", "reason": "text", "policyVersion": "2025-09-01" }

•	CLI prints requestId on error for support.

D) E2E tests (httptest.Server) • Scenarios: pass, fail, override, 429 with retry-after, 5xx flake, invalid JSON. • Golden transcripts of HTTP exchanges under testdata/e2e/portal.

Public packages & ecosystem

A) Package boundaries

pkg/ cli/contract # cobra->json exporter (no cobra import cycles) update/ # channel, verify, rollback schema/ # embed.FS of JSON Schemas + helpers specs/loader # version sniffing, load any -> internal specs/convert # converters specs/validate # validation library docs/render # md/html generation portal/client # http client + types

•	No logging in libs; return structured errors with codes; callers log.

B) SARIF export (nice-to-have) • --output sarif for preflight results so CI systems ingest findings.

C) Back-compat façade • For integrators, add tiny shim: pkg/legacy/v1beta2loader that calls new loader + converter; mark with Deprecated: GoDoc but stable for a window.

CI/CD reinforcement

Pipelines 1. verify: lint, unit, fuzz (short), contracts, schemas → required. 2. matrix-install: brew/scoop/deb/rpm/curl on 3 OSes. 3. bench: enforce perf budgets. 4. supply-chain: build SBOM, cosign sign/verify, slsa attestation. 5. release (tagged): goreleaser publish, notarize, bump brew/scoop, attach SBOM, cosign attest.

Static checks • revive/golangci-lint with a rule to forbid time.Now() in pure functions; must use injected clock. • api-diff: compare exported pkg/** against last tag; fails on breaking changes without contracts/CHANGE_PROPOSAL.md.

  1. Error codes (centralized)

package xerr type Code int const ( OK Code = iota Usage Network Schema IncompatibleAPI UpdateFailed Permission Partial ) type E struct { Code Code; Op, Msg string; Err error } func (e E) Error() string { return e.Msg } func CodeOf(err error) Code { / unwrap */ }

  1. Output envelope (JSON mode)

{ "schemaVersion": "cli.v1", "tool": "preflight", "version": "1.12.0", "timestamp": "2025-09-09T17:02:33Z", "result": { /* command-specific */ }, "warnings": [], "errors": [] }

  1. Idempotency key

func idemKey(spec []byte) string { sum := sha256.Sum256(spec) return hex.EncodeToString(sum[:]) }

  1. Deterministic marshaling

enc := json.NewEncoder(w) enc.SetEscapeHTML(false) enc.SetIndent("", " ") sort.SliceStable(obj.Items, func(i,j int) bool { return obj.Items[i].Name < obj.Items[j].Name })

Measurable add-on success criteria • preflight --help --output json validates against docs/cli-contracts.schema.json. • make bench passes with stated p95 budgets. • cosign verify-blob succeeds for checksums.txt in CI and on dev machines (docd). • E2E portal tests cover all decision branches and 429/5xx paths with retries observed. • api-diff is green or has an attached change proposal.

Testing strategy (Dev 1 scope)

Unit tests
•	CLI arg parsing: Cobra ExecuteC with table-driven flag sets for both binaries.
•	Config precedence resolver: tmp dirs + OS-specific cases (XDG_CONFIG_HOME/APPDATA).
•	Validation/defaulting libraries: happy/edge cases; structured []FieldError assertions.
•	Portal client: httptest.Server scenarios (pass/fail/override/429/5xx) with retry/backoff checks.
•	Updater: mock release index; cosign verify using test keys; rollback success/failure paths.

Contract/golden tests
•	CLI contracts: generate docs/cli-contracts.json and compare to goldens; update via make contracts.
•	--help rendering snapshots (normalized width/colors) for core commands.
•	Schemas: validate example specs against v1beta3 JSON Schemas; store fixtures in testdata/schemas/.
•	Docs generator: preflight-docs.md/HTML goldens for sample merged specs with provenance.

Fuzz/property tests
•	Converters: v1beta1/2→internal→v1beta3→internal round-trip fuzz; invariants enforced.
•	Defaulting idempotence: default(default(x)) == default(x).

Integration/matrix tests
•	Installers: brew/scoop/deb/rpm/curl on ubuntu/macos/windows; run preflight/support-bundle --version and a smoke command.
•	macOS notarization: spctl -a -v on built binaries.
•	Updater E2E: start mock release server, switch channels, rollback, tamper-detection failure.

Determinism & performance
•	Deterministic outputs under SOURCE_DATE_EPOCH; byte-for-byte stable archives in a test harness.
•	Benchmarks: load+validate budgets (latency + RSS) enforced via go test -bench and thresholds.

Artifacts & layout
•	Fixtures under testdata/: schemas/, cli/, docs/, portal/, updater/ with README explaining regeneration.
•	Make targets: make test, make fuzz-short, make contracts, make e2e-install, make bench.

Preflight CLI: Values and --set support (templating)

• Goal: Let end customers pass Values at runtime to drive a single modular YAML with conditionals. • Scope: preflight gains --values (repeatable) and --set key=value (repeatable), rendered over the input YAML before loading specs. • Template engine: Go text/template + Sprig, with .Values bound. Standard delimiters {{ }}. • Precedence: • --set overrides everything (last one wins when repeated) • Later --values files override earlier ones (left-to-right deep merge) • Defaults embedded in the YAML are lowest precedence • Merge: • Maps: deep-merge • Slices: replace (whole list) • Types: • true|false parsed as bool, numbers as float/int when unquoted, everything else as string • Use quotes to force string: --set image.tag="1.2.3"

Example usage

# combine file values with inline overrides
preflight ./some-preflight-checks.yaml \
  --values ./values.yaml \
  --values ./values-prod.yaml \
  --set postgres.enabled=true \
  --set redis.enabled=false

Minimal Values schema (illustrative)

postgres:
  enabled: false
  storageClassName: default
redis:
  enabled: true
  storageClassName: default

Single-file modular YAML authoring pattern

apiVersion: troubleshoot.sh/v1beta3
kind: Preflight
metadata:
  name: example
requirements:
  - name: Baseline
    docString: "Core Kubernetes requirements."
    checks:
      - k8sVersion: ">=1.22"
      - distribution:
          allow: [eks, gke, aks, kurl, digitalocean, rke2, k3s, oke, kind]
          deny:  [docker-desktop, microk8s, minikube]
      - storageClass:
          name: "default"
          required: true

{{- if eq .Values.postgres.enabled true }}
  - name: Postgres
    docString: "Postgres needs a storage class and sufficient memory."
    checks:
      - storageClass:
          name: "{{ .Values.postgres.storageClassName | default \"default\" }}"
          required: true
      - nodes:
          memoryPerNode: ">=8Gi"
          recommendMemoryPerNode: ">=32Gi"
{{- end }}

{{- if eq .Values.redis.enabled true }}
  - name: Redis
    docString: "Redis needs a storage class and adequate ephemeral storage."
    checks:
      - storageClass:
          name: "{{ .Values.redis.storageClassName | default \"default\" }}"
          required: true
      - nodes:
          ephemeralPerNode: ">=40Gi"
          recommendEphemeralPerNode: ">=100Gi"
{{- end }}

Notes • Keep everything in one YAML; conditionals gate entire requirement blocks. • Authors can still drop down to raw analyzers; the renderer runs before spec parsing, so both styles work. • Add --dry-run to print the rendered spec without executing checks.