Provide the ability to specify multiple Troubleshoot specs in one run of support-bundle or preflight
Troubleshoot doesn't have a modular way for different components to contribute specs. If a software project wanted to include project-specific items, the end user would have a hard time knowing where to find them and how to collect them. Furthermore, if multiple projects were to do this, the user would have to run Troubleshoot multiple times.
Ideally troubleshoot would allow merging of specs to allow building a spec either automatically or influenced by user input to target specific needs.
Goals
Primary goal: Allow folks that develop a particular component to maintain Troubleshoot specs for that component, including Vendors for their application.
Long term goal: allow folks to update the component specs without needing to run any other upgrades.
This proposal initially adds the ability to supply multiple support bundle specs for a single run of support-bundle and/or preflight so that:
- when Troubleshoot runs, the spec it uses is an aggregate of multiple specs from the same source - yaml file, URL, configMap, or (new) CRD
- each software component can contribute a Troubleshoot spec that is specific to that software
- Ownership of each individual spec per component transfers to the owner of that component
This will:
- Ensure support bundles generated by users are comprehensive and contain all needed information for a project maintainer to action
- Allow multiple specs from different projects, should a cluster have multiple projects using Troubleshoot
- Allow software developers to continually update specs for components to easily identify known issues and reduce support noise.
Non Goals
- Compatibility with previous URL, configMap and yaml specs, both Redactors and SupportBundle types
- Maintain the ability to read secrets from the Kubernetes API
Background
When users of Troubleshoot create a support bundle from the CLI, they use a single spec that is provided from:
- A secret installed in the cluster by an application such as KOTS
- A URL like http://kots.io
- some other single example spec, e.g. https://github.com/replicatedhq/troubleshoot-specs
When KOTS collects a support bundle from the KOTS UI, the spec is a merged combination of the following:
- The default spec in the kots code
- The spec provided by the Vendor’s application bundle
- This merged spec is then deduplicated by kots.
Redactors are hard coded into Troubleshoot as well as supplied in the spec.
Although we’re currently making strides in improving the Troubleshoot project by creating new collectors, analyzers, and specs, we have no way to more quickly deliver Troubleshoot improvements to installations in the field without updating the kots application and pushing a new release, or upgrading kots to a new version.
When folks find a support issue that could have been identified by Troubleshoot, we ask them to write a new collector, and/or analyzer, for that information. However, if we do that, the new collector/analyzer is not easily available to end users.
Some of the useful features of Troubleshoot are actually implemented as part of KOTS, whereas open source Troubleshoot should provide them independently.
High-Level Design
Add CRD support to Troubleshoot:
- Design a custom resource (CRD) that allows adding spec(s) to the Kubernetes cluster using kubectl. There is no need to extend this to use API server aggregation.
- Update Troubleshoot to allow consuming the first object found of the new custom resource, by default - i.e. if there are no specs provided on the CLI or entrypoint, use CRD
- Once merge is implemented, update Troubleshoot to consume and merge all the instances of the CRD.
- To minimize code changes in Troubleshoot, we could implement two CRDs, of type: SupportBundle and type: Redactor. However, there is no specific need to separate the two, and it may be advantageous to keep things simple by combining them.
Allow multiple specs to be merged by Troubleshoot:
- create an interface for collectors, analyzers, and redactors to merge
- each collector/analyzer/redactor can use a generic implementation of the interface for the merge, or can use a specific one for that task if that particular collector will benefit from a more intelligent merge.
- the spec merge code in kots can, at this stage, be removed once kots ships CRDs for the Troubleshoot specs
Spec sources:
- Alter Troubleshoot to accept multiple yaml files on the CLI
- Create a CRD containing a spec, have Troubleshoot search for spec CRDs and combine them all when run
- maintain the URL compatibility as is
- maintain the configMap compatibility as is
Detailed Design
CRD: to be designed
Merge:
- define a new interface that provides the merge functionality for collectors, analyzers, and redactors
- there should be a generic implementation of that interface which is used by default for all objects. This can simply use append().
- specific collectors may have alternative implementations of the interface where overrides are required.
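
The merge interface described above could be sketched roughly as follows. This is a minimal illustration, not the Troubleshoot API: the names Mergeable, Merger, AppendMerger, and Registry are all hypothetical, and real collectors, analyzers, and redactors carry far more fields than the stand-in type used here.

```go
package main

import "fmt"

// Mergeable is a hypothetical abstraction over collectors, analyzers,
// and redactors. It is an assumption for illustration only.
type Mergeable interface {
	// Kind identifies the type of item, e.g. "logs" or "runPod".
	Kind() string
}

// Merger combines items of one kind that came from different specs.
type Merger interface {
	Merge(existing, incoming []Mergeable) ([]Mergeable, error)
}

// AppendMerger is the generic default: it concatenates both lists.
type AppendMerger struct{}

func (AppendMerger) Merge(existing, incoming []Mergeable) ([]Mergeable, error) {
	out := make([]Mergeable, 0, len(existing)+len(incoming))
	out = append(out, existing...)
	out = append(out, incoming...)
	return out, nil
}

// Registry returns a specialised Merger for a kind when one has been
// registered, and falls back to AppendMerger otherwise.
type Registry struct {
	overrides map[string]Merger
}

func (r Registry) For(kind string) Merger {
	if m, ok := r.overrides[kind]; ok {
		return m
	}
	return AppendMerger{}
}

// simple is a stand-in item used only to exercise the sketch.
type simple string

func (s simple) Kind() string { return string(s) }

func main() {
	r := Registry{}
	merged, err := r.For("logs").Merge(
		[]Mergeable{simple("logs")},
		[]Mergeable{simple("logs")},
	)
	fmt.Println(len(merged), err)
}
```

Keeping the default as a plain append means a new collector type needs no merge code at all; only the kinds that benefit from smarter handling register an override.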
Particular overrides known at this point in time:
- clusterResources has an optional namespaces config; these should be merged and deduplicated
- Two runPod collectors with the same name and different commands are unmergeable. Only one should be run and an error logged.
- Two configMap collectors can be deduplicated and merged depending on the configurations
- Two logs collectors likely only need deduplication to prevent collecting the same thing twice
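
Two of the overrides listed above can be sketched as plain functions. The RunPod struct and both function names are hypothetical simplifications; in particular, which of two conflicting runPod collectors is kept is not settled by this proposal, and the sketch assumes the first one wins.

```go
package main

import (
	"fmt"
	"reflect"
)

// RunPod is a simplified, hypothetical stand-in for a runPod collector.
type RunPod struct {
	Name    string
	Command []string
}

// MergeNamespaces returns the union of both namespace lists in
// first-seen order with duplicates removed.
func MergeNamespaces(a, b []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, ns := range append(append([]string{}, a...), b...) {
		if !seen[ns] {
			seen[ns] = true
			out = append(out, ns)
		}
	}
	return out
}

// MergeRunPods keeps the first collector seen for each name. Exact
// duplicates are dropped silently; a collector with the same name but a
// different command is dropped and reported as an error.
func MergeRunPods(a, b []RunPod) ([]RunPod, []error) {
	byName := make(map[string]RunPod)
	var out []RunPod
	var errs []error
	for _, p := range append(append([]RunPod{}, a...), b...) {
		prev, ok := byName[p.Name]
		if !ok {
			byName[p.Name] = p
			out = append(out, p)
			continue
		}
		if !reflect.DeepEqual(prev.Command, p.Command) {
			errs = append(errs, fmt.Errorf("runPod %q: conflicting commands, keeping the first", p.Name))
		}
	}
	return out, errs
}

func main() {
	fmt.Println(MergeNamespaces(
		[]string{"default", "kube-system"},
		[]string{"kube-system", "app"},
	))
	pods, errs := MergeRunPods(
		[]RunPod{{Name: "ping", Command: []string{"ping", "-c1", "a"}}},
		[]RunPod{{Name: "ping", Command: []string{"ping", "-c1", "b"}}},
	)
	fmt.Println(len(pods), len(errs))
}
```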
Limitations
Visibility
Users may collect more than intended with the new implementation if all available CRD objects are run. We therefore need to be sure the Troubleshoot client allows for selection of what to run. While the default would be to run everything, a user should be able to just run specific SupportBundle specs based on labels, namespaces, etc. By doing this no functionality is lost.
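
A selection step along these lines could look like the sketch below. The SpecRef and Selector types are hypothetical, as is the rule that an empty selector matches everything (which mirrors the "run everything by default" behaviour described above).

```go
package main

import "fmt"

// SpecRef is a hypothetical summary of one SupportBundle object found
// in the cluster.
type SpecRef struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

// Selector narrows which specs are run. Empty fields match everything,
// so the zero value selects all specs.
type Selector struct {
	Namespaces  []string
	MatchLabels map[string]string
}

// Matches reports whether a spec satisfies every constraint set on the
// selector.
func (s Selector) Matches(r SpecRef) bool {
	if len(s.Namespaces) > 0 {
		found := false
		for _, ns := range s.Namespaces {
			if ns == r.Namespace {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	for k, want := range s.MatchLabels {
		got, ok := r.Labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// Select returns the specs that match the selector, preserving order.
func Select(all []SpecRef, s Selector) []SpecRef {
	var out []SpecRef
	for _, r := range all {
		if s.Matches(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	all := []SpecRef{
		{Name: "app", Namespace: "default", Labels: map[string]string{"component": "app"}},
		{Name: "storage", Namespace: "kube-system", Labels: map[string]string{"component": "storage"}},
	}
	fmt.Println(len(Select(all, Selector{})))
	fmt.Println(len(Select(all, Selector{MatchLabels: map[string]string{"component": "app"}})))
}
```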
Airgap
Some of the spec sources, such as URLs, would not work in an airgapped environment. There may be other considerations around airgap as well.
Airgap installs are intentionally feature limited. Failure to retrieve new versions of specs should fail gracefully and proceed with the run.
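
The graceful-failure behaviour could be sketched as below. Fetcher, ResolveSpec, and offline are hypothetical names; the sketch assumes the spec already installed in the cluster is the fallback, and represents specs as plain strings for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Fetcher retrieves the latest spec from an upstream URL. It is a
// hypothetical hook so the sketch needs no network access.
type Fetcher func(url string) (string, error)

// ResolveSpec tries to refresh a spec from its upstream URL. If the
// fetch fails (for example in an airgapped cluster), it logs the
// failure and returns the installed spec so the run can proceed.
func ResolveSpec(installed, url string, fetch Fetcher) string {
	if url == "" || fetch == nil {
		return installed
	}
	latest, err := fetch(url)
	if err != nil {
		log.Printf("could not refresh spec from %s, using installed version: %v", url, err)
		return installed
	}
	return latest
}

// offline simulates an airgapped environment where every fetch fails.
func offline(url string) (string, error) {
	return "", errors.New("network unreachable")
}

// online simulates a successful fetch of a newer spec.
func online(url string) (string, error) {
	return "spec-v2", nil
}

func main() {
	fmt.Println(ResolveSpec("spec-v1", "https://example.com/spec.yaml", offline))
	fmt.Println(ResolveSpec("spec-v1", "https://example.com/spec.yaml", online))
}
```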
As a follow-up feature to improve support bundle collection, Troubleshoot could gain the ability to help manage and upgrade SupportBundles found in a cluster, in a way that lets the user split discovery, download, and upgrade into separate steps that can be run on different machines.
An example implementation would be:
- A command to find, deduplicate, and output a list of SupportBundles in the cluster. The list would include at least the upstream URL and version currently installed in the cluster.
- A command to take the output of the above command and check for updates, downloading all updated SupportBundles into a single .tgz file.
- A command to apply the above .tgz file to the cluster upgrading any existing SupportBundles with the newer versions.
Assumptions
- all deployments are able to deploy CRDs to the Kubernetes cluster
Testing
Alternatives Considered
Use the existing URL functionality in Troubleshoot
Proposal: Provide a custom URL for each install, that when called by Troubleshoot returns a spec composed of the ‘latest’ for every component spec specified in that installation, much like we do with https://kurl.sh/latest now.
Pros:
- No changes required to Troubleshoot
- Always use the latest specs for components other than the Vendor’s application
Cons:
- Would not work for airgap installations
- Software developers need to update their application in order to get new specs from all dependencies
- Replicated would need to provide a new API/web service to host the specs, reducing the general community applicability of troubleshoot as a stand alone project
- Discourages individual projects from maintaining Troubleshoot specs for that project
Security Considerations
Passing control of component specs to individual projects presents a possibility of reducing the amount of review a spec goes through for each update, allowing a spec provided by, say, a kURL addon to run collectors. The current implementation relies on the kots review process for default specs, plus the review of individual example specs. Depending on the final implementation, it is unlikely that this change itself increases any risk since the delivery of automated and default specs to Troubleshoot is maintained within kots, and/or Troubleshoot itself much as it is already.