mirror of https://github.com/replicatedhq/troubleshoot.git synced 2026-02-14 18:29:53 +00:00

Diamon Wiggins 9c77a0e3da Add sbctl integration proposal and move design directory into docs (#893 )

* add sbctl integration proposal and move design directory into docs

Co-authored-by: Evans Mungai <evans@replicated.com>

2022-12-14 14:52:40 +13:00


Provide the ability to specify multiple Troubleshoot specs in one run of support-bundle or preflight

Troubleshoot doesn't have a modular way for different components to contribute specs. If a software project wanted to include project-specific items, the end user would have a hard time knowing where to find them and how to collect them. Furthermore, if multiple projects did this, the user would have to run Troubleshoot multiple times.

Ideally troubleshoot would allow merging of specs to allow building a spec either automatically or influenced by user input to target specific needs.

Goals

Primary goal: Allow folks that develop a particular component to maintain Troubleshoot specs for that component, including Vendors for their application.

Long term goal: allow folks to update the component specs without needing to run any other upgrades.

This proposal initially adds the ability to supply multiple support bundle specs for a single run of support-bundle and/or preflight so that:

When Troubleshoot runs, the spec it uses is an aggregate of multiple specs from the same source - yaml file, URL, configMap, or (new) CRD
Each software component can contribute a Troubleshoot spec that is specific to that software
Ownership of each individual spec per component transfers to the owner of that component

This will:

Ensure support bundles generated by users are comprehensive and contain all needed information for a project maintainer to action
Allow multiple specs from different projects, should a cluster have multiple projects using Troubleshoot
Allow software developers to continually update specs for their components, so that known issues are easy to identify and support noise is reduced.

Non Goals

Compatibility with previous URL, configMap and yaml specs, both Redactors and SupportBundle types
Maintain the ability to read secrets from the Kubernetes API

Background

When users of Troubleshoot create a support bundle from the CLI, they use a single spec that is provided from:

A secret installed in the cluster by an application such as KOTS
A URL such as http://kots.io
Some other single example spec, e.g. https://github.com/replicatedhq/troubleshoot-specs

When KOTS collects a support bundle from the KOTS UI, the spec is a merged combination of the following:

The default spec in the kots code
The spec provided by the Vendor’s application bundle
This merged spec is deduplicated by kots.

Redactors are hard-coded into Troubleshoot and can also be supplied in the spec.

Although we’re currently improving the Troubleshoot project by creating new collectors, analyzers, and specs, we have no quick way to deliver those improvements to installations in the field: doing so requires either updating the kots application and pushing a new release, or upgrading kots to a new version.

When folks find a support issue that could have been identified by Troubleshoot, we ask them to write a new collector, and/or analyzer, for that information. However, if we do that, the new collector/analyzer is not easily available to end users.

Some of the useful features of Troubleshoot are actually implemented as part of KOTS; open source Troubleshoot should provide them independently.

High-Level Design

Add CRD support to Troubleshoot:

Design a custom resource (CRD) that allows adding spec(s) to the Kubernetes cluster using kubectl. There is no need to extend this to use API server aggregation.
Update Troubleshoot to allow consuming the first object found of the new custom resource, by default - i.e. if there are no specs provided on the CLI or entrypoint, use CRD
Once merge is implemented, update Troubleshoot to consume and merge all the instances of the CRD.
To minimize code changes in Troubleshoot, we could implement two CRDs, of type: SupportBundle and type: Redactor. However, there is no specific need to separate the two, and it may be advantageous to keep things simple by combining them.

Allow multiple specs to be merged by Troubleshoot:

create an interface for collectors, analyzers, and redactors to merge
each collector/analyzer/redactor can use a generic implementation of the interface for the merge, or can use a specific one for that task if that particular collector will benefit from a more intelligent merge.
the spec merge code in kots can, at this stage, be removed once kots ships CRDs for the Troubleshoot specs

Spec sources:

Alter Troubleshoot to accept multiple yaml files on the CLI
Create a CRD containing a spec, have Troubleshoot search for spec CRDs and combine them all when run
maintain the URL compatibility as is
maintain the configMap compatibility as is

Detailed Design

CRD: to be designed

Merge:

define a new interface that provides the merge functionality for collectors, analyzers, and redactors
there should be a generic implementation of that interface which is used by default for all objects. This can simply use append().
specific collectors may have alternative implementations of the interface where overrides are required.
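
The merge interface described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Item`, `Merger`, `AppendMerger`, and `MergeSpecs` are hypothetical and do not come from the Troubleshoot codebase, and `Item` is a reduced stand-in for a real collector, analyzer, or redactor entry.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for one collector, analyzer, or redactor.
type Item struct {
	Kind string // e.g. "logs", "clusterResources" (illustrative values)
	Name string
}

// Merger is a hypothetical interface; the proposal does not name it.
// Merge folds one incoming item into the items merged so far.
type Merger interface {
	Merge(existing []Item, incoming Item) []Item
}

// AppendMerger is the generic default: it simply appends.
type AppendMerger struct{}

func (AppendMerger) Merge(existing []Item, incoming Item) []Item {
	return append(existing, incoming)
}

// MergeSpecs combines several specs into one. A kind-specific Merger from
// overrides is used when present; otherwise the generic AppendMerger is used.
func MergeSpecs(overrides map[string]Merger, specs ...[]Item) []Item {
	var merged []Item
	for _, spec := range specs {
		for _, it := range spec {
			var m Merger = AppendMerger{}
			if o, ok := overrides[it.Kind]; ok {
				m = o
			}
			merged = m.Merge(merged, it)
		}
	}
	return merged
}

func main() {
	a := []Item{{Kind: "logs", Name: "app"}}
	b := []Item{{Kind: "clusterInfo", Name: "info"}, {Kind: "logs", Name: "db"}}
	for _, it := range MergeSpecs(nil, a, b) {
		fmt.Println(it.Kind, it.Name)
	}
}
```

Keeping the default as a plain append means a new collector type needs no merge code at all; only types that benefit from smarter handling register an override.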

Particular overrides known at this point in time:

clusterResources has an optional namespaces config; these should be merged and deduplicated
Two runPod collectors with the same name and different commands are unmergeable. Only one should be run and an error logged.
Two configMap collectors can be deduplicated and merged depending on the configurations
Two logs collectors likely only need deduplication to prevent collecting the same thing twice
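
Two of the overrides listed above might look something like the sketch below. The type `RunPod` and the functions `mergeNamespaces` and `mergeRunPods` are hypothetical simplifications, not Troubleshoot's real types, and keeping the first runPod seen is an assumption, since the proposal only says that one should run.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeNamespaces unions two namespace lists, drops duplicates, and sorts
// the result so the output is stable.
func mergeNamespaces(a, b []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, list := range [][]string{a, b} {
		for _, ns := range list {
			if !seen[ns] {
				seen[ns] = true
				out = append(out, ns)
			}
		}
	}
	sort.Strings(out)
	return out
}

// RunPod is a hypothetical, reduced view of a runPod collector.
type RunPod struct {
	Name    string
	Command string
}

// mergeRunPods keeps the first collector seen for each name (an assumption).
// Exact duplicates are dropped silently; a same-name collector with a
// different command is dropped and reported as an error for logging.
func mergeRunPods(pods []RunPod) ([]RunPod, []error) {
	byName := map[string]RunPod{}
	var kept []RunPod
	var errs []error
	for _, p := range pods {
		prev, ok := byName[p.Name]
		switch {
		case !ok:
			byName[p.Name] = p
			kept = append(kept, p)
		case prev.Command != p.Command:
			errs = append(errs, fmt.Errorf(
				"runPod %q: conflicting commands %q and %q, keeping the first",
				p.Name, prev.Command, p.Command))
		}
	}
	return kept, errs
}

func main() {
	fmt.Println(mergeNamespaces(
		[]string{"default", "kube-system"},
		[]string{"kube-system", "app"}))

	kept, errs := mergeRunPods([]RunPod{
		{Name: "ping", Command: "ping -c1 a"},
		{Name: "ping", Command: "ping -c1 b"},
	})
	fmt.Println(len(kept), len(errs))
	for _, err := range errs {
		fmt.Println("error:", err)
	}
}
```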

Limitations

Visibility

Users may collect more than intended with the new implementation if all available CRD objects are run. We therefore need to be sure the Troubleshoot client allows for selection of what to run. While the default would be to run everything, a user should be able to just run specific SupportBundle specs based on labels, namespaces, etc. By doing this no functionality is lost.
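
As a rough illustration of that selection step, the sketch below filters discovered specs by namespace and labels, with an empty selector meaning "run everything". The `SpecObject` and `Selector` types and the label keys are hypothetical; a real implementation would more likely pass a label selector to the Kubernetes API when listing the custom resources.

```go
package main

import "fmt"

// SpecObject is a hypothetical view of a spec stored as a custom resource.
type SpecObject struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

// Selector narrows which specs run. The zero value selects everything,
// matching the proposed default.
type Selector struct {
	Namespace string            // empty matches any namespace
	Labels    map[string]string // every listed label must match exactly
}

// Matches reports whether the object satisfies the selector.
func (s Selector) Matches(o SpecObject) bool {
	if s.Namespace != "" && s.Namespace != o.Namespace {
		return false
	}
	for k, want := range s.Labels {
		if got, ok := o.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// Select returns the objects that satisfy the selector, in input order.
func Select(objs []SpecObject, s Selector) []SpecObject {
	var out []SpecObject
	for _, o := range objs {
		if s.Matches(o) {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	objs := []SpecObject{
		{Name: "app", Namespace: "default", Labels: map[string]string{"component": "app"}},
		{Name: "storage", Namespace: "storage", Labels: map[string]string{"component": "storage"}},
	}
	fmt.Println(len(Select(objs, Selector{})))
	for _, o := range Select(objs, Selector{Labels: map[string]string{"component": "storage"}}) {
		fmt.Println(o.Name)
	}
}
```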

Airgap

Some of the spec locations such as URL wouldn’t work here. There may be other considerations around airgap as well.

Airgap installs are intentionally feature limited. Failure to retrieve new versions of specs should fail gracefully and proceed with the run.

As a follow-up feature to improve support bundle collection, Troubleshoot could gain the ability to manage and upgrade SupportBundles found in a cluster, splitting discovery, download, and upgrade into separate steps so they could be run on different machines.

An example implementation would be:

A command to find, deduplicate, and output a list of SupportBundles in the cluster. The list would include at least the upstream URL and version currently installed in the cluster.
A command to take the output of the above command and check for updates, downloading all updated SupportBundles into a single .tgz file.
A command to apply the above .tgz file to the cluster upgrading any existing SupportBundles with the newer versions.

Assumptions

all deployments are able to deploy CRDs to the Kubernetes cluster

Testing

Alternatives Considered

Use the existing URL functionality in Troubleshoot

Proposal: Provide a custom URL for each install, that when called by Troubleshoot returns a spec composed of the ‘latest’ for every component spec specified in that installation, much like we do with https://kurl.sh/latest now.

Pros:

No changes required to Troubleshoot
Always use the latest specs for components other than the Vendor’s application

Cons:

Would not work for airgap installations
Software developers need to update their application in order to get new specs from all dependencies
Replicated would need to provide a new API/web service to host the specs, reducing the general community applicability of Troubleshoot as a standalone project
Discourages individual projects from maintaining Troubleshoot specs for that project

Security Considerations

Passing control of component specs to individual projects presents a possibility of reducing the amount of review a spec goes through for each update, allowing a spec provided by, say, a kURL addon to run collectors. The current implementation relies on the kots review process for default specs, plus the review of individual example specs. Depending on the final implementation, it is unlikely that this change itself increases any risk since the delivery of automated and default specs to Troubleshoot is maintained within kots, and/or Troubleshoot itself much as it is already.
