Change design to have collectors write to etcd rather than push to an api-server after discussion with ethan

Chris Sanders
2022-06-28 16:54:42 -05:00
parent 33ddee27ed
commit fe1137d8ec

@@ -14,7 +14,7 @@
## Background
While using the information gathered in a support bundle users were finding it hard to find information while manually reviewing the varous files collected in the bundle. Users have to understand the folder structure, files structure, and process JSON files to find information about the cluster. The [sbctl](https://github.com/replicatedhq/sbctl) project was created to prove out the utility of providing api based access so that users could use existing tools which they already understood. This has been a very successful experiment with feedback being that most users now use this utility as their primary, or only, interface to the support bundle information.
While using the information gathered in a support bundle users were finding it hard to find information while manually reviewing the various files collected in the bundle. Users have to understand the folder structure, files structure, and process JSON files to find information about the cluster. The [sbctl](https://github.com/replicatedhq/sbctl) project was created to prove out the utility of providing api based access so that users could use existing tools which they already understood. This has been a very successful experiment with feedback being that most users now use this utility as their primary, or only, interface to the support bundle information.
There are some drawbacks to the current approach. The `sbctl` project is tightly coupled to troubleshoot's on-disk formats; each Kubernetes API must be implemented in `sbctl` individually and kept up to date as APIs change; and `sbctl` has no plan today to provide similar API-based access to information other than the standard Kubernetes API.
@@ -22,13 +22,21 @@ This proposal is meant to take the learnings from `sbctl` and consider implement
## High-Level Design
To help address standard access to API data, troubleshoot will start an api server backed by etcd which collectors can then use to store collected information rather than writting to custom on-disk locations. This will allow collectors to use the standard API to both collect (from the live API server) and store (to the ephemeral troubleshoot API server) data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. This should remove almsot all maitnance from troubleshoot for implementing API calls to access the collected information.
To help address standard access to API data, troubleshoot will start an etcd instance which collectors can then use to store collected information rather than writing to custom on-disk locations. Collectors that only need to collect Kubernetes API information then do not need to serve up data as the API server will be used to provide access to the collected data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. This should remove almost all maintenance from troubleshoot for implementing API calls to access the collected information.
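As a rough illustration of collectors storing objects where an api-server could later find them, the sketch below builds storage keys under the api-server's default `/registry` etcd prefix. It is only a sketch: `fakeStore` is a hypothetical in-memory stand-in for etcd, `registryKey` and `put` are illustrative names, and the value encoding an api-server would accept is deliberately left open.

```go
package main

import (
	"fmt"
	"path"
)

// registryPrefix is the default etcd key prefix used by kube-apiserver.
const registryPrefix = "/registry"

// registryKey builds the storage key for a built-in resource.
// Cluster-scoped resources (empty namespace) omit the namespace segment.
func registryKey(resource, namespace, name string) string {
	if namespace == "" {
		return path.Join(registryPrefix, resource, name)
	}
	return path.Join(registryPrefix, resource, namespace, name)
}

// fakeStore is a hypothetical in-memory stand-in for an etcd client.
type fakeStore map[string][]byte

// put stores a collected object under its registry key and returns the key.
func (s fakeStore) put(resource, namespace, name string, obj []byte) string {
	key := registryKey(resource, namespace, name)
	s[key] = obj
	return key
}

func main() {
	store := fakeStore{}
	fmt.Println(store.put("pods", "default", "web-0", []byte(`{"kind":"Pod"}`)))
	fmt.Println(store.put("nodes", "", "node-1", []byte(`{"kind":"Node"}`)))
}
```

A real collector would replace `fakeStore` with an etcd client and would need to write values in a form the api-server can decode.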
There will also be information that is desired to be stored that doesn't have any native representation in the api-server today. Examples could include customer collectors which execute into containers and extensions like [metrics server](https://github.com/kubernetes-sigs/metrics-server). These components can be addressed using the built in [Kubernetes API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) which is how `metrics server` itself works to extend and provide api based access to node information. By registering additional API extensions troubleshoot plugins can implement both a collection and an API for retrieving custom information which is accessible and revisioned in the same fashion as the rest of the api-server.
There will also be information that is desired to be stored that doesn't have any native representation in the api-server today. Examples could include custom collectors which execute into containers and extensions like [metrics server](https://github.com/kubernetes-sigs/metrics-server). These components can be addressed using the built in [Kubernetes API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) which is how `metrics server` itself works to extend and provide api based access to node information. By registering additional API extensions troubleshoot plugins can implement both a collection and an API for retrieving custom information which is accessible in the same fashion as the rest of the api-server.
The additional benefit to this, which can be demonstrated with `metrics-server`, is that collectors can be written for any other extension API and be compatible with existing tools, the same way using the api-server provides compatibility with `kubectl`. Using `metrics-server` as an example, a collector can be written to collect node information which can be stored locally. The collector can then implement the standard [GET Endpoints](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) from the metrics-server and reply to them with the results it collected. By doing so, all tools that work with metrics-server today, like `kubectl top`, will also work with information served from a support bundle. Additionally, the collector implementation handles any custom file formats without exposing them to any other tool, making it easy to maintain.
Any other access to the filesystem directly will be modified to instead use the provided api-server. This means analyzers will not directly reference files on disk and will instead run against the api-server to analyze the collected information. This decouples analyzers from how collectors store data providing clean separation of concerns for maintaining both collectors and analyzers. This could also allow analyzers to run against existing clusters providing a use for analyzers independent of support bundle collection. This could encourage additional community contributions of analyzers.
Any other access to the filesystem directly will be modified to instead use the provided api-server. This means analyzers will not directly reference files on disk and will instead run against the api-server to analyze the collected information. This decouples analyzers from how collectors store data providing clean separation of concerns for maintaining both collectors and analyzers. Decoupling these also creates a natural way to implement analyzers that use data from multiple collectors. This could also allow analyzers to run against existing clusters providing a use for analyzers independent of support bundle collection. This could encourage additional community contributions of analyzers.
An initially unintended benefit of using the Aggregation Layer is that any HostCollector using this implementation would be very close to an implementation of an extension which you could install in clusters. This could make HostCollectors also useful to install as a service in a live cluster for operations information about hosts.
## Outstanding design questions
* Is the overhead to write an Aggregation API going to add an unnecessary burden to writing new collector plugins? Can these reasonably be templated to ease collector creation?
* Can data collected from an api-server reasonably be put directly into an etcd still compatible with an api-server?
* Does [Velero](https://velero.io), which can already be configured to collect K8s objects, have a close enough use case that it or its modules could be reused to implement this?
## Detailed Design
@@ -54,3 +62,4 @@ The current `sbctl` project could be left to run it's course independent of this
## Security Considerations
How Redactors are implemented needs to be considered.