From 6d7bf73abafb1d669923fbb2310828a430a7c652 Mon Sep 17 00:00:00 2001 From: Chris Sanders Date: Wed, 15 Jun 2022 12:22:17 -0500 Subject: [PATCH 1/4] Add initial design folder with standard-interfaces design --- design/standard-interfaces.md | 56 +++++++++++++++++++++++++++++++++++ design/template.md | 21 +++++++++++++ 2 files changed, 77 insertions(+) create mode 100644 design/standard-interfaces.md create mode 100644 design/template.md diff --git a/design/standard-interfaces.md b/design/standard-interfaces.md new file mode 100644 index 00000000..23f5a859 --- /dev/null +++ b/design/standard-interfaces.md @@ -0,0 +1,56 @@ +# Provide an extendable API for accessing bundle information + +## Goals + +* Provide API based access to collected information, decoupling other projects from troubleshoot +* Reuse existing APIs when they exist to make the bundle compatible with existing tools without modification +* Plan for extensibility for accessing information beyond just the standard kubernetes api + +## Non Goals + +* Compatibility with previous on-disk formats +* Compatibility with existing collectors without modification + * There should be a plan to allow the implementation of existing collectors + +## Background + +While using the information gathered in a support bundle users were finding it hard to find information while manually reviewing the varous files collected in the bundle. Users have to understand the folder structure, files structure, and process JSON files to find information about the cluster. The [sbctl](https://github.com/replicatedhq/sbctl) project was created to prove out the utility of providing api based access so that users could use existing tools which they already understood. This has been a very successful experiment with feedback being that most users now use this utility as their primary, or only, interface to the support bundle information. + +There are some drawbacks to the current approach. 
The `sbctl` project is tightly coupled to troubleshoot on-disk formats, each kubernetes API must be implemented in `sbctl` individually and will require being kept up to date as APIs change, and `sbctl` has no plan today to provide similar API based access to information other than the standard kubernetes api. + +This proposal is meant to take the learnings from `sbctl` and consider implementing it as a first class feature of troubleshoot while attempting to address the current maintenance and expandability limitations. + +## High-Level Design + +To help address standard access to API data, troubleshoot will start an api server backed by etcd which collectors can then use to store collected information rather than writting to custom on-disk locations. This will allow collectors to have use the standard API to both collect (from the live API server) and store (to the ephemeral troubleshoot API server) data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. This should remove almsot all maitnance from troubleshoot for implementing API calls to access the collected information. + +There will also be information that is desired to be stored that doesn't have any native representation in the api-server today. Examples could include customer collectors which execute into containers and extensions like [metrics server](https://github.com/kubernetes-sigs/metrics-server). These components can be addressed using the built in [Kubernetes API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) which is how `metrics server` itself works to extend and provide api based access to node information. 
By registering additional API extensions troubleshoot plugins can implement both a collection and an API for retrieving custom information which is accessible and revisioned in the same fashion as the rest of the api-server. + +The additional benefit to this, which can be demonstrated with `metrics-server`, is that collectors can be written for any other extension API and be compatible with existing tools the same way using the api-server provides compatibility with `kubectl`. Using `metrics-server` as an example, a collector can be written to collect node information which can be stored locally. The collector can then implement the standard [GET Endpoints](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) from the metrics-server and reply to them with the results it collected. By doing so, all tools that work with metrics-server today, like `kubectl top` will also work with information served from a support bundle. Additionally, the collector implementation handles any custom file formats without exposing that to any other tool making it easy to maintain. + +Any other access to the filesystem directly will be modified to instead use the provided api-server. This means analyzers will not directly reference files on disk and will instead run against the api-server to analyze the collected information. This decouples analyzers from how collectors store data providing clean separation of concerns for maintaining both collectors and analyzers. This could also allow analyzers to run against existing clusters providing a use for analyzers independent of support bundle collection. This could encourage additional community contributions of analyzers. + +## Detailed Design + +TBD + +## Limitations + +Using the actual API server will provide limitations on the version skew which can be collected/displayed. 
This could be addressed by including multiple versions of the kubernetes-api server to allow serving a wide range of support bundles. This limitation likely already exists today but would existing in the tooling that is trying to collect, analyzer, or server the data. + +## Assumptions + +* Serving logs hasn't been designed yet, and presumably can be addressed in the detailed design to provide logs to the api-server in place of kubelet. Ideally this can be done using the standard upstream kubernetes-api server, as it is undesirable to fork it. +* Running the api-server, etcd, and any other supporting services (like kubelet) as go routines, while adding some overhead to the collection process, won't cause a significant RAM or CPU burden when collecting support bundles. + +## Testing + +## Alternatives Considered + +### Keep sbctl separate + +The current `sbctl` project could be left to run its course independent of this project. This leaves the troubleshoot project reliant on a separate project to provide a good user experience for anything other than analyzers. + +## Security Considerations + +Careful consideration needs to be given to how Redactors are implemented.
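To make the proposed API-based access concrete, here is a minimal sketch, in Go using only the standard library, of serving a collected object back over the same REST path a live API server would answer — the pattern `sbctl` proved out. The `collected` map, the pod name, and the handler wiring are illustrative stand-ins, not troubleshoot's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// collected stands in for objects loaded from a support bundle, keyed by
// "<namespace>/<name>". The pod below is a made-up example.
var collected = map[string]map[string]any{
	"default/example-pod": {
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]any{"name": "example-pod", "namespace": "default"},
	},
}

// lookupPod resolves GET /api/v1/namespaces/<ns>/pods/<name> against the
// collected data, the same path a live API server answers.
func lookupPod(urlPath string) (map[string]any, bool) {
	parts := strings.Split(strings.Trim(urlPath, "/"), "/")
	if len(parts) != 6 || parts[0] != "api" || parts[1] != "v1" ||
		parts[2] != "namespaces" || parts[4] != "pods" {
		return nil, false
	}
	pod, ok := collected[parts[3]+"/"+parts[5]]
	return pod, ok
}

func main() {
	// Serve the bundle over HTTP, answering the standard REST paths.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		pod, ok := lookupPod(r.URL.Path)
		if !ok {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(pod)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/api/v1/namespaces/default/pods/example-pod")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, strings.Contains(string(body), `"kind":"Pod"`))
	// prints: 200 true
}
```

Because the path and payload match what `kubectl` expects from a real API server, existing tooling can read from a bundle served this way without modification.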
diff --git a/design/template.md b/design/template.md new file mode 100644 index 00000000..2c052697 --- /dev/null +++ b/design/template.md @@ -0,0 +1,21 @@ +# Title + +## Goals + +## Non Goals + +## Background + +## High-Level Design + +## Detailed Design + +## Limitations + +## Assumptions + +## Testing + +## Alternatives Considered + +## Security Considerations From 33ddee27ed1d0e469f2b5f7b7c54c3a6f6a85210 Mon Sep 17 00:00:00 2001 From: Chris Sanders Date: Wed, 15 Jun 2022 17:57:43 -0500 Subject: [PATCH 2/4] typos --- design/standard-interfaces.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/design/standard-interfaces.md b/design/standard-interfaces.md index 23f5a859..94dea0d0 100644 --- a/design/standard-interfaces.md +++ b/design/standard-interfaces.md @@ -2,7 +2,7 @@ ## Goals -* Provide API based access to collected information, decoupling other projects from troubleshoot +* Provide API based access to collected information, decoupling other projects from troubleshoot on-disk format * Reuse existing APIs when they exist to make the bundle compatible with existing tools without modification * Plan for extensibility for accessing information beyond just the standard kubernetes api @@ -22,7 +22,7 @@ This proposal is meant to take the learnings from `sbctl` and consider implement ## High-Level Design -To help address standard access to API data, troubleshoot will start an api server backed by etcd which collectors can then use to store collected information rather than writting to custom on-disk locations. This will allow collectors to have use the standard API to both collect (from the live API server) and store (to the ephemeral troubleshoot API server) data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. 
This should remove almsot all maitnance from troubleshoot for implementing API calls to access the collected information. +To help address standard access to API data, troubleshoot will start an api server backed by etcd which collectors can then use to store collected information rather than writting to custom on-disk locations. This will allow collectors to use the standard API to both collect (from the live API server) and store (to the ephemeral troubleshoot API server) data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. This should remove almsot all maitnance from troubleshoot for implementing API calls to access the collected information. There will also be information that is desired to be stored that doesn't have any native representation in the api-server today. Examples could include customer collectors which execute into containers and extensions like [metrics server](https://github.com/kubernetes-sigs/metrics-server). These components can be addressed using the built in [Kubernetes API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) which is how `metrics server` itself works to extend and provide api based access to node information. By registering additional API extensions troubleshoot plugins can implement both a collection and an API for retrieving custom information which is accessible and revisioned in the same fashion as the rest of the api-server. @@ -36,7 +36,7 @@ TBD ## Limitations -Using the actual API server will provide limitations on the version skew which can be collected/displayed. This could be addressed by including multiple versions of the kubernetes-api server to allow serving a wide range of support bundles. 
This limitation likely already exists today but would exist in the tooling that is trying to collect, analyze, or serve the data. ## Assumptions From fe1137d8ec931079bc3b7a0f1c3f42901e1ef706 Mon Sep 17 00:00:00 2001 From: Chris Sanders Date: Tue, 28 Jun 2022 16:54:42 -0500 Subject: [PATCH 3/4] Change design to have collectors write to etcd rather than push to an api-server after discussion with ethan --- design/standard-interfaces.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/design/standard-interfaces.md b/design/standard-interfaces.md index 94dea0d0..84d24909 100644 --- a/design/standard-interfaces.md +++ b/design/standard-interfaces.md @@ -14,7 +14,7 @@ ## Background -While using the information gathered in a support bundle users were finding it hard to find information while manually reviewing the varous files collected in the bundle. Users have to understand the folder structure, files structure, and process JSON files to find information about the cluster. The [sbctl](https://github.com/replicatedhq/sbctl) project was created to prove out the utility of providing api based access so that users could use existing tools which they already understood. This has been a very successful experiment with feedback being that most users now use this utility as their primary, or only, interface to the support bundle information. +While using the information gathered in a support bundle users were finding it hard to find information while manually reviewing the various files collected in the bundle.
Users have to understand the folder structure, file structure, and process JSON files to find information about the cluster. The [sbctl](https://github.com/replicatedhq/sbctl) project was created to prove out the utility of providing api based access so that users could use existing tools which they already understood. This has been a very successful experiment with feedback being that most users now use this utility as their primary, or only, interface to the support bundle information. There are some drawbacks to the current approach. The `sbctl` project is tightly coupled to troubleshoot on-disk formats, each kubernetes API must be implemented in `sbctl` individually and will require being kept up to date as APIs change, and `sbctl` has no plan today to provide similar API based access to information other than the standard kubernetes api. @@ -22,13 +22,21 @@ This proposal is meant to take the learnings from `sbctl` and consider implement ## High-Level Design -To help address standard access to API data, troubleshoot will start an api server backed by etcd which collectors can then use to store collected information rather than writting to custom on-disk locations. This will allow collectors to use the standard API to both collect (from the live API server) and store (to the ephemeral troubleshoot API server) data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. This should remove almsot all maitnance from troubleshoot for implementing API calls to access the collected information. +To help address standard access to API data, troubleshoot will start an etcd instance which collectors can then use to store collected information rather than writing to custom on-disk locations.
Collectors that only need to collect Kubernetes API information do not need to serve up data themselves, as the API server will be used to provide access to the collected data. Storing data directly in etcd will allow troubleshoot to later serve up this same information by again starting an api-server and etcd using the previously collected etcd data store. This should remove almost all maintenance from troubleshoot for implementing API calls to access the collected information. -There will also be information that is desired to be stored that doesn't have any native representation in the api-server today. Examples could include customer collectors which execute into containers and extensions like [metrics server](https://github.com/kubernetes-sigs/metrics-server). These components can be addressed using the built in [Kubernetes API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) which is how `metrics server` itself works to extend and provide api based access to node information. By registering additional API extensions troubleshoot plugins can implement both a collection and an API for retrieving custom information which is accessible and revisioned in the same fashion as the rest of the api-server. +There will also be information that is desired to be stored that doesn't have any native representation in the api-server today. Examples could include custom collectors which execute into containers and extensions like [metrics server](https://github.com/kubernetes-sigs/metrics-server). These components can be addressed using the built in [Kubernetes API Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) which is how `metrics server` itself works to extend and provide api based access to node information.
By registering additional API extensions troubleshoot plugins can implement both a collection and an API for retrieving custom information which is accessible in the same fashion as the rest of the api-server. The additional benefit to this, which can be demonstrated with `metrics-server`, is that collectors can be written for any other extension API and be compatible with existing tools the same way that using the api-server provides compatibility with `kubectl`. Using `metrics-server` as an example, a collector can be written to collect node information which can be stored locally. The collector can then implement the standard [GET Endpoints](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) from the metrics-server and reply to them with the results it collected. By doing so, all tools that work with metrics-server today, like `kubectl top`, will also work with information served from a support bundle. Additionally, the collector implementation handles any custom file formats without exposing that to any other tool, making it easy to maintain. -Any other access to the filesystem directly will be modified to instead use the provided api-server. This means analyzers will not directly reference files on disk and will instead run against the api-server to analyze the collected information. +Any other access to the filesystem directly will be modified to instead use the provided api-server. This means analyzers will not directly reference files on disk and will instead run against the api-server to analyze the collected information.
This decouples analyzers from how collectors store data, providing clean separation of concerns for maintaining both collectors and analyzers. Decoupling these also creates a natural way to implement analyzers that use data from multiple collectors. This could also allow analyzers to run against existing clusters, providing a use for analyzers independent of support bundle collection. This could encourage additional community contributions of analyzers. + +An initially unintended benefit of using the Aggregation Layer is that any HostCollector using this implementation would be very close to an implementation of an extension which you could install in clusters. This could make HostCollectors also useful to install as a service in a live cluster for operations information about hosts. + +## Outstanding design questions + +* Is the overhead to write an Aggregation API going to add an unnecessary burden to writing new collector plugins? Can these be templated into a reasonably to ease collector creation? +* Can data collected from an api-server reasonably be put directly into an etcd still compatible with an api-server? + * Does [Velero](velero.io) which already can be configured to collect K8s objects have a close enough use case that it or it's modules could be reused to implement this? ## Detailed Design @@ -54,3 +62,4 @@ The current `sbctl` project could be left to run its course independent of this ## Security Considerations Careful consideration needs to be given to how Redactors are implemented.
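On the open question above about writing collected data directly into etcd: the kube-apiserver keeps each object under a predictable key layout, by default rooted at `/registry` (its `--etcd-prefix` flag). Below is a hedged sketch of that key convention, assuming the default prefix; note that matching keys alone is not sufficient, since the api-server also defaults to protobuf storage encoding (its `--storage-media-type` flag), which any direct-write approach would need to account for:

```go
package main

import (
	"fmt"
	"path"
)

// registryKey builds the etcd key the kube-apiserver would use for an
// object, assuming the default --etcd-prefix of /registry. Cluster-scoped
// resources such as nodes omit the namespace segment.
func registryKey(resource, namespace, name string) string {
	if namespace == "" {
		return path.Join("/registry", resource, name)
	}
	return path.Join("/registry", resource, namespace, name)
}

func main() {
	fmt.Println(registryKey("pods", "default", "example-pod"))
	// prints: /registry/pods/default/example-pod
	fmt.Println(registryKey("nodes", "", "worker-1"))
	// prints: /registry/nodes/worker-1
}
```

The resource, namespace, and object names here are invented examples; a collector writing to etcd this way would still need to serialize objects in whatever encoding the replaying api-server is configured to read.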
+ From ab42cd766f9b93de3f06162170d4e7f1bff0c13b Mon Sep 17 00:00:00 2001 From: Chris Sanders Date: Wed, 13 Jul 2022 13:14:05 -0500 Subject: [PATCH 4/4] Update proposal for PR discussion and capture next steps --- ...aces.md => proposal-standard-interface.md} | 23 +++++++++++++------ 1 file changed, 16 insertions(+), 7 deletions(-) rename design/{standard-interfaces.md => proposal-standard-interface.md} (84%) diff --git a/design/standard-interfaces.md b/design/proposal-standard-interface.md similarity index 84% rename from design/standard-interfaces.md rename to design/proposal-standard-interface.md index 84d24909..156caca2 100644 --- a/design/standard-interfaces.md +++ b/design/proposal-standard-interface.md @@ -32,15 +32,21 @@ Any other access to the filesystem directly will be modified to instead use the An initially unintended benefit of using the Aggregation Layer is that any HostCollector using this implementation would be very close to an implementation of an extension which you could install in clusters. This could make HostCollectors also useful to install as a service in a live cluster for operations information about hosts. -## Outstanding design questions - -* Is the overhead to write an Aggregation API going to add an unnecessary burden to writing new collector plugins? Can these be templated into a reasonably to ease collector creation? -* Can data collected from an api-server reasonably be put directly into an etcd still compatible with an api-server? - * Does [Velero](velero.io) which already can be configured to collect K8s objects have a close enough use case that it or it's modules could be reused to implement this? - ## Detailed Design -TBD +### Outstanding design questions + +1. How reasonable is it to start an api-server as part of troubleshoot? 
Consider the following known implementations that do something like this: + +* [envtest](https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/envtest) - requires binaries present on the machine +* [microk8s implementation](https://github.com/canonical/microk8s/blob/master/build-scripts/patches/0000-Kubelite-integration.patch) - bundles slightly modified binaries +* [k0s uses upstream binaries statically compiled](https://docs.k0sproject.io/v1.23.8+k0s.0/architecture/) - bundles statically compiled binaries that self extract and uses a process monitor to run them + +2. Can you in fact push metadata like "Status" into an api-server or do we have to write directly to etcd? + +* If we can't push to the api-server, is just writing the information directly into etcd something we can do and have a reasonable expectation of compatibility? + +3. Is the overhead to write an Aggregation API going to add an unnecessary burden to writing new collector plugins? Can these be templated to reasonably ease collector creation? ## Limitations @@ -63,3 +69,6 @@ The current `sbctl` project could be left to run its course independent of this ## Security Considerations Careful consideration needs to be given to how Redactors are implemented. +## References + +The original PR discussion can be found [here](https://github.com/replicatedhq/troubleshoot/pull/611)