180 Commits

Author SHA1 Message Date
divolgin
3351c289ab Add GVK to k8s objects in cluster-resources files 2022-02-04 01:31:07 +00:00
Salah Aldeen Al Saleh
6f0cf6550d find total memory instead of available (#525) 2022-01-11 17:02:44 -08:00
Salah Aldeen Al Saleh
d1f341b8ed host system packages collector/analyzer (#506)
* host system packages collector/analyzer
2021-12-10 12:05:21 -08:00
Salah Aldeen Al Saleh
e100e7c478 get container logs for unhealthy pods (#469)
* get container logs for unhealthy pods

Co-authored-by: divolgin <dmitriy@replicated.com>
Co-authored-by: divolgin <divolgin@users.noreply.github.com>
2021-10-28 09:21:14 -07:00
divolgin
ada35eb31c Replicaset collector and analyzer 2021-10-27 20:24:14 +00:00
Salah Aldeen Al Saleh
5a8561a31f include logs for init containers as well (#467) 2021-10-27 10:20:55 -07:00
divolgin
20f1b60f11 Include pod logs for pods that are failing 2021-10-26 00:01:26 +00:00
divolgin
072d2d7a36 Fix ceph collector 2021-10-22 23:01:13 +00:00
Andrew Reed
7b36e6a1f8 Copy in longhorn client (#454) 2021-10-22 15:24:07 -05:00
Rishabh Bohra
cf03503216 feat: Collect custom resources (#447)
* feat: Collect custom resources
Co-authored-by: Martin Hrabovcin <mhrabovcin@users.noreply.github.com>

Co-authored-by: Andrew Reed <andrew@replicated.com>
2021-10-21 16:49:59 -05:00
Jalaja Ganapathy
372454651e collector/analyzer for host operating system (#443)
* collector/analyzer for host operating system

* address cr comments

* cleanup

* fix invoking the analyzer
code cleanup

* fix cr comments

* add corner case unit-test

* fix kernel version parsing

* address review comments

* add default case

* parse using regex

* added more testcases and fixed the bug found in cr

* few small things
2021-10-12 14:42:23 -07:00
Simon Croome
dc8b38d249 Handle k8s api deprecations 2021-10-07 18:55:51 +01:00
Simon Croome
977fc438ea Remote host collectors (#392)
* Add collect command and remote host collectors

Adds the ability to run a host collector on a set of remote k8s nodes.
Target nodes can be filtered using the --selector flag, with the same
syntax as kubectl.  Existing flags for --collector-image,
--collector-pullpolicy and --request-timeout are used.  To run on a
specified node, --selector="kubernetes.io/hostname=kind-worker2" could
be used.
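
As a rough illustration of how an equality-based selector such as the
one above narrows the set of target nodes, here is a minimal sketch. It
is not the project's code: `parse_selector`, `matches`, `select_nodes`
and the node label maps are hypothetical, and only the `=`, `==` and
`!=` operators are handled (kubectl also accepts set-based expressions,
which are left out here).

```python
def parse_selector(selector):
    """Split a comma-separated selector into (key, operator, value) triples."""
    requirements = []
    for term in selector.split(","):
        term = term.strip()
        if not term:
            continue
        # Try "!=" and "==" before "=" so the longer operators win.
        for op in ("!=", "==", "="):
            if op in term:
                key, value = term.split(op, 1)
                normalized = "!=" if op == "!=" else "="
                requirements.append((key.strip(), normalized, value.strip()))
                break
        else:
            raise ValueError(f"unsupported selector term: {term!r}")
    return requirements


def matches(labels, requirements):
    """Return True when a label map satisfies every requirement."""
    for key, op, value in requirements:
        if op == "=" and labels.get(key) != value:
            return False
        # "!=" also matches nodes that do not carry the label at all.
        if op == "!=" and labels.get(key) == value:
            return False
    return True


def select_nodes(nodes, selector):
    """Return the names of nodes whose labels satisfy the selector."""
    requirements = parse_selector(selector)
    return [name for name, labels in nodes.items() if matches(labels, requirements)]


# Hypothetical label maps for a three-node kind cluster.
NODES = {
    "kind-control-plane": {"kubernetes.io/hostname": "kind-control-plane"},
    "kind-worker": {"kubernetes.io/hostname": "kind-worker"},
    "kind-worker2": {"kubernetes.io/hostname": "kind-worker2"},
}

selected = select_nodes(NODES, "kubernetes.io/hostname=kind-worker2")
print(selected)
```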

The collect command is used by the remote collector to output the
results in a "raw" format, which uses the filename as the key and the
output, as an escaped JSON string, as the value.  When run manually it
defaults to fully decoded JSON.  The existing block devices,
ipv4interfaces and services host collectors don't decode properly; the
fix is to convert their slice output to a map (not included here, as it
is unclear what depends on the existing format).

The collect command is also useful for troubleshooting preflight issues.

Examples are included to show remote collector usage.

```
bin/collect --collector-image=croomes/troubleshoot:latest  examples/collect/remote/memory.yaml --namespace test
{
  "kind-control-plane": {
    "system/memory.json": {
      "total": 1304207360
    }
  },
  "kind-worker": {
    "system/memory.json": {
      "total": 1695780864
    }
  },
  "kind-worker2": {
    "system/memory.json": {
      "total": 1726353408
    }
  }
}
```
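
The relationship between the "raw" format and the decoded output shown
above can be sketched as follows. This is an illustration only: the
exact shape of the raw payload is an assumption based on the description
(filename as key, escaped JSON string as value), and `decode_raw` is a
hypothetical helper, not part of the collect command.

```python
import json


def decode_raw(raw_text):
    """Turn a raw result (filename -> escaped JSON string) into decoded JSON."""
    raw = json.loads(raw_text)
    # Each value is itself a JSON document stored as a string, so it
    # needs a second decoding pass.
    return {filename: json.loads(payload) for filename, payload in raw.items()}


# Assumed raw output for one node: the inner document is string-encoded.
raw_sample = json.dumps({"system/memory.json": json.dumps({"total": 1304207360})})

decoded = decode_raw(raw_sample)
print(json.dumps(decoded, indent=2))
```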

The preflight command has been updated to run remote collectors.  To run
a host collector remotely it must be listed under `remoteCollectors` in
the spec:

```
apiVersion: troubleshoot.sh/v1beta2
kind: HostPreflight
metadata:
  name: memory
spec:
  remoteCollectors:
    - memory:
        collectorName: memory
  analyzers:
    - memory:
        outcomes:
          - fail:
              when: "< 8Gi"
              message: At least 8Gi of memory is required
          - warn:
              when: "< 32Gi"
              message: At least 32Gi of memory is recommended
          - pass:
              message: The system has sufficient memory
```

Results for each node are analyzed separately, with the node name
appended to the title:

```
bin/preflight --interactive=false --collector-image=croomes/troubleshoot:latest examples/preflight/remote/memory.yaml --format=json
{memory running 0 1}
{memory completed 1 1}
{
  "fail": [
    {
      "title": "Amount of Memory (kind-worker2)",
      "message": "At least 8Gi of memory is required"
    },
    {
      "title": "Amount of Memory (kind-worker)",
      "message": "At least 8Gi of memory is required"
    },
    {
      "title": "Amount of Memory (kind-control-plane)",
      "message": "At least 8Gi of memory is required"
    }
  ]
}
```
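
The per-node evaluation can be sketched as below, using the memory
totals from the collect example and the outcomes from the spec above.
This is a simplified model, not the analyzer's implementation: the
helper names, the first-match-wins ordering and the quantity parsing
(binary suffixes only) are assumptions.

```python
import operator

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq, "==": operator.eq}


def parse_quantity(text):
    """Convert a quantity such as "8Gi" into a number of bytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * factor)
    return int(text)


def analyze(total, outcomes):
    """Return (status, message) for the first outcome whose condition holds."""
    for outcome in outcomes:
        (status, body), = outcome.items()
        when = body.get("when")
        if when is None:
            return status, body["message"]  # unconditional outcome
        op_text, quantity = when.split()
        if OPS[op_text](total, parse_quantity(quantity)):
            return status, body["message"]
    return None


def analyze_nodes(totals, outcomes, title="Amount of Memory"):
    """Analyze each node separately, appending the node name to the title."""
    results = {}
    for node, total in totals.items():
        verdict = analyze(total, outcomes)
        if verdict is None:
            continue
        status, message = verdict
        results.setdefault(status, []).append(
            {"title": f"{title} ({node})", "message": message})
    return results


OUTCOMES = [
    {"fail": {"when": "< 8Gi", "message": "At least 8Gi of memory is required"}},
    {"warn": {"when": "< 32Gi", "message": "At least 32Gi of memory is recommended"}},
    {"pass": {"message": "The system has sufficient memory"}},
]

# Totals in bytes, taken from the collect output shown earlier.
TOTALS = {
    "kind-control-plane": 1304207360,
    "kind-worker": 1695780864,
    "kind-worker2": 1726353408,
}

results = analyze_nodes(TOTALS, OUTCOMES)
print(results)
```

Each node has well under 8Gi, so all three land in the `fail` list; the
order of entries here follows the input mapping and may differ from the
order the preflight command prints.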

Also added a host collector to allow preflight checks of required kernel
modules, which is the main driver for this change.
2021-10-06 09:03:53 -05:00
Andrew Reed
4d52760d35 Collector and analyzer for sysctl parameters (#441)
Collector and analyzer for sysctl parameters
2021-10-01 13:43:26 -05:00
divolgin
ca51e92878 Allow memory writers 2021-09-30 18:25:52 +00:00
divolgin
6d0a57b16e Don't panic when no data is collected 2021-09-29 21:25:28 +00:00
divolgin
299497c0c0 Merge pull request #429 from danbudris/copyFromHostForCpNodes
add toleration to copy-from-host daemonset to allow collection from CP nodes
2021-09-29 09:01:14 -07:00
divolgin
0e8bedc281 Save collector data to disk directly 2021-09-29 00:15:02 +00:00
danbudris
67987a4432 add toleration to allow copy-from-host daemonset to run on CP nodes 2021-09-23 17:53:57 -04:00
Salah Aldeen Al Saleh
880c7dc3ea ability to specify a list of namespaces for the cluster resources collector (#424)
* ability to specify a list of namespaces for the cluster resources collector
2021-09-23 08:02:05 -07:00
Andrew Lavery
1b65d1a544 Merge pull request #413 from replicatedhq/laverya/collect-jobs-and-cronjobs
collect jobs and cronjobs as part of cluster-resources
2021-09-03 17:25:41 -04:00
Andrew Lavery
7fcc951c9a collect jobs and cronjobs as part of cluster-resources 2021-09-03 15:46:03 -05:00
Dan Stough
0478a7a60f fix: cluster-res collector fixed to one namespace 2021-09-03 19:23:44 +00:00
Kyle Sorensen
bf7d658313 troubleshoot enables collecting all data from a configmap (#395)
Enabled collecting all data from a ConfigMap instead of by key
2021-07-26 13:00:06 -06:00
Ethan Mosbaugh
851c91b582 remove debug log 2021-07-26 16:28:11 +00:00
Ethan Mosbaugh
cf7864cd97 Copy collectors extractArchive property 2021-07-23 13:37:57 +00:00
emosbaugh
8dcfa9886d Copy from host collector (#391)
* Copy from host collector

* namespace improvements

* better support for multiple nodes
2021-07-22 12:25:59 -07:00
kwsorensen
1ed6100ac8 Feature/validate tcp load balancer address (#387)
Load Balancer Validation part of troubleshoot pre-flight checks
2021-07-14 14:30:47 -06:00
emosbaugh
39350b5722 ConfigMap collector and secrets can be collected by selectors (#384)
* ConfigMap collector and secrets can be collected by selectors

* follow docs

* Pass context and kubernetes client to collectors

* collect tests

* analyze tests

* fix tests

* improvements
2021-07-08 16:30:26 -07:00
John Murphy
eef54d0021 force timezone to upper case 2021-07-06 08:42:12 -05:00
Andrew Reed
1ed8532663 Speed up replica checksum 2021-07-01 16:52:59 +00:00
Andrew Reed
3833955a58 Always include longhorn namespace 2021-07-01 15:03:28 +00:00
divolgin
52bbc0f2bf Don't skip TLS validation on http package's default client 2021-06-30 18:22:15 +00:00
Andrew Reed
cb3925a0af Longhorn replica corruption analyzer
This automates the procedure from
https://longhorn.io/docs/1.1.1/advanced-resources/data-recovery/corrupted-replica/
2021-06-22 21:55:12 +00:00
Andrew Reed
a86f5cae7d Collect all longhorn pod logs 2021-05-27 20:14:05 +00:00
Andrew Reed
646f7a6991 Longhorn collector for all CRDs
Also implement a single analyzer as a proof of concept. More analyzers
can be added using the collected CRDs.
2021-05-26 23:37:15 +00:00
Andrew Lavery
25a92dec56 collect rook block device disk stats
this contains both max size and currently used size for each PV
2021-04-20 15:41:47 -05:00
divolgin
39cf553a03 Merge pull request #359 from replicatedhq/divolgin/maxage
Honor maxAge for log collector if set in the spec
2021-04-19 13:26:29 -07:00
divolgin
e5233dfcf5 Honor maxAge for log collector if set in the spec 2021-04-19 20:15:41 +00:00
Andrew Reed
30f21ac71b Fix background IOPS blocking until timeout 2021-04-13 18:55:53 +00:00
Andrew Reed
0a6c9836e0 Add timeout to filesystem performance collector 2021-04-13 18:30:18 +00:00
Andrew Lavery
44993a5d0d collect RGW status as part of ceph collector 2021-04-12 23:14:00 -05:00
Andrew Reed
477cde7228 Benchmark write latency with background IOPS
Add a background IOPS feature to the filesystem performance collector
that specifies separate read and write background IOPS to perform while
measuring latency. This allows for better assessment of whether etcd
will be stable when running alongside other workloads on the same
cluster.

Also add templating to the outcome message of the filesystem performance
analyzers to allow printing individual latency percentiles or the entire
table.

Remove the random IOPS benchmark since it was attempting to perform
unaligned direct I/O.
2021-04-12 22:56:00 +00:00
divolgin
7a0c6e5383 use containers package instead of go-containerregistry 2021-04-11 21:39:44 +00:00
divolgin
fe414af556 Docker registry collector/analyzer 2021-04-09 16:17:15 +00:00
Andrew Lavery
bf4d26acd2 add host_services analyzer 2021-03-30 16:15:18 -04:00
Andrew Lavery
f3b599c19a collect host systemctl services 2021-03-30 16:15:17 -04:00
Salah Aldeen Al Saleh
afa0bc56d4 fix custom redactors file selectors in support bundle subdirectory (#336)
* fix custom redactors file selectors in support bundle subdirectory
2021-03-11 08:45:20 -08:00
Ethan Mosbaugh
4b78c430ca Host preflight ux improvements 2021-03-02 17:27:01 +00:00
Ethan Mosbaugh
09d16ff185 Host preflights exclude 2021-03-01 22:45:16 +00:00