86 Commits

Author SHA1 Message Date
Eric Lin
ce82f2a81b Update config/systemd-monitor.json to match all systemd StatusUnitFormat 2024-01-18 17:16:05 +00:00
Eric Lin
c225435bea Use --revert-pattern to discount proactive restarts 2024-01-17 18:24:24 +00:00
Eric Lin
0fba03ef7a Make pattern match all systemd StatusUnitFormat 2024-01-14 20:02:13 +00:00
Antonio Ojea
552b530e0b custom plugin to monitor iptables versions rules
iptables has two kernel backends, legacy and nft.

Quoting https://developers.redhat.com/blog/2020/08/18/iptables-the-two-variants-and-their-relationship-with-nftables

> It is also important to note that while iptables-nft
> can supplant iptables-legacy, you should never use them simultaneously.

However, we don't want to block the node operations because of this
reason, as there is not enough evidence this is causing big issues in the
wild, so we just signal and warn about this situation.

Once we have more information we can revisit this decision and
keep it as is or move it to permanent.
2023-12-21 09:34:04 +00:00
Jarkko Sonninen
07900633cb Add disk and memory percent_used 2023-10-28 16:03:48 +03:00
Yordis Prieto Lazo
0842910049 chore: fix misspelling 2022-12-18 22:58:07 -05:00
Mike Miranda
1471f74d98 Add ExcludeInterfaceRegexp to Net Dev monitor 2022-06-15 23:22:38 +00:00
Julie Qi
fe09e416bd remove aufs hung check 2021-07-30 13:53:25 -07:00
Kubernetes Prow Robot
e349323507 Merge pull request #539 from smileusd/health_check
improvement health-checker
2021-06-25 09:48:45 -07:00
Jeremy Edwards
d52844ae67 Add HCS empty layer error reporting. 2021-06-22 17:06:42 +00:00
michelletandya
caf2bad7b6 config/windows-defender-monitor.json 2021-05-24 20:08:47 +00:00
tashen
a3b928467e add loopbacktime to reduce time of journalctl call 2021-05-19 13:55:55 +08:00
Kubernetes Prow Robot
9c541692ee Merge pull request #557 from vteratipally/adfad
Make sure the path to known-modules.json is relative
2021-05-14 14:39:59 -07:00
Varsha Teratipally
a79b87ce7e Make sure the path to known-modules.json is relative to the
system-stats-monitor.json file
2021-05-14 21:14:55 +00:00
michelletandya
01fa5b3afd Add windows defender problem detection custom plugin 2021-05-12 20:28:33 +00:00
Jeremy Edwards
d4933875ed Add support for basic system metrics for Windows. 2021-05-10 21:58:38 +00:00
michelletandya
01cd8dd08c Add healthChecker functionality for kube-proxy service 2021-05-05 17:27:58 +00:00
michelletandya
da15eb9afe Detect containerD errors and failures. 2021-04-29 23:47:04 +00:00
michelletandya
c4e5400ed6 separate linux/windows health checker files. 2021-04-26 21:45:05 +00:00
michelletandya
344daabaa7 Update windows containerd config file to run without errors 2021-03-30 23:26:06 +00:00
Jeremy Edwards
4181ece888 Windows Support: Fix Build Regressions, Tests Pass 2021-03-14 10:24:45 -07:00
Kubernetes Prow Robot
06b5503348 Merge pull request #530 from goushicui/master
add memory read error
2021-02-18 07:46:51 -08:00
goushicui
7ecb76f31a add memory read error 2021-02-09 14:08:18 +08:00
Karan Goel
8648fe265a add metric for per-cpu, per-stage timing 2021-01-29 08:46:39 -08:00
Kubernetes Prow Robot
e34e2763cf Merge pull request #519 from Random-Liu/fix-indention
Fix system-stats-monitor config indention.
2021-01-28 23:47:41 -08:00
Lantao Liu
144fad7706 Fix system-stats-monitor config indention. 2021-01-28 22:59:47 -08:00
Lantao Liu
c2ad21a380 Add containerd health checker config. 2021-01-28 22:46:55 -08:00
Karan Goel
2a2bab3d28 Add network interface stats
We do not have to collect these often, so for now set the collection
interval to 120s (even though the Stackdriver exporter is still set to
export every 60s).
2021-01-20 08:56:34 -08:00
Kubernetes Prow Robot
c2d7a7be62 Merge pull request #513 from karan/cpu_activity_metrics
add metrics for process stats
2021-01-19 18:38:07 -08:00
Jeremy Edwards
adc587f222 Support filelog watching in Windows. 2021-01-13 17:16:46 +00:00
Karan Goel
71098097c0 add metrics for process stats
Tested on a COS VM:

```
$ curl -s localhost:20257/metrics | grep "^system_"
system_interrupts_total{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 8.759236e+07
system_processes_total{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 692506
system_procs_blocked{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 0
system_procs_running{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 2
```
2021-01-13 09:14:08 -08:00
varsha teratipally
f89f620909 added new line in the known_modules.json 2021-01-08 23:25:02 +00:00
varsha teratipally
eb38b4b598 added a new metric to retrieve os features like unknown modules 2021-01-08 21:52:16 +00:00
Kubernetes Prow Robot
59536256e3 Merge pull request #475 from vteratipally/boot_size_disk
catching hung task with pattern like "tasks airflow scheduler: *"
2020-11-18 14:42:50 -08:00
vteratipally
0c258bb704 Update kernel-monitor.json 2020-11-17 13:38:07 -08:00
Kubernetes Prow Robot
cff4a54d6a Merge pull request #488 from vteratipally/io_errors
Add detection logic for I/O errors
2020-11-16 14:06:36 -08:00
Kubernetes Prow Robot
2d53c0a2a6 Merge pull request #481 from tosi3k/oom-regex-fix
Adapt OOMKilling pattern to old and new Linux kernels
2020-11-16 14:06:20 -08:00
Karan Goel
925ea7393c Collect CPU load averages in a separate metric 2020-11-09 09:41:52 -08:00
varsha teratipally
f01b5e5cfe Detect I/O errors 2020-11-06 03:48:33 +00:00
Antoni Zawodny
6b650e785e Adapt OOMKilling pattern to old and new Linux kernels 2020-10-22 15:12:26 +02:00
varsha teratipally
f984abbe2e catching hung task with pattern like tasks airflow scheduler: some of the events related to hungtask are not identified 2020-10-08 23:04:15 +00:00
vteratipally
edfd70a16c Update docker-monitor.json
fixed json format error as it doesn't allow trailing commas
2020-08-11 10:02:17 -07:00
vteratipally
fbdd9eec9a Update docker-monitor.json
making DockerContainerStartup failure as temporary
2020-08-11 09:59:46 -07:00
varsha teratipally
4ce29a95d5 removed the $ symbol as npd handles end of the line 2020-08-06 01:30:11 +00:00
varsha teratipally
95237efb4d Detect docker startup failures 2020-08-05 21:29:11 +00:00
Archit Bansal
84188cc0aa Set auto-repair=true by default for health check monitors. 2020-07-15 18:57:53 -07:00
Archit Bansal
44dc4aa6c1 Add health-check-monitor 2020-05-27 14:08:42 -07:00
Xuewei Zhang
83b09277f0 Collect more cpu/disk/memory metrics 2020-02-03 15:29:45 -08:00
Xuewei Zhang
b3f811d171 Add detection for ext4 errors 2019-12-06 14:49:17 -08:00
Kubernetes Prow Robot
3a41fc2fc3 Merge pull request #392 from arekkusu/origin/patch-2
Improve systemctl check, style + cleanup
2019-11-29 01:33:03 -08:00