59 Commits

Author SHA1 Message Date
Karan Goel
2a2bab3d28 Add network interface stats
We do not have to collect these often, so for now set the collection
interval to 120s (even though the Stackdriver exporter is still set to
export every 60s).
2021-01-20 08:56:34 -08:00
Kubernetes Prow Robot
c2d7a7be62 Merge pull request #513 from karan/cpu_activity_metrics
add metrics for process stats
2021-01-19 18:38:07 -08:00
Jeremy Edwards
adc587f222 Support filelog watching in Windows. 2021-01-13 17:16:46 +00:00
Karan Goel
71098097c0 add metrics for process stats
Tested on a COS VM:

```
$ curl -s localhost:20257/metrics | grep "^system_"
system_interrupts_total{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 8.759236e+07
system_processes_total{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 692506
system_procs_blocked{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 0
system_procs_running{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 2
```
2021-01-13 09:14:08 -08:00
varsha teratipally
f89f620909 added new line in the known_modules.json 2021-01-08 23:25:02 +00:00
varsha teratipally
eb38b4b598 added a new metric to retrieve os features like unknown modules 2021-01-08 21:52:16 +00:00
Kubernetes Prow Robot
59536256e3 Merge pull request #475 from vteratipally/boot_size_disk
catching hung task with pattern like "tasks airflow scheduler: *"
2020-11-18 14:42:50 -08:00
vteratipally
0c258bb704 Update kernel-monitor.json 2020-11-17 13:38:07 -08:00
Kubernetes Prow Robot
cff4a54d6a Merge pull request #488 from vteratipally/io_errors
Add detection logic for I/O errors
2020-11-16 14:06:36 -08:00
Kubernetes Prow Robot
2d53c0a2a6 Merge pull request #481 from tosi3k/oom-regex-fix
Adapt OOMKilling pattern to old and new Linux kernels
2020-11-16 14:06:20 -08:00
Karan Goel
925ea7393c Collect CPU load averages in a separate metric 2020-11-09 09:41:52 -08:00
varsha teratipally
f01b5e5cfe Detect I/O errors 2020-11-06 03:48:33 +00:00
Antoni Zawodny
6b650e785e Adapt OOMKilling pattern to old and new Linux kernels 2020-10-22 15:12:26 +02:00
varsha teratipally
f984abbe2e catching hung task with pattern like "tasks airflow scheduler": some of the events related to hung tasks are not identified 2020-10-08 23:04:15 +00:00
vteratipally
edfd70a16c Update docker-monitor.json
fixed json format error as it doesn't allow trailing commas
2020-08-11 10:02:17 -07:00
vteratipally
fbdd9eec9a Update docker-monitor.json
marking DockerContainerStartup failure as temporary
2020-08-11 09:59:46 -07:00
varsha teratipally
4ce29a95d5 removed the $ symbol as npd handles end of the line 2020-08-06 01:30:11 +00:00
varsha teratipally
95237efb4d Detect docker startup failures 2020-08-05 21:29:11 +00:00
Archit Bansal
84188cc0aa Set auto-repair=true by default for health check monitors. 2020-07-15 18:57:53 -07:00
Archit Bansal
44dc4aa6c1 Add health-check-monitor 2020-05-27 14:08:42 -07:00
Xuewei Zhang
83b09277f0 Collect more cpu/disk/memory metrics 2020-02-03 15:29:45 -08:00
Xuewei Zhang
b3f811d171 Add detection for ext4 errors 2019-12-06 14:49:17 -08:00
Kubernetes Prow Robot
3a41fc2fc3 Merge pull request #392 from arekkusu/origin/patch-2
Improve systemctl check, style + cleanup
2019-11-29 01:33:03 -08:00
Alexandre
4df720c2a0 Improve systemctl check, style + cleanup
- Use `systemctl is-active` to check if service is running
  - Cleaner than `grep` on `systemctl status` output
  - Return success means service is running/active
  - Return failure means not running which could be due to
    stopped/failed service or that service does not exist

- Use `command -v` instead of `which`
  Ref: https://github.com/koalaman/shellcheck/wiki/SC2230

- Follow Google "Shell Style Guide": indent, use "readonly"

- Minor: Rephrase comment, avoid all caps
2019-11-29 14:14:19 +09:00
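
The two checks named in the commit above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual script: the function names `service_is_active` and `have_command` are hypothetical, and only `systemctl is-active` and `command -v` come from the commit message.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not taken from the repository.

# Succeeds only when the unit is active. A non-zero status can mean the
# service is stopped, failed, or does not exist.
service_is_active() {
  systemctl -q is-active "$1"
}

# Succeeds when the command can be found, using the shell builtin
# "command -v" instead of the external "which".
have_command() {
  command -v "$1" > /dev/null 2>&1
}
```

A caller might first run `have_command systemctl` and only then `service_is_active` with a unit name, treating a non-zero status as "not running".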
Alexandre
a91b568149 Support "nf_conntrack", check 90% full, style
- Script was checking for "ip_conntrack_..." which was replaced by "nf_conntrack_..." on newer systems. Now supports both.

- Return failure ("not ok") when table is more than 90% full.
  - Not sure what value is best here, but I think that is better than waiting until the table is full.
    Otherwise we might end up with a value close to the max or bouncing around.

- Replaced cat by "$(< file )" to avoid calling external command
- Follow Google "Shell Style Guide": 2 space indent, use preferred "[[ test ]]", add "readonly"
- Include current connection usage in output message
2019-11-29 13:20:37 +09:00
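
The behaviour described in the commit above could look something like the sketch below. It is an illustration under stated assumptions, not the script from the repository: the function names, the exit-code constants, and the `/proc` paths are assumed here, and the 90% threshold is the only value taken from the commit message.

```shell
#!/bin/bash
# Sketch only: names, paths, and exit codes below are assumptions.

readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2
readonly THRESHOLD_PERCENT=90

# Print the contents of the first readable file among the arguments.
# Uses "$(< file)" so no external command such as cat is called.
read_first() {
  local f
  for f in "$@"; do
    if [[ -r "${f}" ]]; then
      printf '%s\n' "$(< "${f}")"
      return 0
    fi
  done
  return 1
}

# Succeed when the table is more than THRESHOLD_PERCENT percent full.
conntrack_too_full() {
  local count="$1"
  local max="$2"
  (( count * 100 > max * THRESHOLD_PERCENT ))
}

# Try the newer "nf_conntrack" files first, then the older "ip_conntrack" ones.
check_conntrack() {
  local count max
  count="$(read_first /proc/sys/net/netfilter/nf_conntrack_count \
    /proc/sys/net/ipv4/netfilter/ip_conntrack_count)" || return "${UNKNOWN}"
  max="$(read_first /proc/sys/net/netfilter/nf_conntrack_max \
    /proc/sys/net/ipv4/netfilter/ip_conntrack_max)" || return "${UNKNOWN}"
  if conntrack_too_full "${count}" "${max}"; then
    echo "Conntrack table usage over ${THRESHOLD_PERCENT}%: ${count} out of ${max}"
    return "${NONOK}"
  fi
  echo "Conntrack table usage: ${count} out of ${max}"
  return "${OK}"
}
```

Comparing `count * 100` against `max * THRESHOLD_PERCENT` keeps the check in integer arithmetic, which is all bash supports natively.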
Kubernetes Prow Robot
5345185ec2 Merge pull request #341 from iranzo/patch-1
Update network_problem.sh
2019-09-15 01:00:37 -07:00
Xuewei Zhang
0f0e5eff0f Adding stackdriver exporter 2019-09-12 18:30:00 -07:00
Pablo Iranzo Gómez
fa94b42849 Use bashate recommendations on network_problem script 2019-09-05 15:46:45 +02:00
Xuewei Zhang
f9b5e60a43 Add e2e test for NPD
The first test is a very simple test. It installs NPD on a VM, and then
verifies that NPD reports metric host_uptime in Prometheus format.
2019-08-16 01:33:29 -07:00
Zhen Wang
a8527712f6 Update the detection method for docker overlay2 issue 2019-08-01 22:16:44 -07:00
Zhen Wang
570ae0cb20 Make systemd monitor look back for 5m 2019-07-30 11:17:02 -07:00
Xuewei Zhang
94af7de97b Report metrics from custom-plugin-monitor 2019-07-25 11:28:38 -07:00
Xuewei Zhang
fbebcf311b Report metrics from system-log-monitor 2019-07-12 14:38:21 -07:00
Xuewei Zhang
4944ac3e48 Implement host collector as part of system-stats-monitor
Host collector reports three things today:
1. Host OS uptime (in seconds)
2. Host kernel version (as a metric label)
3. Host OS version (as a metric label)
2019-06-27 16:40:11 -07:00
Zhen Wang
b94a555dfc Add systemd monitor for kubelet, docker, and containerd restart events 2019-06-18 10:26:53 -07:00
Xuewei Zhang
7ad5dec712 Add disk metrics support. 2019-06-13 00:51:17 -07:00
Andy Xie
33dffe0761 enable condition update when message changes for custom plugin 2018-12-11 13:14:49 +08:00
Zhen Wang
6b983a9ea3 Detect corrupt docker overlay2 2018-11-27 00:35:42 -08:00
Zhen Wang
1f636381b8 Detect kubelet and container runtime frequent crashes 2018-11-26 22:41:06 -08:00
Zhen Wang
ecaa61e7d3 Detect readonly filesystem 2018-11-20 11:20:48 -08:00
Jan Heidbrink
659f31c0f2 Adapt OOMKilling pattern to current kernels 2018-07-31 15:15:45 +02:00
David Ashpole
bf730e9c63 add log-counter go plugin 2018-06-20 15:55:19 -07:00
Jasmine Hegman
76ce35cddc Possibly enhanced network_problem custom plugin
My comment was eaten by GitHub in !152 and I wanted to raise attention in case this was meant to be an exit instead of an echo; otherwise feel free to close!
2018-01-05 11:26:15 -07:00
Rohit Ramkumar
69b6b58ee3 Addressed comments 2017-12-19 08:32:27 -08:00
Rohit Ramkumar
cd472c7765 Add empty conditions list 2017-11-27 11:35:48 -08:00
Rohit Ramkumar
fb12f3b70e Add network monitor script as plugin 2017-11-27 11:33:38 -08:00
Andy Xie
10dbfef1a8 add custom problem detector plugin 2017-11-22 10:14:09 +08:00
Ajit Kumar
d2de52f090 Add rule for docker image pull error 2017-06-21 13:48:58 -07:00
Julius Milan
b579984f0a Fix abrt-adaptor config for cpp problems
This modifies pattern for catching cpp problem messages produced by
ABRT. Found that not all mentioned messages fit into former pattern.
For example following is valid cpp problem message produced by ABRT:

Process xxx (bad_binary) crashed in Will::Fail::a() [clone .isra.2]()

but doesn't fit the former pattern, since its last part contains
whitespace.
2017-05-11 15:40:25 +02:00
Julius Milan
abcf6a4f4b Add ABRT adaptor config 2017-03-23 16:15:56 +01:00