node-problem-detector

mirror of https://github.com/kubernetes/node-problem-detector.git synced 2026-02-23 14:24:00 +00:00

Author	SHA1	Message	Date
Archit Bansal	44dc4aa6c1	Add health-check-monitor	2020-05-27 14:08:42 -07:00
Xuewei Zhang	83b09277f0	Collect more cpu/disk/memory metrics	2020-02-03 15:29:45 -08:00
Xuewei Zhang	b3f811d171	Add detection for ext4 errors	2019-12-06 14:49:17 -08:00
Kubernetes Prow Robot	3a41fc2fc3	Merge pull request #392 from arekkusu/origin/patch-2 Improve systemctl check, style + cleanup	2019-11-29 01:33:03 -08:00
Alexandre	4df720c2a0	Improve systemctl check, style + cleanup - Use `systemctl is-active` to check if service is running - Cleaner than `grep` on `systemctl status` output - Return success means service is running/active - Return failure means not running which could be due to stopped/failed service or that service does not exist - Use `command -v` instead of `which` Ref: https://github.com/koalaman/shellcheck/wiki/SC2230 - Follow Google "Shell Style Guide": indent, use "readonly" - Minor: Rephrase comment, avoid all caps	2019-11-29 14:14:19 +09:00
Alexandre	a91b568149	Support "nf_conntrack", check 90% full, style - Script was checking for "ip_conntrack_..." which was replaced by "nf_conntrack_..." on newer systems. Now supports both. - Return failure ("not ok") when table is more than 90% full. - Not sure what value is best here but I think that is better than when the table is full. Otherwise we might end up with a value close to the max or bouncing around. - Replaced cat by "$(< file )" to avoid calling external command - Follow Google "Shell Style Guide": 2 space indent, use preferred "[[ test ]]", add "readonly" - Include current connection usage in output message	2019-11-29 13:20:37 +09:00
Kubernetes Prow Robot	5345185ec2	Merge pull request #341 from iranzo/patch-1 Update network_problem.sh	2019-09-15 01:00:37 -07:00
Xuewei Zhang	0f0e5eff0f	Adding stackdriver exporter	2019-09-12 18:30:00 -07:00
Pablo Iranzo Gómez	fa94b42849	Use bashate recommendations on network_problem script	2019-09-05 15:46:45 +02:00
Xuewei Zhang	f9b5e60a43	Add e2e test for NPD The first test is a very simple test. It installs NPD on a VM, and then verifies that NPD reports metric host_uptime in Prometheus format.	2019-08-16 01:33:29 -07:00
Zhen Wang	a8527712f6	Update the detection method for docker overlay2 issue	2019-08-01 22:16:44 -07:00
Zhen Wang	570ae0cb20	Make systemd monitor look back for 5m	2019-07-30 11:17:02 -07:00
Xuewei Zhang	94af7de97b	Report metrics from custom-plugin-monitor	2019-07-25 11:28:38 -07:00
Xuewei Zhang	fbebcf311b	Report metrics from system-log-monitor	2019-07-12 14:38:21 -07:00
Xuewei Zhang	4944ac3e48	Implement host collector as part of system-stats-monitor Host collector reports three things today: 1. Host OS uptime (in seconds) 2. Host kernel version (as a metric label) 3. Host OS version (as a metric label)	2019-06-27 16:40:11 -07:00
Zhen Wang	b94a555dfc	Add systemd monitor for kubelet, docker, and containerd restart events	2019-06-18 10:26:53 -07:00
Xuewei Zhang	7ad5dec712	Add disk metrics support.	2019-06-13 00:51:17 -07:00
Andy Xie	33dffe0761	enable condition update when message changes for custom plugin	2018-12-11 13:14:49 +08:00
Zhen Wang	6b983a9ea3	Detect corrupt docker overlay2	2018-11-27 00:35:42 -08:00
Zhen Wang	1f636381b8	Detect kubelet and container runtime frequent crashes	2018-11-26 22:41:06 -08:00
Zhen Wang	ecaa61e7d3	Detect readonly filesystem	2018-11-20 11:20:48 -08:00
Jan Heidbrink	659f31c0f2	Adapt OOMKilling pattern to current kernels	2018-07-31 15:15:45 +02:00
David Ashpole	bf730e9c63	add log-counter go plugin	2018-06-20 15:55:19 -07:00
Jasmine Hegman	76ce35cddc	Possibly enhanced network_problem custom plugin My comment was eaten by github in !152 and wanted to raise attention in case this was meant to be an exit instead of an echo, otherwise feel free to close!	2018-01-05 11:26:15 -07:00
Rohit Ramkumar	69b6b58ee3	Addressed comments	2017-12-19 08:32:27 -08:00
Rohit Ramkumar	cd472c7765	Add empty conditions list	2017-11-27 11:35:48 -08:00
Rohit Ramkumar	fb12f3b70e	Add network monitor script as plugin	2017-11-27 11:33:38 -08:00
Andy Xie	10dbfef1a8	add custom problem detector plugin	2017-11-22 10:14:09 +08:00
Ajit Kumar	d2de52f090	Add rule for docker image pull error	2017-06-21 13:48:58 -07:00
Julius Milan	b579984f0a	Fix abrt-adaptor config for cpp problems This modifies the pattern for catching cpp problem messages produced by ABRT. Found that not all mentioned messages fit into the former pattern. For example, the following is a valid cpp problem message produced by ABRT: Process xxx (bad_binary) crashed in Will::Fail::a() [clone .isra.2]() but it doesn't fit the former pattern, since its last part contains whitespace.	2017-05-11 15:40:25 +02:00
Julius Milan	abcf6a4f4b	Add ABRT adaptor config	2017-03-23 16:15:56 +01:00
Random-Liu	10fc831409	Change kernel specific name in code base and change syslog to filelog.	2017-02-15 13:07:01 -08:00
Random-Liu	27cc831408	Add arbitrary daemon log support	2017-02-10 11:32:35 -08:00
Random-Liu	d281cb8a15	Fix kernel monitor issues: * Change `unregister_netdevice` to be an event to fix #47. * Change `KernelPanic` to `KernelOops` because we can't handle kernel panic currently. * Use system boot time instead of "StartPattern" to fix #48.	2017-02-09 16:09:27 -08:00
Random-Liu	2ef2af99eb	Update Readme.md	2017-01-19 01:59:09 -08:00
Random-Liu	c15d463ad5	Finish the journald support	2017-01-19 01:59:09 -08:00
Lantao Liu	532f933bd8	This PR: 1) Adds lookback support to the kernel monitor. After it starts, the kernel monitor checks older logs to detect problems that happened before the last node reboot. 2) Adds `lookback` and `startPattern` to the kernel monitor configuration. * `lookback` specifies how far back the kernel monitor should look. * `startPattern` specifies which log indicates that the node has started. The kernel monitor clears all current node conditions once it finds a node start log, which makes sure that old problems won't change the node condition. 3) Adds support for kernel panic monitoring: null pointer and divide-by-zero kernel panics are surfaced as events. The kernel monitor usually reports these events during the lookback phase.	2016-08-20 19:11:26 -07:00
Lantao Liu	5b07afd325	1. Make source and conditions configurable. 2. Add multiple events and conditions support in problem interface.	2016-06-02 15:32:02 -07:00
Lantao Liu	8759e4d610	Use Patch instead of UpdateStatus.	2016-05-30 19:22:32 -07:00
Lantao Liu	f0312655bd	Add first version of node-problem-detector	2016-05-17 15:55:33 -07:00

40 Commits