19 Commits

Author SHA1 Message Date
goushicui
7ecb76f31a add memory read error 2021-02-09 14:08:18 +08:00
Kubernetes Prow Robot
59536256e3 Merge pull request #475 from vteratipally/boot_size_disk
catching hung task with pattern like "tasks airflow scheduler: *"
2020-11-18 14:42:50 -08:00
vteratipally
0c258bb704 Update kernel-monitor.json 2020-11-17 13:38:07 -08:00
Kubernetes Prow Robot
cff4a54d6a Merge pull request #488 from vteratipally/io_errors
Add detection logic for I/O errors
2020-11-16 14:06:36 -08:00
varsha teratipally
f01b5e5cfe Detect I/O errors 2020-11-06 03:48:33 +00:00
Antoni Zawodny
6b650e785e Adapt OOMKilling pattern to old and new Linux kernels 2020-10-22 15:12:26 +02:00
varsha teratipally
f984abbe2e catching hung task with pattern like "task airflow scheduler:"; some of the events related to hung tasks are not identified 2020-10-08 23:04:15 +00:00
Xuewei Zhang
b3f811d171 Add detection for ext4 errors 2019-12-06 14:49:17 -08:00
Xuewei Zhang
fbebcf311b Report metrics from system-log-monitor 2019-07-12 14:38:21 -07:00
Zhen Wang
ecaa61e7d3 Detect readonly filesystem 2018-11-20 11:20:48 -08:00
Jan Heidbrink
659f31c0f2 Adapt OOMKilling pattern to current kernels 2018-07-31 15:15:45 +02:00
Random-Liu
27cc831408 Add arbitrary daemon log support 2017-02-10 11:32:35 -08:00
Random-Liu
d281cb8a15 Fix kernel monitor issues:
* Change `unregister_netdevice` to be an event to fix #47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix #48.
2017-02-09 16:09:27 -08:00
Random-Liu
2ef2af99eb Update Readme.md 2017-01-19 01:59:09 -08:00
Random-Liu
c15d463ad5 Finish the journald support 2017-01-19 01:59:09 -08:00
Lantao Liu
532f933bd8 This PR:
1) Adds lookback support to the kernel monitor. After starting, the kernel
monitor checks older logs to detect problems that happened before the last
node reboot.
2) Adds `lookback` and `startPattern` to the kernel monitor configuration.
  * `lookback` specifies how far back the kernel monitor should look.
  * `startPattern` specifies which log line indicates that the node has started.
  The kernel monitor clears all current node conditions once it finds
  a node start log. This ensures that old problems won't change the
  node condition.
3) Adds support for kernel panic monitoring: null pointer and divide-by-zero
kernel panics are surfaced as events. The kernel monitor usually
reports these events during the lookback phase.
2016-08-20 19:11:26 -07:00
Lantao Liu
5b07afd325 1. Make source and conditions configurable.
2. Add multiple events and conditions support in problem interface.
2016-06-02 15:32:02 -07:00
Lantao Liu
8759e4d610 Use Patch instead of UpdateStatus. 2016-05-30 19:22:32 -07:00
Lantao Liu
f0312655bd Add first version of node-problem-detector 2016-05-17 15:55:33 -07:00