Owen Strain
556ff0cc7a
Update log pattern to match kubelet service name from kubelet .deb package
2025-12-26 23:10:06 +00:00
Adrian Moisey
bc3150aa62
Update config/plugin/dns_problem.sh
...
Co-authored-by: Dan Winship <danwinship@redhat.com>
2025-09-22 15:11:07 +02:00
Adrian Moisey
3f139c4165
Add DNS lookup plugin
2025-07-21 21:38:49 +02:00
Jian Wen
5562632053
Add UEFI Common Platform Error Record (CPER) support
...
CPER is the format used to describe platform hardware errors in various
tables, such as ERST, BERT, and HEST.
The event severity message is printed here:
https://github.com/torvalds/linux/blob/v6.7/drivers/firmware/efi/cper.c#L639
Examples are as below.
Corrected error:
kernel: {37}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 162
kernel: {37}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {37}[Hardware Error]: event severity: corrected
kernel: {37}[Hardware Error]: Error 0, type: corrected
kernel: {37}[Hardware Error]: section_type: memory error
kernel: {37}[Hardware Error]: error_status: 0x0000000000000400
kernel: {37}[Hardware Error]: physical_address: 0x000000b50c68ce80
kernel: {37}[Hardware Error]: node: 1 card: 4 module: 0 rank: 0 bank: 1 device: 14 row: 58165 column: 816
kernel: {37}[Hardware Error]: error_type: 2, single-bit ECC
kernel: {37}[Hardware Error]: DIMM location: CPU 2 DIMM 30
Recoverable error:
kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
kernel: {3}[Hardware Error]: event severity: recoverable
kernel: {3}[Hardware Error]: Error 0, type: recoverable
kernel: {3}[Hardware Error]: fru_text: B1
kernel: {3}[Hardware Error]: section_type: memory error
kernel: {3}[Hardware Error]: error_status: 0x0000000000000400
kernel: {3}[Hardware Error]: physical_address: 0x000000393cfe5040
kernel: {3}[Hardware Error]: node: 2 card: 0 module: 0 rank: 0 bank: 3 device: 0 row: 34719 column: 320
kernel: {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
Fatal error:
kernel: BERT: Error records from previous boot:
kernel: [Hardware Error]: event severity: fatal
kernel: [Hardware Error]: Error 0, type: fatal
kernel: [Hardware Error]: fru_text: DIMM B5
kernel: [Hardware Error]: section_type: memory error
kernel: [Hardware Error]: error_status: 0x0000000000000400
kernel: [Hardware Error]: physical_address: 0x000000393d7e4040
kernel: [Hardware Error]: node: 2 card: 4 module: 0 rank: 0 bank: 3 device: 0 row: 34743 column: 256
Steps to test the new metrics.
$ echo "kernel: {37}[Hardware Error]: event severity: corrected" | sudo tee /dev/kmsg
$ echo "kernel: {3}[Hardware Error]: event severity: recoverable" | sudo tee /dev/kmsg
$ echo "kernel: [Hardware Error]: event severity: fatal" | sudo tee /dev/kmsg
Expected metrics are as below:
$ curl localhost:20257/metrics
problem_counter{reason="CperHardwareErrorCorrected"} 1
problem_counter{reason="CperHardwareErrorFatal"} 1
problem_counter{reason="CperHardwareErrorRecoverable"} 1
...
problem_gauge{reason="CperHardwareErrorFatal",type="CperHardwareErrorFatal"} 1
Signed-off-by: Jian Wen <wenjianhn@gmail.com>
2025-03-12 11:00:50 +08:00
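A minimal sketch of how the three severity lines above could be mapped to the reason names shown in the expected metrics. The regular expression and the helper names `classify_cper` and `count_reasons` are assumptions made for illustration; the actual pattern in the monitor configuration is not shown in this log.

```python
import re
from collections import Counter

# Assumed pattern: the severity follows "[Hardware Error]: event severity: ".
# The "{N}" record counter is optional because the BERT (fatal) example omits it.
CPER_SEVERITY = re.compile(
    r"(?:\{\d+\})?\[Hardware Error\]: event severity: (corrected|recoverable|fatal)"
)

# Reason names copied from the expected metrics in the commit message.
REASONS = {
    "corrected": "CperHardwareErrorCorrected",
    "recoverable": "CperHardwareErrorRecoverable",
    "fatal": "CperHardwareErrorFatal",
}


def classify_cper(line):
    """Return the reason name for a severity line, or None if it does not match."""
    match = CPER_SEVERITY.search(line)
    return REASONS[match.group(1)] if match else None


def count_reasons(lines):
    """Count matching lines per reason (hypothetical stand-in for problem_counter)."""
    return Counter(r for r in map(classify_cper, lines) if r is not None)
```

Only the "event severity" line is matched, so the other lines of the same error record do not inflate the count.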
Jian Wen
2e15606dda
Monitor XFS shutdown
...
Related kernel error messages are as below.
kernel: XFS (dm-4): Internal error xfs_iunlink_remove at line 2038 of file fs/xfs/xfs_inode.c. Caller xfs_ifree+0x33/0x130 [xfs]
kernel: XFS (dm-4): Corruption detected. Unmount and run xfs_repair
kernel: XFS (dm-4): xfs_inactive_ifree: xfs_ifree returned error -117
kernel: XFS (dm-4): xfs_do_force_shutdown(0x1) called from line 1788 of file fs/xfs/xfs_inode.c. Return address = 000000009d022bf1
kernel: XFS (dm-4): I/O Error Detected. Shutting down filesystem
kernel: XFS (dm-4): Please umount the filesystem and rectify the problem(s)
Signed-off-by: Jian Wen <wenjianhn@gmail.com>
2024-11-14 15:46:00 +08:00
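A minimal sketch of matching the shutdown message quoted above, assuming the `xfs_do_force_shutdown(...)` line is the one of interest. The pattern and the helper name `find_xfs_shutdown` are hypothetical; the rule actually added to the kernel monitor configuration is not shown in this log.

```python
import re

# Assumed pattern: "XFS (<device>): xfs_do_force_shutdown(<hex flags>) called ..."
XFS_SHUTDOWN = re.compile(
    r"XFS \(([^)]+)\): xfs_do_force_shutdown\((0x[0-9a-fA-F]+)\) called"
)


def find_xfs_shutdown(line):
    """Return (device, flags) for a forced-shutdown line, or None otherwise."""
    match = XFS_SHUTDOWN.search(line)
    if match is None:
        return None
    # Flags are printed in hex by the kernel; convert them to an int.
    return match.group(1), int(match.group(2), 16)
```

Capturing the device name makes it possible to report which filesystem was shut down rather than only that a shutdown happened.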
Kubernetes Prow Robot
0f4d8b96c5
Merge pull request #961 from smileusd/upstream_add_black_list_in_log_watcher
...
add black list to avoid spending too much effort on translation in file log watcher
2024-10-16 16:09:03 +01:00
tashen
3a386a659e
add skip list to avoid spending too much effort on translation in file log watcher
2024-10-16 10:56:15 +08:00
Veer Singh
ee955f9170
Move ReadonlyFilesystem to separate config file
...
Moved the ReadonlyFilesystem Node Condition to a separate plugin
configuration file and updated NPD to contain the appropriate new flags.
2024-10-09 00:20:49 -07:00
baihongru
daf4f4da3e
Update abrt-adaptor.json
...
Use the unified name KernelOops
2024-08-16 16:40:05 +08:00
Eric Lin
ce82f2a81b
Update config/systemd-monitor.json to match all systemd StatusUnitFormat
2024-01-18 17:16:05 +00:00
Eric Lin
c225435bea
Use --revert-pattern to discount proactive restarts
2024-01-17 18:24:24 +00:00
Eric Lin
0fba03ef7a
Make pattern match all systemd StatusUnitFormat
2024-01-14 20:02:13 +00:00
Antonio Ojea
552b530e0b
custom plugin to monitor iptables versions rules
...
iptables has two kernel backends, legacy and nft.
Quoting https://developers.redhat.com/blog/2020/08/18/iptables-the-two-variants-and-their-relationship-with-nftables
> It is also important to note that while iptables-nft
> can supplant iptables-legacy, you should never use them simultaneously.
However, we don't want to block node operations for this reason, as
there is not enough evidence that this is causing big issues in the
wild, so we just signal and warn about this situation.
Once we have more information we can revisit this decision and
keep it as is or make it permanent.
2023-12-21 09:34:04 +00:00
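A minimal sketch of the check described above, assuming the two inputs are the text produced by `iptables-legacy-save` and `iptables-nft-save`. The helper names `count_rules` and `mixed_backends` are hypothetical, and the logic of the actual plugin script is not shown in this log.

```python
def count_rules(save_output):
    """Count rule lines in iptables-save style output (rules start with "-A ")."""
    return sum(1 for line in save_output.splitlines() if line.startswith("-A "))


def mixed_backends(legacy_save, nft_save):
    """Return True when both backends hold rules, which is the case to warn about."""
    return count_rules(legacy_save) > 0 and count_rules(nft_save) > 0
```

Table headers, chain policies, and `COMMIT` lines are ignored, so an empty ruleset in one backend does not trigger the warning.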
Jarkko Sonninen
07900633cb
Add disk and memory percent_used
2023-10-28 16:03:48 +03:00
Yordis Prieto Lazo
0842910049
chore: fix misspelling
2022-12-18 22:58:07 -05:00
Mike Miranda
1471f74d98
Add ExcludeInterfaceRegexp to Net Dev monitor
2022-06-15 23:22:38 +00:00
Julie Qi
fe09e416bd
remove aufs hung check
2021-07-30 13:53:25 -07:00
Kubernetes Prow Robot
e349323507
Merge pull request #539 from smileusd/health_check
...
improve health-checker
2021-06-25 09:48:45 -07:00
Jeremy Edwards
d52844ae67
Add HCS empty layer error reporting.
2021-06-22 17:06:42 +00:00
michelletandya
caf2bad7b6
config/windows-defender-monitor.json
2021-05-24 20:08:47 +00:00
tashen
a3b928467e
add loopbacktime to reduce time of journalctl call
2021-05-19 13:55:55 +08:00
Kubernetes Prow Robot
9c541692ee
Merge pull request #557 from vteratipally/adfad
...
Make sure the path to known-modules.json is relative
2021-05-14 14:39:59 -07:00
Varsha Teratipally
a79b87ce7e
Make sure the path to known-modules.json is relative to the system-stats-monitor.json file
2021-05-14 21:14:55 +00:00
michelletandya
01fa5b3afd
Add windows defender problem detection custom plugin
2021-05-12 20:28:33 +00:00
Jeremy Edwards
d4933875ed
Add support for basic system metrics for Windows.
2021-05-10 21:58:38 +00:00
michelletandya
01cd8dd08c
Add healthChecker functionality for kube-proxy service
2021-05-05 17:27:58 +00:00
michelletandya
da15eb9afe
Detect containerD errors and failures.
2021-04-29 23:47:04 +00:00
michelletandya
c4e5400ed6
separate linux/windows health checker files.
2021-04-26 21:45:05 +00:00
michelletandya
344daabaa7
Update windows containerd config file to run without errors
2021-03-30 23:26:06 +00:00
Jeremy Edwards
4181ece888
Windows Support: Fix Build Regressions, Tests Pass
2021-03-14 10:24:45 -07:00
Kubernetes Prow Robot
06b5503348
Merge pull request #530 from goushicui/master
...
add memory read error
2021-02-18 07:46:51 -08:00
goushicui
7ecb76f31a
add memory read error
2021-02-09 14:08:18 +08:00
Karan Goel
8648fe265a
add metric for per-cpu, per-stage timing
2021-01-29 08:46:39 -08:00
Kubernetes Prow Robot
e34e2763cf
Merge pull request #519 from Random-Liu/fix-indention
...
Fix system-stats-monitor config indentation.
2021-01-28 23:47:41 -08:00
Lantao Liu
144fad7706
Fix system-stats-monitor config indentation.
2021-01-28 22:59:47 -08:00
Lantao Liu
c2ad21a380
Add containerd health checker config.
2021-01-28 22:46:55 -08:00
Karan Goel
2a2bab3d28
Add network interface stats
...
We do not have to collect these often, so for now set the collection
interval to 120s (even though the Stackdriver exporter is still set to
export every 60s).
2021-01-20 08:56:34 -08:00
Kubernetes Prow Robot
c2d7a7be62
Merge pull request #513 from karan/cpu_activity_metrics
...
add metrics for process stats
2021-01-19 18:38:07 -08:00
Jeremy Edwards
adc587f222
Support filelog watching in Windows.
2021-01-13 17:16:46 +00:00
Karan Goel
71098097c0
add metrics for process stats
...
Tested on a COS VM:
```
$ curl -s localhost:20257/metrics | grep "^system_"
system_interrupts_total{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 8.759236e+07
system_processes_total{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 692506
system_procs_blocked{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 0
system_procs_running{kernel_version="5.4.49+",os_version="cos 85-13310.1041.24"} 2
```
2021-01-13 09:14:08 -08:00
varsha teratipally
f89f620909
added new line in the known_modules.json
2021-01-08 23:25:02 +00:00
varsha teratipally
eb38b4b598
added a new metric to retrieve os features like unknown modules
2021-01-08 21:52:16 +00:00
Kubernetes Prow Robot
59536256e3
Merge pull request #475 from vteratipally/boot_size_disk
...
catching hung task with pattern like "tasks airflow scheduler: *"
2020-11-18 14:42:50 -08:00
vteratipally
0c258bb704
Update kernel-monitor.json
2020-11-17 13:38:07 -08:00
Kubernetes Prow Robot
cff4a54d6a
Merge pull request #488 from vteratipally/io_errors
...
Add detection logic for I/O errors
2020-11-16 14:06:36 -08:00
Kubernetes Prow Robot
2d53c0a2a6
Merge pull request #481 from tosi3k/oom-regex-fix
...
Adapt OOMKilling pattern to old and new Linux kernels
2020-11-16 14:06:20 -08:00
Karan Goel
925ea7393c
Collect CPU load averages in a separate metric
2020-11-09 09:41:52 -08:00
varsha teratipally
f01b5e5cfe
Detect I/O errors
2020-11-06 03:48:33 +00:00
Antoni Zawodny
6b650e785e
Adapt OOMKilling pattern to old and new Linux kernels
2020-10-22 15:12:26 +02:00
varsha teratipally
f984abbe2e
catching hung task with pattern like "task airflow scheduler": some of the events related to hung tasks are not identified
2020-10-08 23:04:15 +00:00