452 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
140a850b63 Merge pull request #404 from xueweiz/queue
Fix disk metrics unit and queue_length calculation
2020-01-06 13:14:16 -08:00
Xuewei Zhang
fa7a3d7df1 Fix disk metrics unit and queue_length calculation 2020-01-02 17:19:38 -08:00
Kubernetes Prow Robot
0d0bba94e5 Merge pull request #402 from gmemcc/master
Ignore first collected disk stats to prevent metric distortion
2019-12-18 11:57:57 -08:00
Alex Wong
5a4ac81186 Only disk_avg_queue_len is distorted on first collection 2019-12-12 14:39:29 +08:00
Alex Wong
3d10c892a2 Ignore first collected disk stats to prevent metric distortion 2019-12-11 11:14:01 +08:00
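The first-collection distortion addressed in the two commits above can be illustrated with a small sketch. It assumes the queue length is derived from the difference between two cumulative counters, so the first reading has no valid baseline. The names `diskSample`, `queueLenTracker` and `observe` are hypothetical and are not taken from the repository.

```go
package main

import "fmt"

// diskSample is a hypothetical snapshot of a cumulative I/O counter.
type diskSample struct {
	weightedIOMillis uint64 // cumulative weighted I/O time, in milliseconds
}

// queueLenTracker derives an average queue length from successive samples.
type queueLenTracker struct {
	prev    diskSample
	hasPrev bool
}

// observe returns the average queue length over the interval and true.
// For the first sample (no baseline) or a zero-length interval it only
// records the sample and returns 0 and false.
func (t *queueLenTracker) observe(cur diskSample, intervalMillis uint64) (float64, bool) {
	if !t.hasPrev || intervalMillis == 0 {
		t.prev, t.hasPrev = cur, true
		return 0, false
	}
	delta := cur.weightedIOMillis - t.prev.weightedIOMillis
	t.prev = cur
	return float64(delta) / float64(intervalMillis), true
}

func main() {
	var t queueLenTracker

	// First collection: the counter already holds a large cumulative value,
	// so no metric is reported.
	_, ok := t.observe(diskSample{weightedIOMillis: 500000}, 1000)
	fmt.Println(ok) // false

	// Second collection: 2000 ms of weighted I/O time over a 1000 ms interval.
	v, ok := t.observe(diskSample{weightedIOMillis: 502000}, 1000)
	fmt.Println(v, ok) // 2 true
}
```

Metrics that are read directly rather than computed as a delta would not need this guard, which is consistent with the follow-up commit limiting the change to disk_avg_queue_len.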
Kubernetes Prow Robot
7819ffda7c Merge pull request #400 from xueweiz/patch-1
Install ginkgo executable in test/build.sh
2019-12-10 11:32:07 -08:00
Xuewei Zhang
6f27c80053 Install ginkgo executable in test/build.sh
ginkgo executable is used in e2e test to support parallelism.
Make sure to install it before running e2e test in the presubmit and CI jobs.
2019-12-06 22:35:33 -08:00
Kubernetes Prow Robot
9d584df4c6 Merge pull request #387 from xueweiz/test-pr
Add a few behavioral e2e tests
2019-12-06 15:13:54 -08:00
Xuewei Zhang
7d28dde8d8 Add e2e test for OOM kill and Docker hung
Also fixes two minor bugs:

1. Change default Boskos wait timeout to 2 minutes.
This is because the current test timeout is configured to 10 minutes.
Running each test case takes 1-2 minutes, and each node will run 1-2 test
cases. 5 minutes timeout on waiting for Boskos may cause a test timeout,
which we want to avoid.

2. Create artifact subdir with 0755 rather than 0644.
Because execution bit should be set on the directories.
2019-12-06 14:49:17 -08:00
Xuewei Zhang
8b98d08b5f Record scp command failure message to help debugging 2019-12-06 14:49:17 -08:00
Xuewei Zhang
dd37dfe12c Add e2e tests for reporting filesystem problems
Also added support for running e2e tests in parallel.
2019-12-06 14:49:17 -08:00
Xuewei Zhang
b3f811d171 Add detection for ext4 errors 2019-12-06 14:49:17 -08:00
Xuewei Zhang
5da72e86bb Add problem maker to simulate problems for e2e test 2019-12-06 14:49:17 -08:00
Kubernetes Prow Robot
7dc84e8d74 Merge pull request #395 from yuzhiquan/patch
Using time.Since(t) instead of t.Sub(time.Now())
2019-12-05 17:28:49 -08:00
yuzhiquan
9c24be2da4 cleanup: using time.Since(t) instead of t.Sub(time.Now()) 2019-12-05 18:57:53 +08:00
Xuewei Zhang
40cb3e0fec Vendor changes for gomega 2019-12-04 17:17:53 -08:00
Kubernetes Prow Robot
11e35096c4 Merge pull request #394 from yuzhiquan/master
fix: modify typo
2019-12-02 23:46:57 -08:00
yuzhiquan
b458f0d028 fix: modify typo 2019-12-03 15:21:57 +08:00
Kubernetes Prow Robot
9c3f17478b Merge pull request #393 from jiayuc/fix-make-test
fix make test early failure
2019-11-29 23:45:03 -08:00
Jiayu Chen
2368321490 fix make test early failure 2019-11-29 17:54:42 -08:00
Kubernetes Prow Robot
3a41fc2fc3 Merge pull request #392 from arekkusu/origin/patch-2
Improve systemctl check, style + cleanup
2019-11-29 01:33:03 -08:00
Kubernetes Prow Robot
f535592df0 Merge pull request #369 from arekkusu/patch-1
Support "nf_conntrack", check for 90% full
2019-11-29 01:03:04 -08:00
Alexandre
4df720c2a0 Improve systemctl check, style + cleanup
- Use `systemctl is-active` to check if service is running
  - Cleaner than `grep` on `systemctl status` output
  - Return success means service is running/active
  - Return failure means not running which could be due to
    stopped/failed service or that service does not exist

- Use `command -v` instead of `which`
  Ref: https://github.com/koalaman/shellcheck/wiki/SC2230

- Follow Google "Shell Style Guide": indent, use "readonly"

- Minor: Rephrase comment, avoid all caps
2019-11-29 14:14:19 +09:00
Alexandre
a91b568149 Support "nf_conntrack", check 90% full, style
- Script was checking for "ip_conntrack_..." which was replaced by "nf_conntrack_..." on newer systems. Now supports both.

- Return failure ("not ok") when table is more than 90% full.
  - Not sure what value is best here but I think that is better than when the table is full.
    Otherwise we might end up with a value close to the max or bouncing around.

- Replaced cat by "$(< file )" to avoid calling external command
- Follow Google "Shell Style Guide": 2 space indent, use preferred "[[ test ]]", add "readonly"
- Include current connection usage in output message
2019-11-29 13:20:37 +09:00
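The 90% threshold described in the commit above amounts to a simple comparison. The original change is a shell script; this Go sketch only illustrates the logic, with hypothetical names (`parseCounter`, `conntrackStatus`) and made-up sample values standing in for the kernel counters.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fullThresholdPercent is the usage level above which the check fails.
const fullThresholdPercent = 90

// parseCounter parses the textual contents of a kernel counter file,
// which normally end with a newline.
func parseCounter(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// conntrackStatus returns false when the table is more than 90% full.
// The message always includes the current usage.
func conntrackStatus(count, limit uint64) (bool, string) {
	if limit == 0 {
		return false, "not ok: conntrack limit is zero"
	}
	usage := fmt.Sprintf("%d/%d connections tracked", count, limit)
	// Integer arithmetic avoids floating-point rounding at the boundary.
	if count*100 > limit*fullThresholdPercent {
		return false, "not ok: " + usage
	}
	return true, "ok: " + usage
}

func main() {
	// Sample values; a real check would read them from the kernel.
	count, err := parseCounter("60000\n")
	if err != nil {
		panic(err)
	}
	limit, err := parseCounter("65536\n")
	if err != nil {
		panic(err)
	}
	_, msg := conntrackStatus(count, limit)
	fmt.Println(msg)
}
```

Failing before the table is completely full gives some headroom, which matches the rationale in the commit message about values hovering near the maximum.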
Kubernetes Prow Robot
8704ec0c42 Merge pull request #370 from hardikdr/patch-1
Update node-problem-detector-config
2019-11-27 21:33:03 -08:00
Kubernetes Prow Robot
d000c1b060 Merge pull request #390 from xueweiz/make
Fix build tags manipulation in Makefile
2019-11-26 15:13:20 -08:00
Xuewei Zhang
5e55ef89f1 Make log-counter respect ENABLE_JOURNALD 2019-11-26 13:58:10 -08:00
Kubernetes Prow Robot
ef30a1fdd1 Merge pull request #391 from xueweiz/trim
Trim go.mod
2019-11-26 13:25:20 -08:00
Xuewei Zhang
ee94f8b52a Trim go.mod
This is done via running `GO111MODULE=on go mod tidy`.
2019-11-26 12:32:02 -08:00
Xuewei Zhang
3c80676e94 Fix build tags manipulation in Makefile 2019-11-26 12:12:23 -08:00
Kubernetes Prow Robot
ff95d2e758 Merge pull request #382 from zhangxiaoyu-zidif/patch-3
typo: delete redundant description.
2019-11-23 01:49:49 -08:00
Kubernetes Prow Robot
3ace8c2984 Merge pull request #379 from CuZn13/master
add a case is ID="centos"
2019-11-13 17:47:35 -08:00
Kubernetes Prow Robot
e218562242 Merge pull request #381 from xueweiz/maintainer
Add xueweiz as maintainer.
2019-11-12 22:16:11 -08:00
Xiaoyu Zhang(Tim)
7a5cecaa1c typo: delete redundant description. 2019-11-08 08:55:53 +08:00
Xuewei Zhang
39f401311b Add xueweiz as maintainer. 2019-11-05 15:54:49 -08:00
tongxin21
d5cb44646e add a unit test for parsing the "/etc/os-release" of CentOS
add a newline character at the end
2019-11-01 13:34:22 +08:00
tongxin21
9b9f18a7ed add a case is ID="centos" 2019-10-28 19:09:15 +08:00
Kubernetes Prow Robot
ad76b93208 Merge pull request #375 from Random-Liu/fix-channel-close-issue
Properly close channel when monitors exit.
v0.8.0
2019-10-25 14:37:38 -07:00
Lantao Liu
be7cc78aa0 Properly close channel when monitor exits.
Signed-off-by: Lantao Liu <lantaol@google.com>
2019-10-25 14:11:39 -07:00
Kubernetes Prow Robot
705cb01e0c Merge pull request #339 from wenjun93/logmonitor
avoid log channel closed caused endless loop
2019-10-25 11:27:39 -07:00
Kubernetes Prow Robot
bac3429522 Merge pull request #359 from gmemcc/hotfix-closed-channel
fix close of closed channel
2019-10-24 20:57:38 -07:00
wenjun93
4a4ebc7097 avoid log channel closed caused endless loop 2019-10-25 11:43:49 +08:00
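The channel fixes in the entries above touch two common Go pitfalls: closing a channel twice panics, and receiving from a closed channel returns immediately with a zero value, which can turn a receive loop into a busy loop. The sketch below shows one conventional way to handle both, using `sync.Once` and the two-value receive. It is not the repository's implementation; `monitor`, `stop` and `drain` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// monitor is a hypothetical producer that owns its output channel.
type monitor struct {
	out       chan string
	closeOnce sync.Once
}

func newMonitor() *monitor {
	return &monitor{out: make(chan string, 4)}
}

// stop closes the output channel. sync.Once makes repeated calls safe,
// avoiding the "close of closed channel" panic.
func (m *monitor) stop() {
	m.closeOnce.Do(func() { close(m.out) })
}

// drain receives until the channel is closed and returns how many
// values it saw. Checking ok prevents spinning on a closed channel.
func drain(ch <-chan string) int {
	n := 0
	for {
		_, ok := <-ch
		if !ok {
			return n
		}
		n++
	}
}

func main() {
	m := newMonitor()
	m.out <- "kernel log line"
	m.out <- "another log line"
	m.stop()
	m.stop() // second call is a no-op

	fmt.Println(drain(m.out)) // 2
}
```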
Kubernetes Prow Robot
a999207a56 Merge pull request #367 from grosser/grosser/unwrap
untangle plugin runner a bit
2019-10-24 20:29:38 -07:00
Kubernetes Prow Robot
2c14eb1075 Merge pull request #373 from wojtek-t/decrease_heartbeat_frequency
Decrease default frequency of forced heartbeats to 5m
2019-10-24 10:43:12 -07:00
wojtekt
43728fb0fc Decrease default frequency of forced heartbeats to 5m 2019-10-24 10:39:01 +02:00
Michael Grosser
3be50a088a untangle plugin runner a bit
add some docs and make it clearer what is actually going on
(parallel rule execution on start and then on timer)
2019-10-10 15:46:04 -07:00
Kubernetes Prow Robot
c2d850ca10 Merge pull request #371 from rhysemmas/update-readme
Update README
2019-10-08 09:25:12 -07:00
rhysemmas
80e3428d75 Update background section 2019-10-08 14:14:53 +01:00
Hardik Dodiya
3afca2f0e4 Update node-problem-detector-config
Seems there is an inconsistency between /config and /deployment examples for kernel-monitor. 
Updating /deployment example accordingly.
2019-10-08 12:04:57 +05:30
Kubernetes Prow Robot
a1a7234878 Merge pull request #363 from grosser/grosser/old
remove kubernetes 1.8 support
2019-09-28 19:27:37 -07:00