1) Add lookback support to kernel monitor. After it starts, kernel monitor checks older logs to detect problems that happened before the last node reboot.
2) Add `lookback` and `startPattern` to the kernel monitor configuration.
   * `lookback` specifies how far back in time kernel monitor should look.
   * `startPattern` specifies which log line indicates that the node has started. Kernel monitor clears all current node conditions once it finds a node start log. This ensures that old problems don't change the node condition.
3) Add support for kernel panic monitoring: null pointer and divide-by-zero kernel panics are surfaced as events. Kernel monitor usually reports these events during the lookback phase.
node-problem-detector
node-problem-detector aims to make various node problems visible to the upstream layers in the cluster management stack. It is a DaemonSet that detects node problems and reports them to apiserver. It currently runs as a Kubernetes Addon enabled by default in GCE clusters.
Background
Many node problems can affect the pods running on a node, such as:
- Hardware issues: Bad CPU, memory or disk;
- Kernel issues: Kernel deadlock, corrupted file system;
- Container runtime issues: Unresponsive runtime daemon;
- ...
Currently these problems are invisible to the upstream layers in the cluster management stack, so Kubernetes will continue scheduling pods to the bad nodes.
To solve this problem, we introduced a new daemon, node-problem-detector, to collect node problems from various daemons and make them visible to the upstream layers. Once upstream layers have visibility into these problems, we can discuss the remedy system.
Problem API
node-problem-detector uses Event and NodeCondition to report problems to apiserver.
- NodeCondition: A permanent problem that makes the node unavailable for pods should be reported as NodeCondition.
- Event: A temporary problem that has limited impact on pods but is informative should be reported as Event.
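The choice between the two report types can be written as a small decision function. The sketch below is only an illustration: the `Problem` type, its `Permanent` field, the `reportAs` helper, and the `KernelOops` reason are hypothetical names, not part of node-problem-detector or the Kubernetes API.

```go
package main

import "fmt"

// Problem is a hypothetical description of something a problem daemon found.
type Problem struct {
	Reason    string
	Permanent bool // true if the node stays unavailable for pods until fixed
}

// reportAs returns which kind of API object the problem should be reported as.
func reportAs(p Problem) string {
	if p.Permanent {
		return "NodeCondition"
	}
	return "Event"
}

func main() {
	problems := []Problem{
		{Reason: "KernelDeadlock", Permanent: true},
		{Reason: "KernelOops", Permanent: false}, // illustrative reason name
	}
	for _, p := range problems {
		fmt.Printf("%s -> %s\n", p.Reason, reportAs(p))
	}
}
```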
Problem Daemon
A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific kind of node problem and reports it to node-problem-detector.
A problem daemon could be:
- A tiny daemon designed for a dedicated Kubernetes use case.
- An existing node health monitoring daemon integrated with node-problem-detector.
Currently, a problem daemon is running as a goroutine in the node-problem-detector binary. In the future, we'll separate node-problem-detector and problem daemons into different containers, and compose them with pod specification.
List of supported problem daemons:
| Problem Daemon | NodeCondition | Description |
|---|---|---|
| KernelMonitor | KernelDeadlock | A problem daemon that monitors the kernel log and reports problems according to predefined rules. |
Usage
Build Image
Run `make` in the top directory. It will:
- Build the binary.
- Build the docker image. The binary and `config/` are copied into the docker image.
- Upload the docker image to the registry. By default, the image will be uploaded to `gcr.io/google_containers`. It's easy to modify the `Makefile` to push the image to another registry.
Start DaemonSet
- Create a file node-problem-detector.yaml with the following yaml.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector
spec:
  template:
    spec:
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.2
        imagePullPolicy: Always
        securityContext:
          privileged: true
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        # Config `log` to your system log directory
        hostPath:
          path: /var/log/
- Edit node-problem-detector.yaml to fit your environment: Set the `log` volume to your system log directory. (Used by KernelMonitor)
- Create the DaemonSet with `kubectl create -f node-problem-detector.yaml`
- If needed, you can use a ConfigMap to overwrite the `config/` directory.