OpenShift Monitoring and Performance. Prometheus, Grafana, APMs and more

Monitoring

OpenShift Cluster Monitoring Built-in solutions

OpenShift 3.11 Metrics and Logging

OpenShift Container Platform Monitoring ships with a Prometheus instance for cluster monitoring and a central Alertmanager cluster. In addition to Prometheus and Alertmanager, OpenShift Container Platform Monitoring also includes a Grafana instance as well as pre-built dashboards for cluster monitoring troubleshooting. The Grafana instance that is provided with the monitoring stack, along with its dashboards, is read-only.

| Monitoring Component | Release | URL |
|---|---|---|
| ElasticSearch | 5 | OpenShift 3.11 Metrics & Logging |
| Fluentd | 0.12 | OpenShift 3.11 Metrics & Logging |
| Kibana | 5.6.13 | kibana 5.6.13 |
| Prometheus | 2.3.2 | OpenShift 3.11 Prometheus Cluster Monitoring |
| Prometheus Operator | | Prometheus Operator technical preview |
| Prometheus Alert Manager | 0.15.1 | OpenShift 3.11 Configuring Prometheus Alert Manager |
| Grafana | 5.2.3 | OpenShift 3.11 Prometheus Cluster Monitoring |

Prometheus and Grafana

openshift3 Monitoring

Custom Grafana Dashboard for OpenShift 3.11

By default, the OpenShift 3.11 Grafana is a read-only instance. Many organizations may want to add new custom dashboards. This custom Grafana interacts with the existing Prometheus and also adds all the out-of-the-box dashboards, plus a few more interesting dashboards that may be needed for day-to-day operation. The custom Grafana pod uses OpenShift OAuth to authenticate users and assigns the "Admin" role to all users, so that users can create their own dashboards for additional monitoring.

Getting Started with Custom Dashboarding on OpenShift using Grafana. This repository contains scaffolding and automation for developing a custom dashboarding strategy on OpenShift using the OpenShift Monitoring stack.

Capacity Management Grafana Dashboard

This repo adds a capacity management Grafana dashboard. The intent of this dashboard is to answer a single question: "Do I need a new node?". We believe this is the most important question when setting up a capacity management process. We are aware that it is not the only question such a process may need to answer; thus, this dashboard should be considered a starting point for organizations building their capacity management process.

Software Delivery Metrics Grafana Dashboard

This repo contains tooling to help organizations measure Software Delivery and Value Stream metrics.

Prometheus for OpenShift 3.11

This repo contains example components for running either an operational Prometheus setup for your OpenShift cluster, or deploying a standalone secured Prometheus instance that you configure yourself.

OpenShift 4

OpenShift Container Platform includes a pre-configured, pre-installed, and self-updating monitoring stack that is based on the Prometheus open source project and its wider eco-system. It provides monitoring of cluster components and includes a set of alerts to immediately notify the cluster administrator about any occurring problems and a set of Grafana dashboards. The cluster monitoring stack is only supported for monitoring OpenShift Container Platform clusters.

OpenShift Cluster Monitoring components cannot be extended, since they are read-only.

Monitor your own services (technology preview): the existing monitoring stack can be extended so you can configure monitoring for your own services.
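With the user workload monitoring preview, a service is typically brought under monitoring by creating a `ServiceMonitor` custom resource for the Prometheus Operator. A minimal sketch (the `my-app`/`my-namespace` names and the `metrics` port are hypothetical and must match your own Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app            # hypothetical name
  namespace: my-namespace # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: my-app         # must match the labels on the target Service
  endpoints:
    - port: metrics       # named Service port exposing /metrics
      interval: 30s
```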

| Monitoring Component | Deployed By Default | OCP 4.1 | OCP 4.2 | OCP 4.3 | OCP 4.4 |
|---|---|---|---|---|---|
| ElasticSearch | No | 5.6.13.6 | | | |
| Fluentd | No | 0.12.43 | | | |
| Kibana | No | 5.6.13 | | | |
| Prometheus | Yes | 2.7.2 | 2.14.0 | 2.15.2 | |
| Prometheus Operator | Yes | 0.34.0 | 0.35.1 | | |
| Prometheus Alert Manager | Yes | 0.16.2 | 0.19.0 | 0.20.0 | |
| kube-state-metrics | Yes | 1.8.0 | 1.9.5 | | |
| Grafana | Yes | 5.4.3 | 6.2.4 | 6.4.3 | 6.5.3 |

Prometheus

prometheus architecture

Prometheus Storage

  • Provides key-value labeling and time series. The Prometheus documentation itself explains how on-disk storage is managed (Prometheus Time-Series DB). Data ingestion is grouped into two-hour blocks, where each block is a directory containing one or more "chunk files" (the data), plus a metadata file and an index file:
  • On-disk data layout (Prometheus Time-Series DB):

```
./data/01BKGV7JBM69T2G1BGBGM6KB12
./data/01BKGV7JBM69T2G1BGBGM6KB12/meta.json
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000002
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000001
```

  • A background process compacts the two-hour blocks into larger ones.
  • Data can also be stored in other time-series database solutions such as InfluxDB.

Scalability, High Availability (HA) and Long-Term Storage

  • Prometheus was designed to be easy to deploy. It is extremely easy to get it running, collect some metrics, and start building our own monitoring tooling. Things get complicated when you try to operate at a considerable scale.
  • To understand whether this will be a problem, it is worth asking the following questions:
    • How many metrics can the monitoring system ingest, and how many are needed?
    • What is the cardinality of the metrics? Cardinality is the number of labels each metric can have. It is a very common concern with metrics from dynamic environments, where containers are assigned a different ID or name every time they are launched, restarted, or moved between nodes (as is the case with Kubernetes).
    • Is High Availability (HA) required?
    • For how long must the metrics be kept, and at what resolution?
  • Implementing HA is laborious because the clustering functionality requires adding third-party plugins to the Prometheus server; backups and restores must be dealt with, and storing metrics for an extended period of time will make the database grow exponentially. Prometheus servers provide persistent storage, but Prometheus was not created for distributed metric storage across multiple cluster nodes with replication and self-healing (as is the case with Kubernetes). This is known as "long-term" storage and is currently a requirement in a few use cases, for example: capacity planning, to monitor how the infrastructure needs to evolve; chargebacks, to bill different teams or departments for their specific use of the infrastructure; usage trend analysis; or compliance with regulations for specific verticals such as banking, insurance, etc.
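To put numbers on the ingestion and cardinality questions above, recent Prometheus releases expose TSDB statistics over the HTTP API at `/api/v1/status/tsdb`. A sketch, assuming that endpoint and its `headStats`/`seriesCountByMetricName` fields exist in your Prometheus version (check against your release):

```python
import json
import urllib.request

def top_metrics_by_series(stats):
    """From the /api/v1/status/tsdb payload, extract the total number
    of in-memory series and the highest-cardinality metric names."""
    data = stats["data"]
    total = data["headStats"]["numSeries"]
    top = [(m["name"], m["value"]) for m in data["seriesCountByMetricName"]]
    return total, top

def fetch_tsdb_stats(base_url):
    """Query a live Prometheus server (base_url is hypothetical,
    e.g. http://localhost:9090)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)
```

A sudden jump in `numSeries`, or one metric dominating `seriesCountByMetricName`, is the usual signature of a cardinality problem.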

Storage Solutions for Prometheus

  • Prometheus TSDB
  • Cortex: Provides horizontally scalable, highly available, multi-tenant, long term storage for Prometheus. Cortex allows for storing time series data in a key-value store like Cassandra, AWS DynamoDB, or Google BigTable. It offers a Prometheus compatible query API, and you can push metrics into a write endpoint. This makes it best suited for cloud environments and multi-tenant scenarios like service providers building hosted and managed platforms.
  • Thanos: Open source, highly available Prometheus setup with long term storage capabilities.
    • Thanos stores time series data in an object store like AWS S3, Google Cloud Storage, etc. Thanos pushes metrics through a side-car container from each Prometheus server through the gRPC store API to the query service in order to provide a global query view.
    • github.com/ruanbekker: Thanos Cluster Setup. How to deploy an HA Prometheus setup with unlimited data retention capabilities on AWS S3 with Thanos.
  • InfluxDB: An open-source time series database (TSDB) developed by InfluxData. It is written in Go and optimized for fast, high-availability storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time analytics. It also has support for processing data from Graphite.
  • M3: An open source, large-scale metrics platform developed by Uber. It has its own time series database, M3DB. Like Thanos, M3 also uses a side-car container to push the metrics to the DB. In addition, it supports metric deduplication and merging, and provides distributed query support. Although it's exciting to see attempts to address the challenges of running Prometheus at scale, these are very young projects that are not widely used yet.
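Back-ends with a push endpoint (such as Cortex, or Thanos Receive) are typically wired in through Prometheus's `remote_write` configuration; the Thanos sidecar model instead ships finished TSDB blocks to object storage. A minimal `remote_write` sketch (the endpoint URL and credentials are hypothetical):

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: https://cortex.example.com/api/v1/push   # hypothetical push endpoint
    basic_auth:
      username: tenant-1                          # hypothetical multi-tenant credentials
      password_file: /etc/prometheus/remote-write-password
```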

Collectors. Software exposing Prometheus metrics

Prometheus Exporters. Plug-in architecture and extensibility with Prometheus Exporters (collectors)

Prometheus Exporters Development. Node Exporter
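An exporter is just an HTTP endpoint that serves metrics in the Prometheus text exposition format. The usual route is an official client library (e.g. `prometheus_client` in Python), but the format is simple enough to sketch with the standard library alone; the metric name below is made up:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(values):
    """Render {metric_name: (help_text, value)} in the Prometheus
    text exposition format (gauges only, no labels)."""
    lines = []
    for name, (help_text, value) in values.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics({
            "demo_queue_depth": ("Items waiting in the demo queue.", 42),
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve (port is arbitrary), then scrape with curl http://localhost:8000/metrics:
#   HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Pointing a Prometheus scrape job at such an endpoint is all it takes to plug in a new collector.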

Prometheus Third-party Collectors/Exporters

OpenTelemetry Collector
Telegraf Collector
Micrometer Collector

Prometheus Alarms and Event Tracking

  • Prometheus does not support event tracking, but it offers full support for alarms and alarm management. Prometheus's query language does, however, make it possible to implement event tracking on your own.
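Alarms are defined as alerting rules that Prometheus evaluates and hands off to Alertmanager for routing. A minimal rule sketch (the expression, threshold, and durations are illustrative, not recommendations):

```yaml
# rules.yml (fragment), loaded via rule_files in prometheus.yml
groups:
  - name: example
    rules:
      - alert: HighPodRestartRate
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting frequently"
```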

Prometheus and Cloud Monitoring

  • AWS CloudWatch is supported by Prometheus.

Prometheus Installers

Binaries, source code or Docker

Ansible Roles

Prometheus SaaS Solutions

Grafana

  • Grafana
  • Prometheus uses console templates for its dashboards, but the learning curve for their many features is steep and the user interface is limited. For this reason it is very common to use Grafana as the user interface instead.
  • grafana.com: Provisioning Grafana 🌟 Recent versions of Grafana allow "datasources" and "dashboards" to be created with Ansible via Grafana's provisioning options. This works with any datasource (Prometheus, InfluxDB, etc.), includes the corresponding Grafana configuration, and leaves little room for human error.
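As a sketch of what provisioning looks like, a Prometheus datasource is a small YAML file dropped into Grafana's `provisioning/datasources/` directory (the in-cluster URL is hypothetical):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.svc:9090   # hypothetical in-cluster URL
    isDefault: true
```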

Grafana Dashboards

| Monitored Component | Collector | Dashboard Number | URL |
|---|---|---|---|
| ActiveMQ 5.x "classic" | Telegraf | 10702 | Ref1, Ref2, Ref3, Ref4 |
| ActiveMQ Artemis/Red Hat AMQ Broker | JMX Exporter | 9087 | Ref1, Ref2, Ref3 |
| Message Streams like Kafka/Red Hat AMQ Streams | Other | 9777 | |

Kibana

Interactive Learning

Performance

Distributed Tracing. OpenTelemetry and Jaeger

Application Performance Management (APM)

Dynatrace APM

Message Queue Monitoring

| Messaging Solution | Monitoring Solution | URL |
|---|---|---|
| ActiveMQ 5.8.0+ | Dynatrace | ref |
| ActiveMQ Artemis | Micrometer Collector + Prometheus | ref1, ref2 |
| IBM MQ | IBM MQ Exporter for Prometheus | ref |
| Kafka | Dynatrace | ref1, ref2, ref3 |
| Kafka | Prometheus JMX Exporter | ref1, ref2, ref3, ref4, ref5, ref6, ref7 |
| Kafka | Kafka Exporter | ref, use JMX Exporter to export other metrics |
| Kafka | Kafdrop (Kafka Web UI) | ref |
| Kafka | ZooNavigator (web-based ZooKeeper UI) | ref |
| Kafka | CMAK (Cluster Manager for Apache Kafka, previously known as Kafka Manager) | ref |
| Kafka | Xinfra Monitor (renamed from Kafka Monitor, created by LinkedIn) | ref |
| Kafka | Telegraf + InfluxDB | ref |
| Red Hat AMQ Broker (ActiveMQ Artemis) | Prometheus plugin for AMQ Broker | ref1, ref2, ref3 |
| Red Hat AMQ Streams (Kafka) | JMX, OpenTracing+Jaeger | ref1, ref2 |
| Red Hat AMQ Streams Operator | AMQ Streams Operator (Prometheus & Jaeger), strimzi, jmxtrans | ref1, ref2, ref3: strimzi, ref4: jmxtrans, ref5: banzai operator |
| Red Hat AMQ Broker Operator | Prometheus (recommended) or Jolokia REST to JMX | ref1, ref2, ref3, ref4, ref5 |

Red Hat AMQ 7 Broker Monitoring solutions based on Prometheus and Grafana

This is a selection of monitoring solutions suitable for RH AMQ 7 Broker based on Prometheus and Grafana:

| Environment | Collector/Exporter | Details/URL |
|---|---|---|
| RHEL | Prometheus Plugin for AMQ Broker | ref |
| RHEL | Prometheus JMX Exporter | Same solution applied to ActiveMQ Artemis |
| OpenShift 3 | Prometheus Plugin for AMQ Broker | Grafana Dashboard not available, ref1, ref2 |
| OpenShift 4 | Prometheus Plugin for AMQ Broker | Check whether the Grafana Dashboard is automatically set up by the Red Hat AMQ Operator |
| OpenShift 3 | Prometheus JMX Exporter | Grafana Dashboard not available, ref1, ref2 |

Other Awesome Lists