Monitoring and Performance. Prometheus, Grafana, APMs and more

!!! info "Architectural Context"
    Detailed reference for monitoring and performance (Prometheus, Grafana, APMs, and more) in the context of Architectural Foundations.

Standard Reference

Cloud Infrastructure

Service Mesh

Istio Mesh

  • Istio.io [EN CONTENT] [ADVANCED LEVEL] [DE FACTO STANDARD] — The premier open-source service mesh providing advanced traffic management, end-to-end security, and granular observability. Uses Envoy proxies (via sidecars or Ambient mode) to secure and manage microservice fabrics.

Cloud Native Infrastructure

Observability

Distributed Tracing

Jaeger Platform
  • jaegertracing.io [DOCUMENTATION] [DE FACTO STANDARD] [ENTERPRISE-STABLE] — The official gateway for Jaeger, a CNCF-graduated distributed tracing platform. Essential for microservice architectures to monitor transactions, perform root-cause analysis, optimize performance bottlenecks, and visualize complex request propagation paths.

Log Analysis

Visualization Tools
  • Kibana [DOCUMENTATION] [DE FACTO STANDARD] [ENTERPRISE-STABLE] — The foundational visualization and management interface for the Elastic Stack. Enables operators to search, index, analyze, and construct real-time security dashboards and log analysis patterns for high-throughput microservice applications.

Cloud Native Languages

Java

Performance Tuning

  • tier1app.com [EN CONTENT] [ENTERPRISE-STABLE] — A dedicated APM tool for analyzing Java thread dumps and performance. Provides automated diagnostics for thread contention and deadlocks to optimize JVM application responsiveness.
  • fastthread.io [EN CONTENT] [DE FACTO STANDARD] [ENTERPRISE-STABLE] — Industrial-grade online Java thread dump analyzer that uses AI diagnostics to identify CPU spikes, thread leaks, and deadlock patterns. Essential for post-mortem analysis of containerized JVM workloads.
  • gceasy.io [EN CONTENT] [ADVANCED LEVEL] [DE FACTO STANDARD] [ENTERPRISE-STABLE] — Machine-learning powered JVM Garbage Collection log analyzer. Automates the detection of memory leaks, GC pauses, and heap sizing misconfigurations, offering actionable recommendations for optimization.
  • heaphero.io [EN CONTENT] [ADVANCED LEVEL] [ENTERPRISE-STABLE] — An automated cloud-based JVM heap dump analyzer built to parse large memory dumps quickly. Detects memory leaks and optimizes data structure footprints to resolve OutOfMemoryError crashes.

Event-Driven Architecture

Apache Kafka

Tooling and UI

  • Kafdrop Kafka Web UI 🌟 6135 [DE FACTO STANDARD] [ENTERPRISE-STABLE] — A popular, lightweight web UI for monitoring and managing Apache Kafka. It displays cluster information, brokers, topics, partition offsets, and consumer group lag, and lets you inspect JSON and protobuf message payloads.
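
Consumer group lag, one of the figures mentioned above, is conventionally the distance between a partition's log end offset and the group's committed offset. The following is a minimal sketch of that arithmetic on plain dictionaries. It does not use a Kafka client or Kafdrop's code, the function name `consumer_group_lag` is hypothetical, and treating a missing commit as offset 0 is a simplifying assumption.

```python
def consumer_group_lag(end_offsets, committed_offsets):
    """Return per-partition lag as {partition: messages_behind}.

    end_offsets:       {partition: log end offset}
    committed_offsets: {partition: last committed offset of the group}
    """
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset is counted from 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end offset never yields negative lag.
        lag[partition] = max(end - committed, 0)
    return lag


if __name__ == "__main__":
    end = {0: 120, 1: 75, 2: 40}
    committed = {0: 100, 1: 75}
    per_partition = consumer_group_lag(end, committed)
    print(per_partition, "total:", sum(per_partition.values()))
```

A lag that keeps growing over time usually means consumers cannot keep up with producers, which is the trend a lag view is meant to make visible.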

Infrastructure Operations

Sysadmin Toolsets

Resource Curation

Awesome Lists
  • Awesome Sysadmin 33981 [DE FACTO STANDARD] — A curated list of production-grade open source utilities, control planes, networking layers, and security tools used daily by systems architects and site reliability engineers.

Observability

Telemetry Standards

OpenTelemetry vs Prometheus

  • Prometheus and OpenTelemetry Compatibility Issues [ADVANCED LEVEL] [COMMUNITY-TOOL] — An informative look at the historical data model incompatibilities between Prometheus and OpenTelemetry (OTel). It details the industry efforts to reconcile standard Prometheus structures with the broader OTel landscape.
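
One commonly discussed difference between the two data models is metric naming: OTel names are dot-separated and carry the unit as separate metadata, while Prometheus names use underscores and conventionally embed the unit and a `_total` suffix for counters. The sketch below illustrates that kind of translation. It is a simplified assumption, not the linked article's method and not the full compatibility rules; the function name `to_prometheus_name` and the small unit table are hypothetical.

```python
import re

# Assumption: a tiny subset of unit-to-suffix mappings, for illustration only.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def to_prometheus_name(otel_name, unit="", monotonic_counter=False):
    """Translate an OTel-style metric name into a Prometheus-style one."""
    # Replace every character that is not valid in a Prometheus metric name.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Append the unit as a suffix unless the name already ends with it.
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    # Monotonic counters conventionally end in _total.
    if monotonic_counter and not name.endswith("_total"):
        name += "_total"
    return name


if __name__ == "__main__":
    print(to_prometheus_name("http.server.request.duration", unit="s"))
    print(to_prometheus_name("system.network.io", unit="By", monotonic_counter=True))
```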

Observability and Performance

Kubernetes Internals

Resource Management

  • The Hidden CPU Throttling Crisis in Kubernetes Clusters [EN CONTENT] [ADVANCED LEVEL] [COMMUNITY-TOOL] — An in-depth analysis exposing the silent threat of CPU throttling inside Kubernetes clusters caused by rigid CFS quota management. Demonstrates how microservices suffer latency spikes even with low aggregate CPU consumption.
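
The effect described above can be reproduced with simple arithmetic. This is a rough model, assuming the common 100 ms CFS period, a quota equal to the CPU limit times the period, work that starts at a period boundary, and threads that all run in parallel. The function name `simulate_cfs_throttling` is hypothetical and the model ignores scheduler details such as burst allowances.

```python
def simulate_cfs_throttling(cpu_limit, busy_threads, work_ms_per_thread,
                            period_ms=100.0):
    """Return (wall_clock_ms, throttled_periods) for one burst of work.

    cpu_limit:          container CPU limit in cores (e.g. 0.5)
    busy_threads:       threads running in parallel during the burst
    work_ms_per_thread: CPU time each thread needs, in milliseconds
    """
    quota_ms = cpu_limit * period_ms          # CPU time allowed per period
    remaining = busy_threads * work_ms_per_thread
    wall_ms = 0.0
    throttled_periods = 0
    while remaining > 0:
        usable = min(remaining, quota_ms)
        remaining -= usable
        if remaining > 0:
            # Quota exhausted: the cgroup waits for the next period to start.
            wall_ms += period_ms
            throttled_periods += 1
        else:
            # N parallel threads burn N ms of CPU time per ms of wall time.
            wall_ms += usable / busy_threads
    return wall_ms, throttled_periods


if __name__ == "__main__":
    # 4 threads x 20 ms of CPU each under a 0.5-core limit.
    print(simulate_cfs_throttling(0.5, 4, 20))
    # Same burst with a 2-core limit finishes without throttling.
    print(simulate_cfs_throttling(2.0, 4, 20))
```

In this model a request that needs only 80 ms of CPU in total takes 107.5 ms of wall time under a 0.5-core limit instead of 20 ms, even if the container is idle for the rest of the second, which is why average CPU usage alone can hide throttling.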

Performance Testing

HTTP Benchmarking

  • blog.cloud-mercato.com: New HTTP benchmark tool pycurlb [EN CONTENT] [ADVANCED LEVEL] [COMMUNITY-TOOL] — A deep-dive introducing pycurlb, a fast performance tool wrapping libcurl for rapid HTTP request benchmarking in Python. Explores real-world performance results and technical comparisons.

Operations and Reliability

Observability and Monitoring

Foundations

  • Monitoring Distributed Systems - Google SRE Book [ADVANCED LEVEL] [DOCUMENTATION] [DE FACTO STANDARD] — The industry-standard chapter from Google's SRE book detailing the implementation of distributed systems monitoring. It defines the 'Four Golden Signals'—latency, traffic, errors, and saturation—providing practical blueprints to prevent alert fatigue and build actionable dashboard designs.
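
As a small illustration of the four signals named above, the sketch below derives each one from a window of request records. The `Request` type, the `golden_signals` function, the choice of p99 for latency, and the definition of saturation as in-flight requests over capacity are all assumptions made for this example, not definitions taken from the book.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarise latency, traffic, errors, and saturation for one window."""
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile of request duration.
    rank = max(1, math.ceil(len(durations) * 99 / 100))
    p99 = durations[rank - 1] if durations else 0.0
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,
    }


if __name__ == "__main__":
    sample = [Request(10.0, 200)] * 98 + [Request(500.0, 500), Request(900.0, 503)]
    print(golden_signals(sample, window_s=50, in_flight=30, capacity=40))
```

In practice these values would come from a metrics system rather than raw request lists; the point here is only what each signal measures.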

💡 Explore Related: Mkdocs | Cheatsheets | Git