Site Reliability Engineering (SRE)

!!! info "Architectural Context"
    Detailed reference for Site Reliability Engineering (SRE) in the context of Platform & Site Reliability.

Standard Reference

Education

Software Engineering

Professional Growth

  • Skills for Real Engineers 100172 [ADVANCED LEVEL] [DE FACTO STANDARD] — A massive, widely vetted resource compiling software engineering methodologies, design schemas, and performance protocols required for elite software delivery.

Operations and Reliability

DevOps and SRE Culture

Career Roadmap

  • dev.to: What You Need to Break into DevOps and SRE [ENTERPRISE-STABLE] — A comprehensive skills matrix designed for software engineers transitioning into DevOps and SRE domains. The roadmap covers essential technologies including Linux systems internals, network virtualization, cloud infrastructure as code, CI/CD automation, and metric-driven monitoring pipelines.

Role Definitions

  • youtube: Viktor Farcic - What is the difference between SRE and DevOps? [COMMUNITY-TOOL] — A video comparing SRE and DevOps methodologies. The presentation covers where the two overlap, framing SRE as a concrete class that implements the abstract interface of DevOps, and emphasizes automated tooling, error budget tracking, and shared organizational objectives.
  • dev.to: DevOps vs SRE: What's The Difference? [COMMUNITY-TOOL] — An introductory analysis exploring how DevOps and SRE differ conceptually and operationally. It illustrates how SRE provides the programmatic solutions and infrastructure instrumentation needed to realize the broader cultural transformations promised by DevOps methodologies.
  • phoenixnap.com: SRE Vs. DevOps: Differences Explained 🌟 [ENTERPRISE-STABLE] — An educational guide that outlines the operational boundaries and target objectives distinguishing SRE from DevOps. It features detailed structural comparisons highlighting specific KPI alignments, on-call expectations, and error budget implementation procedures.
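
The "SRE as a concrete class implementing the abstract interface of DevOps" framing above can be rendered literally as a toy sketch. The method names and the `slo_target` parameter are illustrative assumptions, not taken from any of the linked resources.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """Abstract principles only; method names are hypothetical."""

    @abstractmethod
    def measure_reliability(self, good_events: int, total_events: int) -> float:
        """Return the fraction of events that were good."""

    @abstractmethod
    def may_release(self, good_events: int, total_events: int) -> bool:
        """Decide whether a new release is acceptable right now."""


class SRE(DevOps):
    """One concrete way to satisfy the interface: gate releases on an SLO."""

    def __init__(self, slo_target: float) -> None:
        self.slo_target = slo_target  # e.g. 0.999 (assumed example value)

    def measure_reliability(self, good_events: int, total_events: int) -> float:
        # With no traffic there is nothing to count against the service.
        if total_events == 0:
            return 1.0
        return good_events / total_events

    def may_release(self, good_events: int, total_events: int) -> bool:
        # Releases continue only while measured reliability meets the target.
        return self.measure_reliability(good_events, total_events) >= self.slo_target
```

`DevOps` cannot be instantiated on its own, which mirrors the point of the analogy: the principles need a concrete practice behind them.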

Observability and Monitoring

Foundations

  • Monitoring Distributed Systems - Google SRE Book [ADVANCED LEVEL] [DOCUMENTATION] [DE FACTO STANDARD] — The industry-standard chapter from Google's SRE book detailing the implementation of distributed systems monitoring. It defines the 'Four Golden Signals'—latency, traffic, errors, and saturation—providing practical blueprints to prevent alert fatigue and build actionable dashboard designs.
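
As a rough illustration of the four golden signals named above, the sketch below derives one value per signal from a window of request records. The `Request` type, the nearest-rank p99, the "5xx means error" rule, and the in-flight/capacity ratio used for saturation are all assumptions made for this example; the chapter itself does not prescribe them.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(
    requests: Sequence[Request],
    window_s: float,
    in_flight: int,
    capacity: int,
) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile, computed with integer arithmetic.
    rank = (99 * len(latencies) + 99) // 100
    p99 = latencies[rank - 1]

    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }
```

In practice these values usually come from a metrics backend rather than raw request lists; the point here is only what each signal measures.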

Organization Design

Operational Models

  • thenewstack.io: Centralized vs. Decentralized Operations [ADVANCED LEVEL] [COMMUNITY-TOOL] — A deep-dive architectural comparison of centralized operations centers versus decentralized, application-embedded operations models. The analysis explores trade-offs regarding communication latency, incident ownership, and platform engineering scalability under cognitive overload.

Platform Engineering

Ecosystem Integration

  • thenewstack.io: SRE vs. DevOps? Successful Platform Engineering Needs Both [COMMUNITY-TOOL] — Analyzes how SRE stability frameworks and DevOps continuous pipelines must merge to construct efficient Internal Developer Platforms. It emphasizes that treating the platform as a product is essential to balancing development velocity with overall cloud-infrastructure reliability.

Role Definitions

  • devops.com: SRE Vs. Platform Engineering: What's the Difference? [DE FACTO STANDARD] — A critical examination of the conceptual and practical differences between SRE and Platform Engineering. It illustrates how SRE prioritizes system-centric stability, SLO management, and incident response, while Platform Engineering focuses on improving the developer experience through internal developer platforms (IDPs).

Service Level Objectives

Community Events

  • SLOconf [DOCUMENTATION] [COMMUNITY-TOOL] — The official landing page for SLOconf, a premier community event dedicated to Service Level Objectives. The forum hosts deep technical tracks, production post-mortems, and deployment case studies, making it an essential hub for engineers refining reliability standards.

Foundations

  • sre.google: The Art of SLOs [DOCUMENTATION] [DE FACTO STANDARD] — An essential training handbook from Google covering the foundational concepts of setting, calculating, and maintaining Service Level Objectives. It provides practical exercises to identify critical user pathways and align internal metrics with real-world customer expectations.
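
To make the "setting, calculating, and maintaining" part concrete, here is a minimal sketch of the standard error-budget arithmetic: an SLO target of 99.9% leaves 0.1% of events (or of time) as budget. The function and field names are assumptions for illustration and do not come from the handbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_bad: float        # bad events the SLO tolerates in the window
    consumed_fraction: float  # share of the budget already spent
    remaining_bad: float      # bad events still tolerable (negative if overspent)


def error_budget(slo_target: float, total_events: int, bad_events: int) -> ErrorBudget:
    """Event-based budget: (1 - target) of all events may be bad."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1 - slo_target) * total_events
    # With no traffic the budget is empty, so any bad event overspends it.
    if allowed > 0:
        consumed = bad_events / allowed
    else:
        consumed = float("inf") if bad_events else 0.0
    return ErrorBudget(allowed, consumed, allowed - bad_events)


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based budget: minutes of full outage the SLO tolerates."""
    return (1 - slo_target) * window_days * 24 * 60
```

For example, a 99.9% target over a 30-day window tolerates roughly 43 minutes of full outage, and 250 failed requests out of one million against the same target spends about a quarter of the budget.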

GitOps Implementation

Open Standards

  • OpenSLO specification 🌟 1491 [ADVANCED LEVEL] [DE FACTO STANDARD] — The open-source OpenSLO specification, which defines a vendor-agnostic standard for declaring SLOs, SLIs, and error budgets in YAML format. It enables platform engineers to define reliability metrics declaratively across different monitoring backends such as Prometheus and Datadog.

Progressive Delivery

  • Iter8 [ADVANCED LEVEL] [DOCUMENTATION] [ENTERPRISE-STABLE] — A Kubernetes-native progressive delivery platform that orchestrates metric-driven canary releases and A/B tests. Iter8 validates runtime SLO performance against Prometheus and OpenTelemetry targets to automate application promotion or rollback.
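
The general idea of a metric-driven promotion gate can be sketched as below. This is not Iter8's API or configuration format; the `CanaryMetrics` type, the thresholds, and the three verdict strings are hypothetical and only illustrate the decision logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    sample_count: int       # requests observed on the canary so far


def canary_verdict(
    metrics: CanaryMetrics,
    max_error_rate: float,
    max_p99_ms: float,
    min_samples: int,
) -> str:
    """Return 'continue', 'rollback', or 'promote' for a canary version."""
    # Too little traffic: keep observing rather than deciding on noise.
    if metrics.sample_count < min_samples:
        return "continue"
    # Any violated objective fails the canary.
    if metrics.error_rate > max_error_rate or metrics.p99_latency_ms > max_p99_ms:
        return "rollback"
    return "promote"
```

A real controller would fetch these numbers from a metrics backend on a schedule and act on the verdict; the sketch covers only the comparison step.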

Testing and Validation

Site Reliability Engineering

Best Practices

  • infracloud.io: Site Reliability Engineering (SRE) Best Practices [COMMUNITY-TOOL] — This architectural guide details actionable workflows for modern SRE execution, covering service level indicator definition, runbook automation, and collaborative blameless post-mortems. It suggests that successful implementation requires moving teams away from manual toil and toward reliability-first platform engineering.
  • toolbox.com: Site Reliability Engineering: What Is It and How Can It Help Scale Operations? 🌟 [ENTERPRISE-STABLE] — An executive guide to how site reliability engineering helps scale complex IT footprints by converting manually executed procedures into self-healing code. It emphasizes how establishing shared error budgets minimizes organization-wide conflict over deployment speed and risk tolerance.

Career Roadmap

  • devops.com: Top Nine Skills for SREs to Master 🌟 [ENTERPRISE-STABLE] — Highlights the nine core disciplines mandatory for modern SRE professionals. Essential competencies include programming, software-defined networking, container orchestration, proactive observability, secure deployment design, and automated release rollbacks.

Case Studies

  • thenewstack.io: Google SRE: Site Reliability Engineering at a Global Scale [COMMUNITY-TOOL] — A retrospective analysis of Google's journey in creating the modern SRE discipline to address unprecedented internet scale. It focuses on the core organizational policy that SRE teams must devote at least 50% of their bandwidth to engineering rather than operational toil.

Engagement Models

  • sre.google: sre-book - The Evolving SRE Engagement Model [ADVANCED LEVEL] [DOCUMENTATION] [DE FACTO STANDARD] — This seminal text from the Google SRE Book explores the life cycle of SRE engagements with product teams. It reviews diverse topological frameworks, detailing how to transition product teams from embedded SRE support models to decoupled, self-service infrastructure platforms.

Evolution

  • thenewstack.io: How the SRE Experience Is Changing with Cloud Native 🌟 [ENTERPRISE-STABLE] — This high-density industry analysis examines how the rise of complex cloud-native architectures shifts SRE responsibilities. It addresses how microservices, service meshes, and dynamic scheduling require SREs to move from simple system monitoring to deep, code-level observability and platform design.

Operational Tooling

  • (2022) getcortexapp.com: A guide to the best SRE tools [COMMUNITY-TOOL] — This reference guide provides a framework to evaluate SRE automation platforms, including service cataloging, telemetry aggregators, and automated incident response tools. It details how choosing tools to match the engineering team's maturity level avoids unnecessary tooling overhead.
  • devops.com: How SREs Benefit From Feature Flags [COMMUNITY-TOOL] — This technical analysis shows how progressive delivery using feature flags supports high-availability operations. It details how feature flag infrastructure enables instant software rollbacks, controlled canary tests, and reduced operational blast radius without requiring full redeployments.
  • thenewstack.io: The Site Reliability Engineering Tool Stack [COMMUNITY-TOOL] — A comprehensive reference detailing the typical software suites utilized across modern SRE organizations. It groups tools into critical segments: observability engines, configuration managers, automated deployment tooling, issue-tracking dashboards, and dynamic status portals.
  • thenewstack.io: The Best Site Reliability Engineering Tools in 2021 [COMMUNITY-TOOL] — A specialized review of elite reliability tools, detailing their integration with cloud infrastructure and error budget managers. It outlines how telemetry tools can be leveraged programmatically within continuous deployment stages to trigger automated rolling upgrades or fast rollbacks.
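
The feature-flag entry in this list mentions instant rollbacks and controlled canary exposure without redeployment. A minimal sketch of that mechanism follows, assuming a hypothetical in-process `FeatureFlag` class with hash-based percentage bucketing; real flag services differ in storage and evaluation details.

```python
import hashlib


class FeatureFlag:
    """Hypothetical flag with a percentage rollout and a kill switch."""

    def __init__(self, name: str, rollout_percent: int = 0) -> None:
        if not 0 <= rollout_percent <= 100:
            raise ValueError("rollout_percent must be between 0 and 100")
        self.name = name
        self.rollout_percent = rollout_percent
        self.killed = False

    def _bucket(self, user_id: str) -> int:
        # Stable hash so a given user always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % 100

    def is_enabled_for(self, user_id: str) -> bool:
        if self.killed:
            return False
        return self._bucket(user_id) < self.rollout_percent

    def kill(self) -> None:
        # "Rollback" is a state change, not a redeployment.
        self.killed = True
```

Because buckets are stable, raising `rollout_percent` only adds users to the exposed group, which keeps the blast radius predictable as the rollout widens.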

Resources

  • sre.google/prodcast [DOCUMENTATION] [ENTERPRISE-STABLE] — Google SRE's official podcast platform, offering in-depth conversations on production readiness, disaster simulation, massive database scale, and global network engineering. This serves as an elite audio-learning resource for cloud architects designing resilient distributed architectures.

Role Definitions

  • devops.com: Day in the Life of a Site Reliability Engineer (SRE) [COMMUNITY-TOOL] — A detailed technical narrative outlining the day-to-day work patterns of an active SRE. It details technical duties such as managing on-call alert systems, performing root-cause evaluations, coding infrastructure automation, and consulting with software development teams.

Training and Incident Response

  • infoq.com: Observing and Understanding Failures: SRE Apprentices [COMMUNITY-TOOL] — This session addresses the cognitive models of system failure, specifically targeting how apprentices and junior SREs can safely learn to analyze complex failures. It advocates for structured code-level tracing, game days, and interactive debugging to accelerate reliable operational troubleshooting.

Platform Engineering

Architectural Patterns

Internal Developer Platforms

Site Reliability Engineering

Case Studies

Foundations

Observability


💡 Explore Related: DevOps | Project Management Methodology | Scaffolding