mirror of
https://github.com/nubenetes/awesome-kubernetes.git
synced 2026-05-26 11:04:28 +00:00
Site Reliability Engineering (SRE)
!!! info "Architectural Context" Detailed reference for Site Reliability Engineering (SRE) in the context of Platform & Site Reliability.
Standard Reference
- wikipedia: Site Reliability Engineering [COMMUNITY-TOOL]
- overops.com: DevOps vs. SRE: What’s the Difference Between Them, and Which One Are You? [COMMUNITY-TOOL]
- dzone: SRE vs. DevOps: SRE Is to DevOps What Scrum Is to Agile [COMMUNITY-TOOL]
- cncf.io: DevOps vs. SRE [COMMUNITY-TOOL]
- kelda.io: Why SREs Should be Responsible for Development Environments [COMMUNITY-TOOL]
- hernan-david-hd.medium.com: 5 pilares del SRE/DevOps [COMMUNITY-TOOL]
- hernan-david-hd.medium.com: Breaking down SRE/DevOps into 5 key areas [COMMUNITY-TOOL]
- stackpulse.com: Managing Reliability for Monoliths vs. Microservices: The Challenges for SREs [COMMUNITY-TOOL]
- stackpulse.com: Managing Reliability for Monoliths vs. Microservices: Best Practices for SREs [COMMUNITY-TOOL]
- stackpulse.com: No, SRE Is Not the New DevOps – Unless It Is [COMMUNITY-TOOL]
- medium: Agile vs. DevOps vs. SRE… it’s not OR, it’s AND ! [COMMUNITY-TOOL]
- blogs.letusdevops.com: How much programming should I know for DevOps/SRE domain. [COMMUNITY-TOOL]
- cncf.io: DevOps vs. SRE vs. Platform Engineering? The gaps might be smaller than you think [COMMUNITY-TOOL]
- dzone.com: DevOps vs. SRE vs. Platform Engineer vs. Cloud Engineer [COMMUNITY-TOOL]
- blog.acethecloud.com: A Step-by-Step Guide to Calculate SLAs, SLIs, and SLOs for Your IT Services [COMMUNITY-TOOL]
- medium.com/picsart-engineering: Prioritizing Development Efforts with SLOs in Microservices [COMMUNITY-TOOL]
Education
Software Engineering
Professional Growth
- Skills for Real Engineers ⭐ 100172 [ADVANCED LEVEL] [DE FACTO STANDARD] — A massive, widely vetted resource compiling software engineering methodologies, design schemas, and performance protocols required for elite software delivery.
Operations and Reliability
DevOps and SRE Culture
Career Roadmap
- dev.to: What You Need to Break into DevOps and SRE [ENTERPRISE-STABLE] — A comprehensive skills matrix designed for software engineers transitioning into DevOps and SRE domains. The roadmap covers essential technologies including Linux systems internals, network virtualization, cloud infrastructure as code, CI/CD automation, and metric-driven monitoring pipelines.
Role Definitions
- youtube: Viktor Farcic - What is the difference between SRE and DevOps? [COMMUNITY-TOOL] — A video comparing SRE and DevOps methodologies. The presentation covers where the two overlap, framing SRE as a concrete class implementing the abstract interface of DevOps, and emphasizes automated tooling, error budget tracking, and shared organizational objectives.
- dev.to: DevOps vs SRE: What's The Difference? [COMMUNITY-TOOL] — An introductory analysis exploring how DevOps and SRE differ conceptually and operationally. It illustrates how SRE provides the programmatic solutions and infrastructure instrumentation needed to realize the broader cultural transformations promised by DevOps methodologies.
- phoenixnap.com: SRE Vs. DevOps: Differences Explained 🌟 [ENTERPRISE-STABLE] — An educational guide that outlines the operational boundaries and target objectives distinguishing SRE from DevOps. It features detailed structural comparisons highlighting specific KPI alignments, on-call expectations, and error budget implementation procedures.
Observability and Monitoring
Foundations
- Monitoring Distributed Systems - Google SRE Book [ADVANCED LEVEL] [DOCUMENTATION] [DE FACTO STANDARD] — The industry-standard chapter from Google's SRE book detailing the implementation of distributed systems monitoring. It defines the 'Four Golden Signals'—latency, traffic, errors, and saturation—providing practical blueprints to prevent alert fatigue and build actionable dashboard designs.
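As a rough illustration of the Four Golden Signals named in the entry above, the sketch below summarises latency, traffic, errors, and saturation for one observation window. The `Request` record, the `golden_signals` function, and the CPU-based saturation measure are hypothetical choices made for this example; they are not taken from the SRE book.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    latency_ms: float
    ok: bool


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarise latency, traffic, errors, and saturation for one window."""
    if not requests:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_used / cpu_capacity}
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: tail latency matters more than the mean for user experience.
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        # Traffic: demand placed on the service, here requests per second.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": failed / len(requests),
        # Saturation: how full the most constrained resource is (assumed CPU here).
        "saturation": cpu_used / cpu_capacity,
    }
```

In practice these values would come from a metrics backend rather than an in-memory list; the point of the sketch is only that each signal reduces to a small, well-defined aggregate.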
Organization Design
Operational Models
- thenewstack.io: Centralized vs. Decentralized Operations [ADVANCED LEVEL] [COMMUNITY-TOOL] — A deep-dive architectural comparison of centralized operations centers versus decentralized, application-embedded operations models. The analysis explores trade-offs regarding communication latency, incident ownership, and platform engineering scalability under cognitive overload.
Platform Engineering
Ecosystem Integration
- thenewstack.io: SRE vs. DevOps? Successful Platform Engineering Needs Both [COMMUNITY-TOOL] — Analyzes how SRE stability frameworks and DevOps continuous pipelines must merge to construct efficient Internal Developer Platforms. It emphasizes that treating the platform as a product is essential to balance development velocity with overall cloud-infrastructure reliability.
Role Definitions (1)
- devops.com: SRE Vs. Platform Engineering: What’s the Difference? [DE FACTO STANDARD] — A critical examination of the conceptual and practical differences between SRE and Platform Engineering. It illustrates how SRE prioritizes system-centric stability, SLO management, and incident response, while Platform Engineering focuses on improving the developer experience through internal developer platforms (IDPs).
Service Level Objectives
Community Events
- SLOconf [DOCUMENTATION] [COMMUNITY-TOOL] — The official landing page for SLOconf, a premier community event dedicated to Service Level Objectives. The forum hosts deep technical tracks, production post-mortems, and deployment case studies, making it an essential hub for engineers refining reliability standards.
Foundations (1)
- sre.google: The Art of SLOs [DOCUMENTATION] [DE FACTO STANDARD] — An essential training handbook from Google covering the foundational concepts of setting, calculating, and maintaining Service Level Objectives. It provides practical exercises to identify critical user pathways and align internal metrics with real-world customer expectations.
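The arithmetic behind setting and tracking an SLO can be sketched in a few lines. The helper names below (`error_budget`, `allowed_downtime_minutes`) are hypothetical and assume a simple event-based SLI (good events divided by total events); the handbook itself may frame the exercises differently.

```python
def error_budget(slo_target, total_events, bad_events):
    """Compare an event-based SLI against an SLO target.

    slo_target is a fraction such as 0.999 (i.e. "three nines").
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = (total_events - bad_events) / total_events
    # The error budget is the share of events allowed to fail.
    allowed_bad = (1 - slo_target) * total_events
    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_consumed": bad_events / allowed_bad,
        "slo_met": sli >= slo_target,
    }


def allowed_downtime_minutes(slo_target, window_days):
    """Time-based view of the same budget: minutes of full outage tolerated."""
    return (1 - slo_target) * window_days * 24 * 60
```

For example, a 99.9% target over a 30-day window tolerates roughly 43 minutes of full outage, and a service that has failed 250 of 1,000,000 requests under that target has used about a quarter of its budget.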
GitOps Implementation
- thenewstack.io: Automate User Satisfaction with This GitOps-Friendly Spec for Service Level Objectives [COMMUNITY-TOOL] — Details how declarative, GitOps-friendly schemas can be used to manage Service Level Objectives alongside primary code repositories. It shows how treating SLO configs as code assets allows CI/CD systems to continuously audit user satisfaction and validate code merges.
Open Standards
- OpenSLO specification 🌟 ⭐ 1491 [ADVANCED LEVEL] [DE FACTO STANDARD] — The open-source OpenSLO specification, which defines a vendor-agnostic standard for declaring SLOs, SLIs, and error budgets in YAML format. It enables platform engineers to implement declarative reliability metrics across diverse monitoring backends such as Prometheus and Datadog.
Progressive Delivery
- Iter8 [ADVANCED LEVEL] [DOCUMENTATION] [ENTERPRISE-STABLE] — A Kubernetes-native progressive delivery platform that orchestrates metric-driven canary releases and A/B tests. It validates runtime SLO performance, using Prometheus and OpenTelemetry targets to automate application promotion or rollbacks.
Testing and Validation
- thenewstack.io: Validate Service-Level Objectives of REST APIs Using Iter8 [COMMUNITY-TOOL] — A technical walkthrough detailing how to integrate Iter8 with CI/CD runners to automatically validate REST API SLO configurations. The tutorial includes sample manifests for monitoring API response times and failure rates under heavy synthetic request loads.
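The kind of check described above, validating response times and failure rates collected under synthetic load, can be sketched independently of any tool. This is not Iter8's API: `validate_slos`, its parameters, and the `(status_code, latency_ms)` sample shape are all assumptions made for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def validate_slos(samples, max_p95_ms, max_error_rate):
    """Gate a release on load-test results.

    samples: list of (status_code, latency_ms) tuples from a synthetic run.
    Returns (passed, violations) so a CI job can log why a gate failed.
    """
    if not samples:
        return False, ["no samples collected"]
    latencies = [latency for _, latency in samples]
    # Treat 5xx responses as failures; 4xx are assumed to be client errors.
    server_errors = sum(1 for status, _ in samples if status >= 500)
    p95 = percentile(latencies, 95)
    error_rate = server_errors / len(samples)

    violations = []
    if p95 > max_p95_ms:
        violations.append(f"p95 latency {p95} ms exceeds {max_p95_ms} ms")
    if error_rate > max_error_rate:
        violations.append(f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    return not violations, violations
```

A pipeline step could call `validate_slos` after the load test and fail the build when `passed` is false, which is the promote-or-rollback decision in its simplest form.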
Site Reliability Engineering
Best Practices
- infracloud.io: Site Reliability Engineering (SRE) Best Practices [COMMUNITY-TOOL] — This architectural guide details actionable workflows for modern SRE execution, covering service level indicator definition, runbook automation, and collaborative blameless post-mortems. It suggests that successful implementation requires moving teams away from manual toil and toward reliability-first platform engineering.
- toolbox.com: Site Reliability Engineering: What Is It and How Can It Help Scale Operations? 🌟 [ENTERPRISE-STABLE] — An executive guide to how site reliability engineering helps scale complex IT footprints by converting manually executed procedures into self-healing code. It emphasizes how establishing shared error budgets minimizes organization-wide conflict over deployment speed and risk tolerance.
Career Roadmap (1)
- devops.com: Top Nine Skills for SREs to Master 🌟 [ENTERPRISE-STABLE] — Highlights the nine core disciplines mandatory for modern SRE professionals. Essential competencies include programming, software-defined networking, container orchestration, proactive observability, secure deployment design, and automated release rollbacks.
Case Studies
- thenewstack.io: Google SRE: Site Reliability Engineering at a Global Scale [COMMUNITY-TOOL] — A retrospective analysis of Google's journey in creating the modern SRE discipline to address unprecedented internet scale. It focuses on the core organizational policy that SRE teams must devote at least 50% of their bandwidth to engineering rather than operational toil.
Engagement Models
- sre.google: sre-book - The Evolving SRE Engagement Model [ADVANCED LEVEL] [DOCUMENTATION] [DE FACTO STANDARD] — This seminal text from the Google SRE Book explores the life cycle of SRE engagements with product teams. It reviews diverse topological frameworks, detailing how to transition product teams from embedded SRE support models to decoupled, self-service infrastructure platforms.
Evolution
- thenewstack.io: How the SRE Experience Is Changing with Cloud Native 🌟 [ENTERPRISE-STABLE] — This high-density industry analysis examines how the rise of complex cloud-native architectures shifts SRE responsibilities. It addresses how microservices, service meshes, and dynamic scheduling require SREs to move from simple system monitoring to deep, code-level observability and platform design.
Operational Tooling
- (2022) getcortexapp.com: A guide to the best SRE tools [COMMUNITY-TOOL] — This reference guide provides a framework to evaluate SRE automation platforms, including service cataloging, telemetry aggregators, and automated incident response tools. It details how prioritizing tooling based on the engineering team's maturity level prevents overhead.
- devops.com: How SREs Benefit From Feature Flags [COMMUNITY-TOOL] — This technical analysis shows how progressive delivery using feature flags supports high-availability operations. It details how feature flag infrastructure enables instant software rollbacks, controlled canary tests, and reduced operational blast radius without requiring full redeployments.
- thenewstack.io: The Site Reliability Engineering Tool Stack [COMMUNITY-TOOL] — A comprehensive reference detailing the typical software suites utilized across modern SRE organizations. It groups tools into critical segments: observability engines, configuration managers, automated deployment tooling, issue-tracking dashboards, and dynamic status portals.
- thenewstack.io: The Best Site Reliability Engineering Tools in 2021 [COMMUNITY-TOOL] — A specialized review of elite reliability tools, detailing their integration with cloud infrastructure and error budget managers. It outlines how telemetry tools can be leveraged programmatically within continuous deployment stages to trigger automated rolling upgrades or fast rollbacks.
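The feature-flag entry in this list mentions instant rollbacks, controlled canary exposure, and a reduced blast radius without redeploying. A minimal sketch of that mechanism follows; the `FeatureFlag` class and its methods are hypothetical, and real flag services add persistence, targeting rules, and auditing that are omitted here.

```python
import hashlib


class FeatureFlag:
    """Percentage-based rollout with a kill switch (illustrative only)."""

    def __init__(self, name, rollout_percent=0):
        self.name = name
        self.rollout_percent = rollout_percent

    def _bucket(self, user_id):
        # Hash flag name and user together so each flag gets an independent,
        # stable assignment of users to buckets 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, user_id):
        return self._bucket(user_id) < self.rollout_percent

    def set_rollout(self, percent):
        """Widen or narrow exposure without a redeploy."""
        self.rollout_percent = max(0, min(100, percent))

    def kill(self):
        """Instant rollback: disable the feature for everyone."""
        self.rollout_percent = 0
```

Because a user's bucket never changes, raising the rollout from 10% to 50% only adds users; nobody who already has the feature loses it, which keeps the canary population stable while exposure grows.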
Resources
- sre.google/prodcast [DOCUMENTATION] [ENTERPRISE-STABLE] — Google SRE's official podcast platform, offering in-depth conversations on production readiness, disaster simulation, massive database scale, and global network engineering. This serves as an elite audio-learning resource for cloud architects designing resilient distributed architectures.
Role Definitions (2)
- devops.com: Day in the Life of a Site Reliability Engineer (SRE) [COMMUNITY-TOOL] — A detailed technical narrative outlining the day-to-day work patterns of an active SRE. It details technical duties such as managing on-call alert systems, performing root-cause evaluations, coding infrastructure automation, and consulting with software development teams.
Training and Incident Response
- infoq.com: Observing and Understanding Failures: SRE Apprentices [COMMUNITY-TOOL] — This session addresses the cognitive models of system failure, specifically targeting how apprentices and junior SREs can safely learn to analyze complex failures. It advocates for structured code-level tracing, game days, and interactive debugging to accelerate reliable operational troubleshooting.
Platform Engineering (1)
Architectural Patterns
Internal Developer Platforms
- Platform Democracy: Rethinking Who Builds and Consumes Your Internal Platform [ADVANCED LEVEL] [COMMUNITY-TOOL] — An analytical piece explaining Platform Democracy as an operational framework. Discusses user-centric workflows for designing Internal Developer Platforms (IDPs).
Site Reliability Engineering (1)
Case Studies (1)
- (2023) openshift.com: From Ops to SRE - Evolution of the OpenShift Dedicated Team [ADVANCED LEVEL] 🌟🌟🌟 [COMMUNITY-TOOL] — An enterprise case study detailing how Red Hat transitioned its OpenShift Dedicated operations team to a modern SRE model, showing concrete scaling metrics.
Foundations (2)
- (2024) cloud.google.com: SRE vs. DevOps: competing standards or close friends? 🌟🌟🌟 [COMMUNITY-TOOL] — An official Google Cloud guide highlighting operational synergies and clear organizational distinctions between SRE and DevOps execution models.
- sre.google: What is Site Reliability Engineering (SRE)? 🌟 [ADVANCED LEVEL] [DE FACTO STANDARD] — The main portal hosting Google's legendary Site Reliability Engineering, Site Reliability Workbook, and Building Secure and Reliable Systems textbooks. Mandatory standard reference.
- devops.com: SRE vs. DevOps — a False Distinction? [COMMUNITY-TOOL] — An editorial analyzing the perceived conflict between DevOps and SRE, detailing why they should be integrated as complementary mechanisms for robust microservices delivery.
- devops.com: SRE vs. DevOps vs. Cloud Native: The Server Cage Match [COMMUNITY-TOOL] — A deep-dive technical comparison contrasting DevOps pipelines, SRE operational standards, and Cloud-Native application patterns inside modern server architecture.
- devops.com: Site Reliability Engineering 101: DevOps Versus SRE [COMMUNITY-TOOL] — An introductory, high-density primer defining what Site Reliability Engineering is, framing its core metrics and comparing its functional goals directly against DevOps.
- linkedin: DevOps vs Site Reliability Engineering [COMMUNITY-TOOL] — An analytical overview detailing the distinct career paths, everyday duties, and engineering goals that separate cloud systems administrators from dedicated SREs.
- opensource.com: What is an SRE and how does it relate to DevOps? [COMMUNITY-TOOL] — A practical exploration focusing on running SRE frameworks inside small-scale startups. Proposes strategies for establishing basic SLIs and SLOs under strict budget conditions.
- thenewstack.io: Where Site Reliability Engineering Overlaps with DevOps [COMMUNITY-TOOL] — A structural examination of the technical overlaps between DevOps and SRE, outlining how automated observability pipelines and telemetry serve both teams.
- linkedin.com: SRE: Key Insights-"Done the right way" [COMMUNITY-TOOL] — A tactical blog post detailing common pitfalls in enterprise SRE implementation, warning against rebranded ops ticket silos that lack architectural authority.
- devops.com: How the SRE Role Is Evolving [COMMUNITY-TOOL] — A technical essay tracking the evolution of the SRE role. Explains how advances in telemetry and AIOps automation loops are reshaping standard reliability metrics.
- cloud.google.com: SRE at Google: Our complete list of CRE life lessons 🌟 [ADVANCED LEVEL] [DE FACTO STANDARD] — An essential collection of enterprise insights gathered by Google Customer Reliability Engineers (CRE). Translates massive Google-scale SRE rules into practical roadmaps for external architectures.
Observability
- youtube: Platform9’s Madhura Maskasky says observability is also essential for diagnosing and debugging in order for SREs to "get to the root cause quickly enough so that you can feed that back to the development teams." 🌟 [COMMUNITY-TOOL] — A highly informative video overview mapping how Platform9 integrates robust observability networks. Explains why deep trace analysis is vital for fast root-cause isolation in microservices.
- circonus.com: Monitoring for Success: What All SREs Need to Know [COMMUNITY-TOOL] — A deep technical evaluation of telemetry and metric requirements for SRE. Discusses the selection of appropriate service objectives and data collection frequencies.
💡 Explore Related: DevOps | Project Management Methodology | Scaffolding