mirror of
https://github.com/nubenetes/awesome-kubernetes.git
synced 2026-05-26 19:18:58 +00:00
Big Data and Kubernetes Big Data
!!! info "Architectural Context" Detailed reference for Big Data and Kubernetes Big Data in the context of The Container Stack.
Standard Reference
- tomlous.medium.com: CI/CD for Data Engineers. Reliably Deploying Scala Spark containers for Kubernetes with Github Actions [COMMUNITY-TOOL]
- dzone: Run and Scale an Apache Spark Application on Kubernetes [COMMUNITY-TOOL]
- dzone: Running Apache Spark on Kubernetes [COMMUNITY-TOOL]
- medium: Running Apache Spark on Kubernetes [COMMUNITY-TOOL]
- levelup.gitconnected.com: Master SparkML: Practical Guide for Machine Learning [COMMUNITY-TOOL]
Cloud Native AI
Batch Workloads
Kueue Scheduling
- Red Hat Build of Kueue [ADVANCED LEVEL] [DOCUMENTATION] [COMMUNITY-TOOL] — Curator Insight: Documentation for the Red Hat Build of Kueue scheduler within OpenShift. Live Grounding: Kueue provides job queueing controls, priority groupings, and resource quotas, which makes it well suited to managing AI/ML and batch workloads.
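As a rough illustration of the queueing model described above, the sketch below builds a Kubernetes Job manifest that is handed to Kueue through a queue label. It assumes the upstream Kueue convention of the `kueue.x-k8s.io/queue-name` label and a Job created in the suspended state; the function name, the queue name `user-queue`, and the image are hypothetical, and the Red Hat build may differ in details.

```python
import json


def kueue_job_manifest(name, queue_name, image, command, cpu="1", memory="1Gi"):
    """Build a batch/v1 Job manifest intended to be admitted through Kueue.

    Hypothetical helper: only the label key follows the upstream Kueue
    convention; every other name here is an illustrative assumption.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Points the Job at a LocalQueue in the same namespace.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Created suspended; Kueue unsuspends it once quota is available.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "command": list(command),
                            # Requests are what quota accounting is based on.
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = kueue_job_manifest(
        name="sample-batch-job",      # hypothetical job name
        queue_name="user-queue",      # hypothetical LocalQueue name
        image="busybox",              # placeholder image
        command=["sh", "-c", "echo hello"],
    )
    # Kubernetes accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, provided a matching queue already exists in the target namespace.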
Big Data Orchestration
Data Pipelines
- hevodata.com: Building Apache Spark Data Pipeline? Made Easy 101 🌟 [COMMUNITY-TOOL] [GUIDE] — Curator Insight: Fundamental guide to building Apache Spark-based data pipelines. Live Grounding: Explains the basic architecture of Spark RDDs and DataFrames, and the connections required to route data from transactional sources into modern cloud warehouses.
Market Surveys
- opensourceforu.com: Kubernetes Adoption Widespread for Big Data: Survey [COMMUNITY-TOOL] — Curator Insight: Survey results discussing the widespread adoption of Kubernetes scheduling for big data workloads. Live Grounding: Outlines historical transition metrics from static clusters to unified container environments, citing resource efficiency and deployment agility as top motivators.
Spark on Kubernetes
- (2021) datamechanics.co: Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available [COMMUNITY-TOOL] — Curator Insight: Platform news highlighting Apache Spark 3.1 generally available support for native Kubernetes. Live Grounding: Covers the native scheduling capabilities, decommissioning behaviors, and executor tracking improvements that made Kubernetes a first-class citizen for Spark.
- itnext.io: Migrating Apache Spark workloads from AWS EMR to Kubernetes [ADVANCED LEVEL] [COMMUNITY-TOOL] — Curator Insight: Technical breakdown of migrating Apache Spark analytics engines from AWS EMR to Kubernetes clusters. Live Grounding: Deep-dives into memory allocation, dynamic resource allocation, storage mounting, and cost optimizations compared to traditional VM-based EMR setups.
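The native scheduling, dynamic allocation, and decommissioning features mentioned in the entries above are mostly switched on through `spark-submit` configuration. The following is a minimal sketch that assembles such a command line; the helper name, API server address, image, class, and jar path are hypothetical placeholders, and the selected settings are one plausible combination rather than a recommended baseline.

```python
import shlex


def spark_submit_args(api_server, image, app_jar, main_class,
                      namespace="default", max_executors=4):
    """Assemble a spark-submit argument list for cluster mode on Kubernetes.

    Hypothetical helper; the configuration keys are standard Spark settings,
    but the values chosen here are illustrative assumptions.
    """
    conf = {
        # Container image used for the driver and executor pods.
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        # Dynamic allocation without an external shuffle service relies on
        # shuffle tracking.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Graceful executor decommissioning (introduced around Spark 3.1).
        "spark.decommission.enabled": "true",
    }
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


if __name__ == "__main__":
    args = spark_submit_args(
        api_server="https://kubernetes.example.com:6443",  # placeholder
        image="example.com/spark-app:latest",              # placeholder
        app_jar="local:///opt/spark/app/app.jar",          # placeholder
        main_class="com.example.Main",                     # placeholder
    )
    # Print a shell-safe command line instead of executing it.
    print(shlex.join(args))
```

Printing the command rather than running it keeps the sketch free of side effects; in a CI job the same list could be passed to `subprocess.run`.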
Data Platforms
Distributed Processing
Apache Spark on Kubernetes
- (2023) spot.io: Setting up, Managing & Monitoring Spark on Kubernetes [COMMUNITY-TOOL] — Curator Insight outlines workflows for configuring, managing, and tracking Spark applications on Kubernetes. Live Grounding shows Spot's cloud cost-optimization strategies, illustrating how spot instances can be dynamically allocated for ephemeral Spark workers. This guide bridges infrastructure sizing with cost management.
- coderstan.com: Apache Spark on Kubernetes—Lessons Learned from Launching Millions of Spark Executors (Databricks Data+AI Summit 2022) [ADVANCED LEVEL] [COMMUNITY-TOOL] — Curator Insight details lessons from Databricks' deployment of millions of Spark executors on Kubernetes. Live Grounding highlights Spark's core challenges with cluster autoscaling and lost-executor events. This resource outlines architecture patterns for scaling heavy data workloads on Kubernetes.
Databricks
- (2024) docs.databricks.com: Use scheduler pools for multiple streaming workloads [ADVANCED LEVEL] [COMMUNITY-TOOL] — Curator Insight explains how to configure fair scheduler pools to run concurrent streaming jobs. Live Grounding verifies that multi-tenant Databricks runtimes require resource-isolated scheduler pools to mitigate thread starvation. This documentation provides actionable enterprise patterns for streaming production loads.
- aprenderbigdata.com: Databricks: Introducción a Spark en la nube [SPANISH CONTENT] [COMMUNITY-TOOL] [GUIDE] — Curator Insight introduces the core components of the Databricks cloud platform and its managed Spark framework. Live Grounding indicates the rise in Spanish-speaking markets for distributed computing educational paths. This tutorial provides structural steps to deploy first-party data clusters.
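To make the scheduler pool idea above more concrete, here is a small sketch of assigning a streaming query to a named pool before it starts. It assumes a PySpark-style session whose `sparkContext` exposes `setLocalProperty`, and relies on the `spark.scheduler.pool` local property; the helper name `start_in_pool` and the pool names are hypothetical.

```python
def start_in_pool(spark, pool_name, start_query):
    """Start a query so that its jobs are scheduled in the given pool.

    Hypothetical helper. `spark` is assumed to be a SparkSession-like object;
    `start_query` is a zero-argument callable that starts the query and
    returns its handle. The property is thread-local, so the query must be
    started from the same thread that sets it.
    """
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    return start_query()


# Illustrative usage, assuming `spark`, `df_a`, and `df_b` already exist
# in a notebook or job (not defined in this sketch):
#
#   q1 = start_in_pool(spark, "pool1",
#                      lambda: df_a.writeStream.format("noop").start())
#   q2 = start_in_pool(spark, "pool2",
#                      lambda: df_b.writeStream.format("noop").start())
```

Wrapping the property change and the start call in one function keeps the two steps together, which reduces the chance of a query inheriting a pool set for an earlier one.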
Databricks Tools
- github.com/databrickslabs/ucx: Databricks Labs UCX ⭐ 308 [ADVANCED LEVEL] [ENTERPRISE-STABLE] [LEGACY] — Curator Insight introduces UCX as a toolset to migrate legacy workspaces to Unity Catalog. Live Grounding validates that Databricks Labs continuously maintains UCX to safely upgrade metastores with metadata isolation. This repository is standard for enterprise migration pipelines.
Hybrid Cloud and Enterprise
OpenShift
Big Data Workloads
- (2020) cloud.redhat.com: Getting Started running Spark workloads on OpenShift [COMMUNITY-TOOL] — Curator Insight: Practical guide detailing setup steps for hosting Spark data processors on OpenShift Platform. Live Grounding: Demystifies user routing, security context constraints, and performance tuning when running containerized Spark clusters on enterprise Red Hat foundations.
💡 Explore Related: Container Managers | Kubernetes Monitoring | Kubernetes Troubleshooting