Big Data on Kubernetes

!!! info "Architectural Context"
    Detailed reference for big data workloads on Kubernetes in the context of The Container Stack.

Standard Reference

Cloud Native AI

Batch Workloads

Kueue Scheduling

  • Red Hat Build of Kueue [ADVANCED LEVEL] [DOCUMENTATION] [COMMUNITY-TOOL] - Curator Insight: Documentation for the Red Hat build of Kueue on OpenShift. Live Grounding: Kueue provides job queueing, priority-based ordering, and resource quotas for managing AI/ML and batch workloads.
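
The quota and queueing model mentioned above can be sketched as a pair of Kueue manifests. This is a minimal illustration, not taken from the linked documentation: it assumes the upstream `kueue.x-k8s.io/v1beta1` API (the version shipped in a given Red Hat build may differ), and the names `batch-cq`, `team-a-queue`, `team-a`, and `default-flavor` are hypothetical. The referenced ResourceFlavor is assumed to exist already in the cluster.

```python
import json


def cluster_queue(name: str, flavor: str, cpu: int, memory: str) -> dict:
    """Build a ClusterQueue manifest with nominal CPU and memory quotas."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",  # assumed API version
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            # An empty selector lets every namespace use this queue.
            "namespaceSelector": {},
            "resourceGroups": [
                {
                    "coveredResources": ["cpu", "memory"],
                    "flavors": [
                        {
                            "name": flavor,  # must match an existing ResourceFlavor
                            "resources": [
                                {"name": "cpu", "nominalQuota": cpu},
                                {"name": "memory", "nominalQuota": memory},
                            ],
                        }
                    ],
                }
            ],
        },
    }


def local_queue(name: str, namespace: str, cluster_queue_name: str) -> dict:
    """Build a namespaced LocalQueue that points at a ClusterQueue."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "LocalQueue",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue_name},
    }


if __name__ == "__main__":
    # Hypothetical names and quota values, for illustration only.
    manifests = [
        cluster_queue("batch-cq", "default-flavor", cpu=8, memory="32Gi"),
        local_queue("team-a-queue", "team-a", "batch-cq"),
    ]
    for manifest in manifests:
        print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests. A batch Job would then opt in by carrying the `kueue.x-k8s.io/queue-name` label set to the LocalQueue name.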

Big Data Orchestration

Data Pipelines

  • hevodata.com: Building Apache Spark Data Pipeline? Made Easy 101 🌟 [COMMUNITY-TOOL] [GUIDE] - Curator Insight: Introductory guide to building Apache Spark data pipelines. Live Grounding: Explains the basics of Spark RDDs and DataFrames and the connections needed to move data from transactional sources into cloud data warehouses.

Market Surveys

  • opensourceforu.com: Kubernetes Adoption Widespread for Big Data: Survey [COMMUNITY-TOOL] - Curator Insight: Survey results on the widespread adoption of Kubernetes for big data workloads. Live Grounding: Describes the shift from static clusters to unified container environments, citing resource efficiency and deployment agility as the top motivators.

Spark on Kubernetes

  • (2021) datamechanics.co: Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available [COMMUNITY-TOOL] - Curator Insight: Release announcement marking native Kubernetes support as generally available in Apache Spark 3.1. Live Grounding: Covers the native scheduling capabilities, executor decommissioning behavior, and executor tracking improvements that made Kubernetes a first-class cluster manager for Spark.
  • itnext.io: Migrating Apache Spark workloads from AWS EMR to Kubernetes [ADVANCED LEVEL] [COMMUNITY-TOOL] - Curator Insight: Technical breakdown of migrating Apache Spark workloads from AWS EMR to Kubernetes clusters. Live Grounding: Covers memory allocation, dynamic resource allocation, storage mounting, and cost optimization compared with traditional VM-based EMR setups.
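
As a rough illustration of what native Spark on Kubernetes submission looks like, the sketch below assembles a `spark-submit` command line. It is not taken from either linked article. The helper `spark_submit_args` and all example values (API server URL, image name, application path) are hypothetical; the configuration keys are standard Spark properties. Shuffle tracking is switched on together with dynamic allocation on the assumption that no external shuffle service is available in the cluster.

```python
import shlex


def spark_submit_args(api_server, image, app, namespace="spark",
                      executors=2, dynamic_allocation=False):
    """Return a spark-submit argument list targeting a Kubernetes cluster."""
    conf = {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }
    if dynamic_allocation:
        # Assumption: no external shuffle service, so shuffle tracking is
        # enabled alongside dynamic allocation.
        conf["spark.dynamicAllocation.enabled"] = "true"
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"

    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # k8s:// prefix selects the Kubernetes backend
        "--deploy-mode", "cluster",         # the driver runs in a pod as well
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    args = spark_submit_args(
        api_server="https://k8s.example.com:6443",
        image="registry.example.com/spark:3.1.1",
        app="local:///opt/spark/work-dir/job.py",
        dynamic_allocation=True,
    )
    print(" ".join(shlex.quote(a) for a in args))
```

Building the arguments as a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.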

Data Platforms

Distributed Processing

Apache Spark on Kubernetes

Databricks

  • (2024) docs.databricks.com: Use scheduler pools for multiple streaming workloads [ADVANCED LEVEL] [COMMUNITY-TOOL] - Curator Insight: Explains how to configure fair scheduler pools to run concurrent streaming jobs. Live Grounding: Notes that concurrent streaming queries on a shared Databricks runtime need resource-isolated scheduler pools to avoid thread starvation, and provides patterns for production streaming workloads.
  • aprenderbigdata.com: Databricks: Introducción a Spark en la nube (Databricks: Introduction to Spark in the Cloud) [SPANISH CONTENT] [COMMUNITY-TOOL] [GUIDE] - Curator Insight: Introduces the core components of the Databricks cloud platform and its managed Spark framework. Live Grounding: Reflects growing demand for distributed computing training in Spanish-speaking markets; the tutorial walks through the steps to deploy data clusters.
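
The scheduler pool pattern for concurrent streaming queries can be sketched as follows. This is a minimal illustration under stated assumptions: `start_in_pool` is a hypothetical helper, `spark` is assumed to be an existing SparkSession, and the pool names are arbitrary. The only Spark call used is `sparkContext.setLocalProperty` with the `spark.scheduler.pool` key, which assigns jobs started from the current thread to the named pool.

```python
def start_in_pool(spark, pool, start_query):
    """Assign the current thread to a scheduler pool, then start a query.

    `spark` is an existing SparkSession; `start_query` is a zero-argument
    callable that starts the streaming query and returns its handle.
    """
    # Jobs launched from this thread after this call run in the named pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    return start_query()


# Usage on a cluster (not executed here; df1, df2, and the table names
# are hypothetical):
#
#   q1 = start_in_pool(spark, "pool1",
#                      lambda: df1.writeStream.queryName("query1").toTable("table1"))
#   q2 = start_in_pool(spark, "pool2",
#                      lambda: df2.writeStream.queryName("query2").toTable("table2"))
```

Setting the property immediately before each query starts gives each query its own pool, so one stream's micro-batches do not queue behind another's.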

Databricks Tools

  • github.com/databrickslabs/ucx: Databricks Labs UCX 308 [ADVANCED LEVEL] [ENTERPRISE-STABLE] [LEGACY] - Curator Insight: Introduces UCX as a toolset for migrating legacy workspaces to Unity Catalog. Live Grounding: Databricks Labs maintains UCX to upgrade metastores safely with metadata isolation, and the repository is a standard choice for enterprise migration pipelines.

Hybrid Cloud and Enterprise

OpenShift

Big Data Workloads

  • (2020) cloud.redhat.com: Getting Started running Spark workloads on OpenShift [COMMUNITY-TOOL] - Curator Insight: Practical guide to setting up Spark workloads on OpenShift. Live Grounding: Explains user routing, security context constraints, and performance tuning for containerized Spark clusters on Red Hat OpenShift.

💡 Explore Related: Container Managers | Kubernetes Monitoring | Kubernetes Troubleshooting