feat: Load tests

2026-02-14 18:09:50 +00:00 · 2026-01-06 11:03:26 +01:00
parent a5d1012570
commit 9a3edf13d2
17 changed files with 6191 additions and 82 deletions
--- a/test/loadtest/README.md
+++ b/test/loadtest/README.md
@@ -0,0 +1,544 @@
+# Reloader Load Test Framework
+
+This framework provides A/B comparison testing between two Reloader container images.
+
+## Overview
+
+The load test framework:
+1. Creates a local kind cluster (1 control-plane + 6 worker nodes)
+2. Deploys Prometheus for metrics collection
+3. Loads the provided Reloader container images into the cluster
+4. Runs standardized test scenarios (S1-S13)
+5. Collects metrics via Prometheus scraping
+6. Generates comparison reports with pass/fail criteria
+
+## Prerequisites
+
+- Docker or Podman
+- kind (Kubernetes in Docker)
+- kubectl
+- Go 1.22+
+
+## Building
+
+```bash
+cd test/loadtest
+go build -o loadtest ./cmd/loadtest
+```
+
+## Quick Start
+
+```bash
+# Compare two published images (e.g., different versions)
+./loadtest run \
+  --old-image=stakater/reloader:v1.0.0 \
+  --new-image=stakater/reloader:v1.1.0
+
+# Run a specific scenario
+./loadtest run \
+  --old-image=stakater/reloader:v1.0.0 \
+  --new-image=stakater/reloader:v1.1.0 \
+  --scenario=S2 \
+  --duration=120
+
+# Test only a single image (no comparison)
+./loadtest run --new-image=myregistry/reloader:dev
+
+# Use local images built with docker/podman
+./loadtest run \
+  --old-image=localhost/reloader:baseline \
+  --new-image=localhost/reloader:feature-branch
+
+# Skip cluster creation (use existing kind cluster)
+./loadtest run \
+  --old-image=stakater/reloader:v1.0.0 \
+  --new-image=stakater/reloader:v1.1.0 \
+  --skip-cluster
+
+# Run all scenarios in parallel on 4 clusters (faster execution)
+./loadtest run \
+  --new-image=localhost/reloader:dev \
+  --parallelism=4
+
+# Run all 13 scenarios in parallel (one cluster per scenario)
+./loadtest run \
+  --new-image=localhost/reloader:dev \
+  --parallelism=13
+
+# Generate report from existing results
+./loadtest report --scenario=S2 --results-dir=./results
+```
+
+## Command Line Options
+
+### Run Command
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--old-image=IMAGE` | Container image for "old" version | - |
+| `--new-image=IMAGE` | Container image for "new" version | - |
+| `--scenario=ID` | Test scenario: S1-S13 or "all" | all |
+| `--duration=SECONDS` | Test duration in seconds | 60 |
+| `--parallelism=N` | Run N scenarios in parallel on N kind clusters | 1 |
+| `--skip-cluster` | Skip kind cluster creation (use existing, only for parallelism=1) | false |
+| `--results-dir=DIR` | Directory for results | ./results |
+
+**Note:** At least one of `--old-image` or `--new-image` is required. Provide both for A/B comparison.
+
+### Report Command
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--scenario=ID` | Scenario to report on (required) | - |
+| `--results-dir=DIR` | Directory containing results | ./results |
+| `--output=FILE` | Output file | stdout |
+
+## Test Scenarios
+
+| ID  | Name                  | Description                                     |
+|-----|-----------------------|-------------------------------------------------|
+| S1  | Burst Updates         | Many ConfigMap/Secret updates in quick succession |
+| S2  | Fan-Out               | One ConfigMap used by many (50) workloads       |
+| S3  | High Cardinality      | Many CMs/Secrets across many namespaces         |
+| S4  | No-Op Updates         | Updates that don't change data (annotation only)|
+| S5  | Workload Churn        | Deployments created/deleted rapidly             |
+| S6  | Controller Restart    | Restart controller pod under load               |
+| S7  | API Pressure          | Many concurrent update requests                 |
+| S8  | Large Objects         | ConfigMaps > 100KB                              |
+| S9  | Multi-Workload Types  | Tests all workload types (Deploy, STS, DS)      |
+| S10 | Secrets + Mixed       | Secrets and mixed ConfigMap+Secret workloads    |
+| S11 | Annotation Strategy   | Tests `--reload-strategy=annotations`           |
+| S12 | Pause & Resume        | Tests pause-period during rapid updates         |
+| S13 | Complex References    | Init containers, valueFrom, projected volumes   |
+
+## Metrics Reference
+
+This section explains each metric collected during load tests, what it measures, and what different values might indicate.
+
+### Counter Metrics (Totals)
+
+#### `reconcile_total`
+**What it measures:** The total number of reconciliation loops executed by the controller.
+
+**What it indicates:**
+- **Higher in new vs old:** The new controller-runtime implementation may batch events differently. This is often expected behavior, not a problem.
+- **Lower in new vs old:** Better event batching/deduplication. Controller-runtime's work queue naturally deduplicates events.
+- **Expected behavior:** The new implementation typically has *fewer* reconciles due to intelligent event batching.
+
+#### `action_total`
+**What it measures:** The total number of reload actions triggered (rolling restarts of Deployments/StatefulSets/DaemonSets).
+
+**What it indicates:**
+- **Should match expected value:** Both implementations should trigger the same number of reloads for the same workload.
+- **Lower than expected:** Some updates were missed - potential bug or race condition.
+- **Higher than expected:** Duplicate reloads triggered - inefficiency but not data loss.
+
+#### `reload_executed_total`
+**What it measures:** Reload operations executed, labeled by `success=true/false`.
+
+**What it indicates:**
+- **`success=true` count:** Number of workloads successfully restarted.
+- **`success=false` count:** Failed restart attempts (API errors, permission issues).
+- **`success=true` count should match `action_total`:** If it is significantly lower, reloads are failing.
+
+#### `workloads_scanned_total`
+**What it measures:** Number of workloads (Deployments, etc.) scanned when checking for ConfigMap/Secret references.
+
+**What it indicates:**
+- **High count:** Controller is scanning many workloads per reconcile.
+- **Expected behavior:** Should roughly match the number of workloads × number of reconciles.
+- **Optimization signal:** If very high, namespace filtering or label selectors could help.
+
+#### `workloads_matched_total`
+**What it measures:** Number of workloads that matched (reference the changed ConfigMap/Secret).
+
+**What it indicates:**
+- **Should match `reload_executed_total`:** Every matched workload should be reloaded.
+- **Higher than reloads:** Some matched workloads weren't reloaded (potential issue).
+
+#### `errors_total`
+**What it measures:** Total errors encountered, labeled by error type.
+
+**What it indicates:**
+- **Should be 0:** Any errors indicate problems.
+- **Common causes:** API server timeouts, RBAC issues, resource conflicts.
+- **Critical metric:** Non-zero errors in production should be investigated.
+
+### API Efficiency Metrics (REST Client)
+
+These metrics track Kubernetes API server calls made by Reloader. Lower values indicate more efficient operation with less API server load.
+
+#### `rest_client_requests_total`
+**What it measures:** Total number of HTTP requests made to the Kubernetes API server.
+
+**What it indicates:**
+- **Lower is better:** Fewer API calls means less load on the API server.
+- **High count:** May indicate inefficient caching or excessive reconciles.
+- **Comparison use:** Shows overall API efficiency between implementations.
+
+#### `rest_client_requests_get`
+**What it measures:** Number of GET requests (fetching individual resources or listings).
+
+**What it indicates:**
+- **Includes:** Fetching ConfigMaps, Secrets, Deployments, etc.
+- **Higher count:** More frequent resource fetching, possibly due to cache misses.
+- **Expected behavior:** Controller-runtime's caching should reduce GET requests compared to direct API calls.
+
+#### `rest_client_requests_patch`
+**What it measures:** Number of PATCH requests (partial updates to resources).
+
+**What it indicates:**
+- **Used for:** Rolling restart annotations on workloads.
+- **Should correlate with:** `reload_executed_total` - each reload typically requires one PATCH.
+- **Lower is better:** Fewer patches means more efficient batching or deduplication.
+
+#### `rest_client_requests_put`
+**What it measures:** Number of PUT requests (full resource updates).
+
+**What it indicates:**
+- **Used for:** Full object replacements (less common than PATCH).
+- **Should be low:** Most updates use PATCH for efficiency.
+- **High count:** May indicate suboptimal update strategy.
+
+#### `rest_client_requests_errors`
+**What it measures:** Number of failed API requests (4xx/5xx responses).
+
+**What it indicates:**
+- **Should be 0:** Errors indicate API server issues or permission problems.
+- **Common causes:** Rate limiting, RBAC issues, resource conflicts, network issues.
+- **Non-zero:** Investigate API server logs and Reloader permissions.
+
+### Latency Metrics (Percentiles)
+
+All latency metrics are reported in **seconds**. The report shows p50 (median), p95, and p99 percentiles.
+
+#### `reconcile_duration (s)`
+**What it measures:** Time spent inside each reconcile loop, from start to finish.
+
+**What it indicates:**
+- **p50 (median):** Typical reconcile time. Should be < 0.1s (100ms) for good performance.
+- **p95:** 95th percentile - only 5% of reconciles take longer than this.
+- **p99:** 99th percentile - indicates worst-case performance.
+
+**Interpreting differences:**
+- **New higher than old:** Controller-runtime reconciles may do more work per loop but run fewer times. Check `reconcile_total` - if it's lower, this is expected.
+- **Minor differences (< 0.5s absolute):** Not significant for sub-second values.
+
+#### `action_latency (s)`
+**What it measures:** End-to-end time from ConfigMap/Secret change detection to workload restart triggered.
+
+**What it indicates:**
+- **This is the user-facing latency:** How long users wait for their config changes to take effect.
+- **p50 < 1s:** Excellent - most changes apply within a second.
+- **p95 < 5s:** Good - even under load, changes apply quickly.
+- **p99 > 10s:** May need investigation - some changes take too long.
+
+**What affects this:**
+- API server responsiveness
+- Number of workloads to scan
+- Concurrent updates competing for resources
+
+### Understanding the Report
+
+#### Report Columns
+
+```
+Metric                           Old          New   Expected  Old✓  New✓   Status
+------                           ---          ---   --------  ----  ----   ------
+action_total                  100.00       100.00        100     ✓     ✓     pass
+action_latency_p95 (s)          0.15         0.04          -     -     -     pass
+```
+
+- **Old/New:** Measured values from each implementation
+- **Expected:** Known expected value (for throughput metrics)
+- **Old✓/New✓:** Whether the value is within 15% of expected (✓ = yes, ✗ = no, - = no expected value)
+- **Status:** pass/fail based on comparison thresholds
+
+#### Pass/Fail Logic
+
+| Metric Type | Pass Condition |
+|-------------|----------------|
+| Throughput (action_total, reload_executed_total) | New value within 15% of expected |
+| Latency (p50, p95, p99) | New not more than threshold% worse than old, OR absolute difference < minimum threshold |
+| Errors | New ≤ Old (ideally both 0) |
+| API Efficiency (rest_client_requests_*) | New not more than 50% higher than Old (lower is better) |
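+
+The throughput and API efficiency rows above can be read as two small predicates. The following is a minimal sketch in Go, not the framework's actual code: the names `withinTolerance` and `apiEfficiencyPasses` are hypothetical, and the handling of an expected value of 0 is an assumption.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+)
+
+// withinTolerance reports whether actual is within tol (0.15 = 15%) of expected.
+// Hypothetical helper: the framework's own function names may differ.
+func withinTolerance(actual, expected, tol float64) bool {
+	if expected == 0 {
+		return actual == 0 // assumption: an expected value of 0 must be matched exactly
+	}
+	return math.Abs(actual-expected)/math.Abs(expected) <= tol
+}
+
+// apiEfficiencyPasses reports whether newCount is at most
+// maxIncrease (0.5 = 50%) above oldCount.
+func apiEfficiencyPasses(oldCount, newCount, maxIncrease float64) bool {
+	return newCount <= oldCount*(1+maxIncrease)
+}
+
+func main() {
+	fmt.Println(withinTolerance(50, 50, 0.15))      // true: exact match
+	fmt.Println(withinTolerance(40, 50, 0.15))      // false: 20% below expected
+	fmt.Println(apiEfficiencyPasses(850, 720, 0.5)) // true: fewer requests than old
+}
+```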
+
+#### Latency Thresholds
+
+Latency comparisons use both percentage AND absolute thresholds to avoid false failures:
+
+| Metric | Max % Worse | Min Absolute Diff |
+|--------|-------------|-------------------|
+| p50 | 100% | 0.5s |
+| p95 | 100% | 1.0s |
+| p99 | 100% | 1.0s |
+
+**Example:** If old p50 = 0.01s and new p50 = 0.08s:
+- Percentage difference: +700% (would fail % check)
+- Absolute difference: 0.07s (< 0.5s threshold)
+- **Result: PASS** (both values are fast enough that the difference doesn't matter)
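+
+The two-part latency rule can be sketched as a single function. This is an illustrative Go version under stated assumptions, not the framework's implementation: `latencyPasses` is a hypothetical name, the percentage threshold is passed as a fraction (1.0 = 100%), and the zero-baseline branch is an assumption.
+
+```go
+package main
+
+import "fmt"
+
+// latencyPasses applies both thresholds: a regression passes if it is
+// smaller than minAbsDiff seconds, or if it is no more than maxPctWorse
+// (as a fraction of the old value) worse than the old latency.
+func latencyPasses(oldSec, newSec, maxPctWorse, minAbsDiff float64) bool {
+	diff := newSec - oldSec
+	if diff <= 0 {
+		return true // new is as fast as old, or faster
+	}
+	if diff < minAbsDiff {
+		return true // regression is too small in absolute terms to matter
+	}
+	if oldSec == 0 {
+		return false // assumption: a large regression from a zero baseline fails
+	}
+	return diff/oldSec <= maxPctWorse
+}
+
+func main() {
+	// p50 thresholds from the table: 100% worse, 0.5s minimum absolute difference.
+	fmt.Println(latencyPasses(0.01, 0.08, 1.0, 0.5)) // true: +700% but only 0.07s slower
+	fmt.Println(latencyPasses(1.0, 3.0, 1.0, 0.5))   // false: +200% and 2.0s slower
+}
+```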
+
+### Resource Consumption Metrics
+
+These metrics track CPU, memory, and Go runtime resource usage. Lower values generally indicate more efficient operation.
+
+#### Memory Metrics
+
+| Metric | Description | Unit |
+|--------|-------------|------|
+| `memory_rss_mb_avg` | Average RSS (resident set size) memory | MB |
+| `memory_rss_mb_max` | Peak RSS memory during test | MB |
+| `memory_heap_mb_avg` | Average Go heap allocation | MB |
+| `memory_heap_mb_max` | Peak Go heap allocation | MB |
+
+**What to watch for:**
+- **High RSS:** May indicate memory leaks or inefficient caching
+- **High heap:** Many objects being created (check GC metrics)
+- **Growing over time:** Potential memory leak
+
+#### CPU Metrics
+
+| Metric | Description | Unit |
+|--------|-------------|------|
+| `cpu_cores_avg` | Average CPU usage rate | cores |
+| `cpu_cores_max` | Peak CPU usage rate | cores |
+
+**What to watch for:**
+- **High CPU:** Inefficient algorithms or excessive reconciles
+- **Spiky max:** May indicate burst handling issues
+
+#### Go Runtime Metrics
+
+| Metric | Description | Unit |
+|--------|-------------|------|
+| `goroutines_avg` | Average goroutine count | count |
+| `goroutines_max` | Peak goroutine count | count |
+| `gc_pause_p99_ms` | 99th percentile GC pause time | ms |
+
+**What to watch for:**
+- **High goroutines:** Potential goroutine leak or unbounded concurrency
+- **High GC pause:** Large heap or allocation pressure
+
+### Scenario-Specific Expectations
+
+| Scenario | Key Metrics to Watch | Expected Behavior |
+|----------|---------------------|-------------------|
+| S1 (Burst) | action_latency_p99, cpu_cores_max, goroutines_max | Should handle bursts without queue backup |
+| S2 (Fan-Out) | reconcile_total, workloads_matched_total, memory_rss_mb_max | One CM change → 50 workload reloads |
+| S3 (High Cardinality) | reconcile_duration, memory_heap_mb_avg | Many namespaces shouldn't increase memory |
+| S4 (No-Op) | action_total = 0, cpu_cores_avg should be low | Minimal resource usage for no-op |
+| S5 (Churn) | errors_total, goroutines_avg | Graceful handling, no goroutine leak |
+| S6 (Restart) | All metrics captured | Metrics survive controller restart |
+| S7 (API Pressure) | errors_total, cpu_cores_max, goroutines_max | No errors under concurrent load |
+| S8 (Large Objects) | memory_rss_mb_max, gc_pause_p99_ms | Large ConfigMaps don't cause OOM or GC issues |
+| S9 (Multi-Workload) | reload_executed_total per type | All workload types (Deploy, STS, DS) reload |
+| S10 (Secrets) | reload_executed_total, workloads_matched_total | Both Secrets and ConfigMaps trigger reloads |
+| S11 (Annotation) | workload annotations present | Deployments get `last-reloaded-from` annotation |
+| S12 (Pause) | reload_executed_total << updates | Pause-period reduces reload frequency |
+| S13 (Complex) | reload_executed_total | All reference types trigger reloads |
+
+### Troubleshooting
+
+#### New implementation shows 0 for all metrics
+- Check if Prometheus is scraping the new Reloader pod
+- Verify pod annotations: `prometheus.io/scrape: "true"`
+- Check Prometheus targets: `http://localhost:9091/targets`
+
+#### Metrics don't match expected values
+- Verify test ran to completion (check logs)
+- Ensure Prometheus scraped final metrics (18s wait after test)
+- Check for pod restarts during test (metrics reset on restart - handled by `increase()`)
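+
+The `increase()` note refers to how Prometheus treats a drop in a counter as a reset rather than as a negative change. The idea can be sketched in Go as follows. This is a simplification: it omits the extrapolation that PromQL's `increase()` applies at the edges of the query window, and `counterIncrease` is a hypothetical name.
+
+```go
+package main
+
+import "fmt"
+
+// counterIncrease sums the growth of a counter across samples,
+// treating any drop as a reset to zero (for example after a pod restart).
+func counterIncrease(samples []float64) float64 {
+	total := 0.0
+	for i := 1; i < len(samples); i++ {
+		prev, cur := samples[i-1], samples[i]
+		if cur >= prev {
+			total += cur - prev
+		} else {
+			total += cur // counter restarted from zero, so cur is all new growth
+		}
+	}
+	return total
+}
+
+func main() {
+	// The counter reaches 30, the pod restarts, then it counts up to 20 again.
+	fmt.Println(counterIncrease([]float64{0, 10, 30, 5, 20})) // 50
+}
+```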
+
+#### High latency in new implementation
+- Check Reloader pod resource limits
+- Look for API server throttling in logs
+- Compare `reconcile_total` - fewer reconciles with higher duration may be normal
+
+#### REST client errors are non-zero
+- **Common causes:**
+  - Optional CRD schemes registered but CRDs not installed (e.g., Argo Rollouts, OpenShift DeploymentConfig)
+  - API server rate limiting under high load
+  - RBAC permissions missing for certain resource types
+- **Argo Rollouts errors:** If you see ~4 errors per test and are not using Argo Rollouts, set `--enable-argo-rollouts=false`
+- **OpenShift errors:** Similarly, ensure DeploymentConfig support is disabled on non-OpenShift clusters
+
+#### REST client requests much higher in new implementation
+- Check if caching is working correctly
+- Look for excessive re-queuing in controller logs
+- Compare `reconcile_total` - more reconciles naturally means more API calls
+
+## Report Format
+
+The report generator produces a comparison table with units and expected value indicators:
+
+```
+================================================================================
+                     RELOADER A/B COMPARISON REPORT
+================================================================================
+
+Scenario:     S2
+Generated:    2026-01-03 14:30:00
+Status:       PASS
+Summary:      All metrics within acceptable thresholds
+
+Test:         S2: Fan-out test - 1 CM update triggers 50 deployment reloads
+
+--------------------------------------------------------------------------------
+                           METRIC COMPARISONS
+--------------------------------------------------------------------------------
+(Old✓/New✓ = meets expected value within 15%)
+
+Metric                                   Old          New   Expected  Old✓  New✓   Status
+------                                   ---          ---   --------  ----  ----   ------
+reconcile_total                        50.00        25.00          -     -     -     pass
+reconcile_duration_p50 (s)              0.01         0.05          -     -     -     pass
+reconcile_duration_p95 (s)              0.02         0.15          -     -     -     pass
+action_total                           50.00        50.00         50     ✓     ✓     pass
+action_latency_p50 (s)                  0.05         0.03          -     -     -     pass
+action_latency_p95 (s)                  0.12         0.08          -     -     -     pass
+errors_total                            0.00         0.00          -     -     -     pass
+reload_executed_total                  50.00        50.00         50     ✓     ✓     pass
+workloads_scanned_total                50.00        50.00         50     ✓     ✓     pass
+workloads_matched_total                50.00        50.00         50     ✓     ✓     pass
+rest_client_requests_total              850         720            -     -     -     pass
+rest_client_requests_get                500         420            -     -     -     pass
+rest_client_requests_patch              300         250            -     -     -     pass
+rest_client_requests_errors               0           0            -     -     -     pass
+```
+
+Reports are saved to `results/<scenario>/report.txt` after each test.
+
+## Directory Structure
+
+```
+test/loadtest/
+├── cmd/
+│   └── loadtest/              # Unified CLI (run + report)
+│       └── main.go
+├── internal/
+│   ├── cluster/               # Kind cluster management
+│   │   └── kind.go
+│   ├── prometheus/            # Prometheus deployment & querying
+│   │   └── prometheus.go
+│   ├── reloader/              # Reloader deployment
+│   │   └── deploy.go
+│   └── scenarios/             # Test scenario implementations
+│       └── scenarios.go
+├── manifests/
+│   └── prometheus.yaml        # Prometheus deployment manifest
+├── results/                   # Generated after tests
+│   └── <scenario>/
+│       ├── old/               # Old version data
+│       │   ├── *.json         # Prometheus metric snapshots
+│       │   └── reloader.log   # Reloader pod logs
+│       ├── new/               # New version data
+│       │   ├── *.json         # Prometheus metric snapshots
+│       │   └── reloader.log   # Reloader pod logs
+│       ├── expected.json      # Expected values from test
+│       └── report.txt         # Comparison report
+├── go.mod
+├── go.sum
+└── README.md
+```
+
+## Building Local Images for Testing
+
+If you want to test local code changes:
+
+```bash
+# Build the new Reloader image from current source
+docker build -t localhost/reloader:dev -f Dockerfile .
+
+# Build from a different branch/commit
+git checkout feature-branch
+docker build -t localhost/reloader:feature -f Dockerfile .
+
+# Then run comparison
+./loadtest run \
+  --old-image=stakater/reloader:v1.0.0 \
+  --new-image=localhost/reloader:feature
+```
+
+## Interpreting Results
+
+### PASS
+All metrics are within acceptable thresholds. The new implementation is comparable to or better than the old one.
+
+### FAIL
+One or more metrics exceeded thresholds. Review the specific metrics:
+- **Latency degradation**: p95/p99 latencies are significantly higher
+- **Missed reloads**: `reload_executed_total` differs significantly from the expected value
+- **Errors increased**: `errors_total` is higher in new version
+
+### Investigation
+
+If tests fail, check:
+1. Pod logs: `kubectl logs -n reloader-new deployment/reloader` (or check `results/<scenario>/new/reloader.log`)
+2. Resource usage: `kubectl top pods -n reloader-new`
+3. Events: `kubectl get events -n reloader-test`
+
+## Parallel Execution
+
+The `--parallelism` option enables running scenarios on multiple kind clusters simultaneously, significantly reducing total test time.
+
+### How It Works
+
+1. **Multiple Clusters**: Creates N kind clusters named `reloader-loadtest-0`, `reloader-loadtest-1`, etc.
+2. **Separate Prometheus**: Each cluster gets its own Prometheus instance with a unique port (9091, 9092, etc.)
+3. **Worker Pool**: Scenarios are distributed to workers via a channel, with each worker running on its own cluster
+4. **Independent Execution**: Each scenario runs on its own cluster, so scenarios do not compete for Kubernetes resources (the clusters still share the host's CPU and memory)
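+
+The worker-pool step above can be sketched as follows. This is a minimal illustration in Go, not the framework's code: `runParallel` and the `run` callback are hypothetical, and `run` stands in for deploying Reloader and executing one scenario on the named cluster.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// runParallel hands scenarios to one worker per cluster over a channel.
+// Each worker owns a cluster name and processes scenarios until the
+// channel is closed.
+func runParallel(scenarios []string, parallelism int, run func(cluster, scenario string)) {
+	if parallelism > len(scenarios) {
+		parallelism = len(scenarios) // never use more clusters than scenarios
+	}
+	if parallelism < 1 {
+		parallelism = 1
+	}
+	jobs := make(chan string)
+	var wg sync.WaitGroup
+	for i := 0; i < parallelism; i++ {
+		wg.Add(1)
+		go func(id int) {
+			defer wg.Done()
+			cluster := fmt.Sprintf("reloader-loadtest-%d", id)
+			for s := range jobs {
+				run(cluster, s)
+			}
+		}(i)
+	}
+	for _, s := range scenarios {
+		jobs <- s
+	}
+	close(jobs)
+	wg.Wait()
+}
+
+func main() {
+	var mu sync.Mutex
+	done := map[string]string{}
+	runParallel([]string{"S1", "S2", "S3"}, 4, func(cluster, scenario string) {
+		mu.Lock()
+		defer mu.Unlock()
+		done[scenario] = cluster
+	})
+	fmt.Println(len(done)) // 3
+}
+```
+
+An unbuffered channel keeps the sketch simple: a worker picks up the next scenario only once it has finished the previous one, so a slow scenario does not hold back the other clusters.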
+
+### Usage
+
+```bash
+# Run 4 scenarios at a time (creates 4 clusters)
+./loadtest run --new-image=my-image:tag --parallelism=4
+
+# Run all 13 scenarios in parallel (creates 13 clusters)
+./loadtest run --new-image=my-image:tag --parallelism=13 --scenario=all
+```
+
+### Resource Requirements
+
+Parallel execution requires significant system resources:
+
+| Parallelism | Clusters | Est. Memory | Est. CPU |
+|-------------|----------|-------------|----------|
+| 1 (default) | 1 | ~4GB | 2-4 cores |
+| 4 | 4 | ~16GB | 8-16 cores |
+| 13 | 13 | ~52GB | 26-52 cores |
+
+### Notes
+
+- The `--skip-cluster` option is not supported with parallelism > 1
+- Each worker loads images independently, so initial setup takes longer
+- All results are written to the same `--results-dir` with per-scenario subdirectories
+- If a cluster setup fails, remaining workers continue with available clusters
+- If `--parallelism` is higher than the number of scenarios, it is automatically reduced to the scenario count
+
+## CI Integration
+
+### GitHub Actions
+
+Load tests can be triggered on pull requests by commenting `/loadtest`:
+
+```
+/loadtest
+```
+
+This will:
+1. Build a container image from the PR branch
+2. Run all load test scenarios against it
+3. Post results as a PR comment
+4. Upload detailed results as artifacts
+
+### Make Target
+
+Run load tests locally or in CI:
+
+```bash
+# From repository root
+make loadtest
+```
+
+This builds the container image and runs all scenarios with a 60-second duration.