kubescape/docs/optimization-plan.md
Commit fbef268f22 by Matthias Bertschy
feat: optimize CPU and memory usage for resource-intensive scans
Implement Phases 1-3 of the performance optimization plan to address
issue #1793 - reduce CPU and memory consumption for system-constrained
environments.

Phase 1 - OPA Module Caching:
- Add compiledModules cache to OPAProcessor with thread-safe access
- Cache compiled OPA rules to eliminate redundant compilation
- Reuse compiled modules with double-checked locking pattern
- Expected CPU savings: 30-40%

Phase 2 - Map Pre-sizing:
- Add estimateClusterSize() to calculate resource count
- Pre-size AllResources, ResourcesResult, and related maps
- Reduce memory reallocations and GC pressure
- Expected memory savings: 10-20%

Phase 3 - Set-based Deduplication:
- Add thread-safe StringSet utility in core/pkg/utils
- Replace O(n) slices.Contains() with O(1) map operations
- Use StringSet for image scanning and related resources deduplication
- 100% test coverage for new utility
- Expected CPU savings: 5-10% for large clusters

Full optimization plan documented in optimization-plan.md

Related: #1793
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
2026-02-04 08:07:54 +01:00

# Kubescape CPU/Memory Optimization Plan
**Issue:** #1793 - High CPU and Memory Usage on System-Constrained Environments
**Date:** February 3, 2026
**Root Cause Analysis:** Completed
**Proposed Solution:** Combined optimization approach across multiple components
---
## Executive Summary
Investigation into issue #1793 revealed that the original worker pool proposal addressed the symptoms but not the root causes. The actual sources of resource exhaustion are:
- **Memory:** Unbounded data structures loading entire cluster state into memory
- **CPU:** Repeated expensive operations (OPA compilation) and nested loop complexity
This document outlines a phased approach to reduce memory usage by 40-60% and CPU usage by 30-50%.
---
## Root Cause Analysis
### Memory Hotspots
1. **AllResources Map** (`core/cautils/datastructures.go:53`)
- Loads ALL Kubernetes resources into memory at once
- No pre-sizing, so the map reallocates repeatedly as it grows
- Contains every pod, deployment, service, etc. in cluster
- **Impact:** Hundreds of MBs to several GBs for large clusters
2. **ResourcesResult Map** (`core/cautils/datastructures.go:54`)
- Stores scan results for every resource
- Grows dynamically without capacity hints
- **Impact:** Proportional to resources scanned
3. **Temporary Data Structures**
- Nested loops create temporary slices in `getKubernetesObjects`
- Repeated allocation per rule evaluation
- **Impact:** Memory churn and GC pressure
### CPU Hotspots
1. **OPA Module Compilation** (`core/pkg/opaprocessor/processorhandler.go:324-330`)
- Comment explicitly states: *"OPA module compilation is the most resource-intensive operation"*
- Compiles EVERY rule from scratch (no caching)
- Typical scan: ~100 controls × 5 rules = 500+ compilations
- **Impact:** High CPU, repeated compilation overhead
2. **6-Level Nested Loops** (`core/pkg/opaprocessor/processorhandlerutils.go:136-167`)
- Creates temporary data structures for each rule
- Iterates all matched resources multiple times
- **Impact:** O(n×m×...) complexity
3. **O(n) Slice Operations**
- `slices.Contains()` for deduplication in image scanning
- `RelatedResourcesIDs` slice growth with O(n) membership checks
- **Impact:** Degraded performance with larger datasets
### Codebase Evidence
The codebase already acknowledges the problem in a source comment:
```go
// isLargeCluster returns true if the cluster size is larger than the largeClusterSize
// This code is a workaround for large clusters. The final solution will be to scan resources individually
// Source: core/pkg/opaprocessor/processorhandlerutils.go:279
```
---
## Proposed Solutions: Six-Phase Implementation
### Phase 1: OPA Module Caching
**Objective:** Eliminate redundant rule compilations
**Files Modified:**
- `core/pkg/opaprocessor/processorhandler.go`
- `core/pkg/opaprocessor/processorhandler_test.go`
**Changes:**
```go
type OPAProcessor struct {
	// existing fields...
	compiledModules map[string]*ast.Compiler
	compiledMu      sync.RWMutex
}

func (opap *OPAProcessor) getCompiledRule(ctx context.Context, rule reporthandling.Rule, modules map[string]string) (*ast.Compiler, error) {
	// Check cache with read lock
	cacheKey := rule.Name + "|" + rule.Rule
	opap.compiledMu.RLock()
	if compiled, ok := opap.compiledModules[cacheKey]; ok {
		opap.compiledMu.RUnlock()
		return compiled, nil
	}
	opap.compiledMu.RUnlock()

	// Compile new module with write lock
	opap.compiledMu.Lock()
	defer opap.compiledMu.Unlock()

	// Double-check pattern (cache might have been filled)
	if compiled, ok := opap.compiledModules[cacheKey]; ok {
		return compiled, nil
	}

	compiled, err := ast.CompileModulesWithOpt(modules, ast.CompileOpts{
		EnablePrintStatements: opap.printEnabled,
		ParserOptions:         ast.ParserOptions{RegoVersion: ast.RegoV0},
	})
	if err != nil {
		return nil, fmt.Errorf("failed to compile rule '%s': %w", rule.Name, err)
	}
	opap.compiledModules[cacheKey] = compiled
	return compiled, nil
}
```
**Integration Point:** Replace the direct compilation call in `runRegoOnK8s()` (line 338) with the cached retrieval, roughly as sketched below.
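As an illustration (not existing Kubescape code), a sketch of how the cached compiler could feed evaluation; `evalWithCachedCompiler`, the query string, and the input plumbing are hypothetical, while the `rego` options used (`rego.Query`, `rego.Compiler`, `rego.Input`) come from the OPA SDK. Imports of `github.com/open-policy-agent/opa/rego` are omitted, matching the style of the other sketches.
```go
// Hypothetical wiring inside runRegoOnK8s: evaluate with the cached compiler
// instead of recompiling the rule on every call. The query string and input
// plumbing are placeholders.
func (opap *OPAProcessor) evalWithCachedCompiler(ctx context.Context, rule reporthandling.Rule, modules map[string]string, input interface{}) (rego.ResultSet, error) {
	compiled, err := opap.getCompiledRule(ctx, rule, modules)
	if err != nil {
		return nil, err
	}
	return rego.New(
		rego.Query("data"),      // placeholder; the real query targets the rule's package
		rego.Compiler(compiled), // reuse the cached *ast.Compiler from the sketch above
		rego.Input(input),
	).Eval(ctx)
}
```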
**Testing:**
- Unit test: Verify cache hit for identical rules
- Unit test: Verify cache miss for different rules
- Integration test: Measure scan time before/after
**Expected Savings:** 30-40% CPU reduction
**Risk:** Low - caching is a well-known pattern, minimal behavior change
**Dependencies:** None
---
### Phase 2: Map Pre-sizing
**Objective:** Reduce memory allocations and fragmentation
**Files Modified:**
- `core/cautils/datastructures.go`
- `core/cautils/datastructures_test.go`
- `core/pkg/resourcehandler/handlerpullresources.go`
- `core/pkg/resourcehandler/k8sresources.go`
**Changes:**
1. Update constructor to pre-size maps (cluster size estimated internally; an `estimateClusterSize` sketch follows this list):
```go
func NewOPASessionObj(ctx context.Context, frameworks []reporthandling.Framework, k8sResources K8SResources, scanInfo *ScanInfo) *OPASessionObj {
	clusterSize := estimateClusterSize(k8sResources)
	if clusterSize < 100 {
		clusterSize = 100
	}
	return &OPASessionObj{
		AllResources:    make(map[string]workloadinterface.IMetadata, clusterSize),
		ResourcesResult: make(map[string]resourcesresults.Result, clusterSize),
		// ... other pre-sized collections
	}
}
```
2. Update resource collection to return count:
```go
func (k8sHandler *K8sResourceHandler) pullResources(queryableResources QueryableResources, ...) (K8SResources, map[string]workloadinterface.IMetadata, map[string]workloadinterface.IMetadata, map[string]map[string]bool, int, error) {
	// ... existing code ...
	return k8sResources, allResources, externalResources, excludedRulesMap, estimatedCount, nil
}
```
3. Pass size during initialization:
```go
func CollectResources(ctx context.Context, rsrcHandler IResourceHandler, opaSessionObj *cautils.OPASessionObj, ...) error {
	resourcesMap, allResources, externalResources, excludedRulesMap, estimatedCount, err := rsrcHandler.GetResources(ctx, opaSessionObj, scanInfo)
	// Re-initialize with proper size
	if opaSessionObj.AllResources == nil {
		opaSessionObj = cautils.NewOPASessionObj(estimatedCount)
	}
	opaSessionObj.K8SResources = resourcesMap
	opaSessionObj.AllResources = allResources
	// ...
}
```
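The `estimateClusterSize` helper used in step 1 is not spelled out in this plan; a minimal sketch, assuming `K8SResources` maps group/version/resource keys to slices of resource IDs:
```go
// A possible estimateClusterSize: sum the number of resource IDs across all
// collected GVRs. Assumes K8SResources is a map from GVR key to resource IDs.
func estimateClusterSize(k8sResources K8SResources) int {
	total := 0
	for _, resourceIDs := range k8sResources {
		total += len(resourceIDs)
	}
	return total
}
```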
**Testing:**
- Unit test: Verify pre-sized maps with expected content
- Performance test: Compare memory usage before/after
- Integration test: Scan with varying cluster sizes
**Expected Savings:** 10-20% memory reduction, reduced GC pressure
**Risk:** Low - Go's make() with capacity hint is well-tested
**Dependencies:** None
---
### Phase 3: Set-based Deduplication
**Objective:** Replace O(n) slice operations with O(1) set operations
**Files Modified:**
- `core/pkg/utils/dedup.go` (new file)
- `core/core/scan.go`
- `core/pkg/opaprocessor/processorhandler.go`
**Changes:**
1. Create new utility:
```go
// core/pkg/utils/dedup.go
package utils

import "sync"

type StringSet struct {
	items map[string]struct{}
	mu    sync.RWMutex
}

func NewStringSet() *StringSet {
	return &StringSet{
		items: make(map[string]struct{}),
	}
}

func (s *StringSet) Add(item string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[item] = struct{}{}
}

func (s *StringSet) AddAll(items []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, item := range items {
		s.items[item] = struct{}{}
	}
}

func (s *StringSet) Contains(item string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.items[item]
	return ok
}

func (s *StringSet) ToSlice() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	result := make([]string, 0, len(s.items))
	for item := range s.items {
		result = append(result, item)
	}
	return result
}
```
2. Update image scanning (`core/core/scan.go:249`):
```go
func scanImages(scanType cautils.ScanTypes, scanData *cautils.OPASessionObj, ...) {
	imagesToScan := utils.NewStringSet()
	for _, workload := range scanData.AllResources {
		containers, err := workloadinterface.NewWorkloadObj(workload.GetObject()).GetContainers()
		if err != nil {
			logger.L().Error(...)
			continue
		}
		for _, container := range containers {
			if !imagesToScan.Contains(container.Image) {
				imagesToScan.Add(container.Image)
			}
		}
	}
	// Use imagesToScan.ToSlice() for iteration
}
```
3. Update related resources (`core/pkg/opaprocessor/processorhandler.go:261`):
```go
relatedResourcesIDs := utils.NewStringSet()

// Inside loop
if !relatedResourcesIDs.Contains(wl.GetID()) {
	relatedResourcesIDs.Add(wl.GetID())
	// ... process related resource
}
```
**Testing:**
- Unit tests for StringSet operations
- Benchmark tests comparing `slices.Contains` vs `StringSet.Contains` (see the sketch below)
- Integration tests with real scan scenarios
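A possible shape for the benchmark mentioned above; the item count and key format are arbitrary, and the file is assumed to sit alongside `dedup.go` in `core/pkg/utils`:
```go
package utils

import (
	"fmt"
	"slices"
	"testing"
)

// Benchmarks comparing linear slice lookups against StringSet lookups.
// 5000 items approximates a large cluster; the looked-up key is the worst
// case for the slice (last element).
func BenchmarkSliceContains(b *testing.B) {
	items := make([]string, 5000)
	for i := range items {
		items[i] = fmt.Sprintf("resource-%d", i)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = slices.Contains(items, "resource-4999")
	}
}

func BenchmarkStringSetContains(b *testing.B) {
	set := NewStringSet()
	for i := 0; i < 5000; i++ {
		set.Add(fmt.Sprintf("resource-%d", i))
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = set.Contains("resource-4999")
	}
}
```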
**Expected Savings:** 5-10% CPU reduction for large clusters
**Risk:** Low - thread-safe set implementation, minimal behavior change
**Dependencies:** None
---
### Phase 4: Cache getKubernetesObjects
**Objective:** Eliminate repeated computation of resource groupings
**Files Modified:**
- `core/pkg/opaprocessor/processorhandler.go`
- `core/pkg/opaprocessor/processorhandlerutils.go`
- `core/pkg/opaprocessor/processorhandler_test.go`
**Changes:**
1. Add cache to processor:
```go
type OPAProcessor struct {
	// existing fields...
	k8sObjectsCache map[string]map[string][]workloadinterface.IMetadata
	k8sObjectsMu    sync.RWMutex
}
```
2. Add cache key generation:
```go
func (opap *OPAProcessor) getCacheKey(match []reporthandling.RuleMatchObjects) string {
	// Use a distinct variable name so the strings package is not shadowed
	var keys []string
	for _, m := range match {
		for _, group := range m.APIGroups {
			for _, version := range m.APIVersions {
				for _, resource := range m.Resources {
					keys = append(keys, fmt.Sprintf("%s/%s/%s", group, version, resource))
				}
			}
		}
	}
	sort.Strings(keys)
	return strings.Join(keys, "|")
}
```
3. Wrap getKubernetesObjects with caching:
```go
func (opap *OPAProcessor) getKubernetesObjectsCached(k8sResources cautils.K8SResources, match []reporthandling.RuleMatchObjects) map[string][]workloadinterface.IMetadata {
	cacheKey := opap.getCacheKey(match)

	// Try cache
	opap.k8sObjectsMu.RLock()
	if cached, ok := opap.k8sObjectsCache[cacheKey]; ok {
		opap.k8sObjectsMu.RUnlock()
		return cached
	}
	opap.k8sObjectsMu.RUnlock()

	// Compute new value
	result := getKubernetesObjects(k8sResources, opap.AllResources, match)

	// Store in cache
	opap.k8sObjectsMu.Lock()
	opap.k8sObjectsCache[cacheKey] = result
	opap.k8sObjectsMu.Unlock()
	return result
}
```
**Testing:**
- Unit test: Verify cache correctness (see the sketch below)
- Benchmark: Compare execution time with/without cache
- Integration test: Measure scan time on large cluster
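One way the cache-correctness test could look: a sketch checking that `getCacheKey` is order-independent. The `reporthandling.RuleMatchObjects` field names follow the usage above; the rest is hypothetical test scaffolding intended for `processorhandler_test.go`.
```go
func TestGetCacheKeyIsOrderIndependent(t *testing.T) {
	opap := &OPAProcessor{}
	a := []reporthandling.RuleMatchObjects{
		{APIGroups: []string{"apps"}, APIVersions: []string{"v1"}, Resources: []string{"Deployment", "StatefulSet"}},
	}
	b := []reporthandling.RuleMatchObjects{
		{APIGroups: []string{"apps"}, APIVersions: []string{"v1"}, Resources: []string{"StatefulSet", "Deployment"}},
	}
	// Keys are sorted before joining, so equivalent matchers in a different
	// order must map to the same cache entry.
	if opap.getCacheKey(a) != opap.getCacheKey(b) {
		t.Fatal("expected identical cache keys for equivalent matchers")
	}
}
```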
**Expected Savings:** 10-15% CPU reduction
**Risk:** Low-Medium - cache invalidation would normally be the main concern, but resources are static for the duration of a scan, so no invalidation logic is needed
**Dependencies:** None
---
### Phase 5: Resource Streaming
**Objective:** Process resources in batches instead of loading all at once
**Files Modified:**
- `core/pkg/resourcehandler/k8sresources.go`
- `core/pkg/resourcehandler/interface.go`
- `core/pkg/resourcehandler/filesloader.go`
- `core/pkg/opaprocessor/processorhandler.go`
- `cmd/scan/scan.go`
**Changes:**
1. Add streaming interface:
```go
// core/pkg/resourcehandler/interface.go
type IResourceHandler interface {
	GetResources(...) (...)
	StreamResources(ctx context.Context, batchSize int) (<-chan workloadinterface.IMetadata, error)
}
```
2. Implement streaming for Kubernetes resources:
```go
func (k8sHandler *K8sResourceHandler) StreamResources(ctx context.Context, batchSize int) (<-chan workloadinterface.IMetadata, error) {
	ch := make(chan workloadinterface.IMetadata, batchSize)
	go func() {
		defer close(ch)
		queryableResources := k8sHandler.getQueryableResources()
		for i := range queryableResources {
			select {
			case <-ctx.Done():
				return
			default:
				apiGroup, apiVersion, resource := k8sinterface.StringToResourceGroup(queryableResources[i].GroupVersionResourceTriplet)
				gvr := schema.GroupVersionResource{Group: apiGroup, Version: apiVersion, Resource: resource}
				result, err := k8sHandler.pullSingleResource(&gvr, nil, queryableResources[i].FieldSelectors, nil)
				if err != nil {
					continue
				}
				metaObjs := ConvertMapListToMeta(k8sinterface.ConvertUnstructuredSliceToMap(result))
				for _, metaObj := range metaObjs {
					select {
					case ch <- metaObj:
					case <-ctx.Done():
						return
					}
				}
			}
		}
	}()
	return ch, nil
}
```
3. Update OPA processor to handle streaming (a sketch of the `batchToMap` helper follows this list):
```go
func (opap *OPAProcessor) ProcessWithStreaming(ctx context.Context, policies *cautils.Policies, resourceStream <-chan workloadinterface.IMetadata, batchSize int) error {
	batch := make([]workloadinterface.IMetadata, 0, batchSize)
	opaSessionObj := cautils.NewOPASessionObj(batchSize)

	// Collect batch
	done := false
	for !done {
		select {
		case resource, ok := <-resourceStream:
			if !ok {
				done = true
				break
			}
			batch = append(batch, resource)
			if len(batch) >= batchSize {
				opaSessionObj.AllResources = batchToMap(batch)
				if err := opap.ProcessBatch(ctx, policies); err != nil {
					return err
				}
				batch = batch[:0] // Clear batch
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	// Process remaining batch
	if len(batch) > 0 {
		opaSessionObj.AllResources = batchToMap(batch)
		if err := opap.ProcessBatch(ctx, policies); err != nil {
			return err
		}
	}
	return nil
}
```
4. Add CLI flags:
```go
// cmd/scan/scan.go
scanCmd.PersistentFlags().BoolVar(&scanInfo.StreamMode, "stream-resources", false, "Process resources in batches (lower memory, slightly slower)")
scanCmd.PersistentFlags().IntVar(&scanInfo.StreamBatchSize, "stream-batch-size", 100, "Batch size for resource streaming (lower = less memory)")
```
5. Auto-enable for large clusters:
```go
func shouldEnableStreaming(scanInfo *cautils.ScanInfo, estimatedClusterSize int) bool {
	if scanInfo.StreamMode {
		return true
	}
	largeClusterSize, _ := cautils.ParseIntEnvVar("LARGE_CLUSTER_SIZE", 2500)
	if estimatedClusterSize > largeClusterSize {
		logger.L().Info("Large cluster detected, enabling streaming mode")
		return true
	}
	return false
}
```
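The `batchToMap` helper used in step 3 is not defined in this plan; a minimal sketch, assuming `workloadinterface.IMetadata` exposes `GetID()` and that it matches the key used by `AllResources`:
```go
// batchToMap converts a batch of streamed resources into the map shape that
// AllResources expects. Assumes IMetadata.GetID() matches the existing map key.
func batchToMap(batch []workloadinterface.IMetadata) map[string]workloadinterface.IMetadata {
	resources := make(map[string]workloadinterface.IMetadata, len(batch))
	for _, resource := range batch {
		resources[resource.GetID()] = resource
	}
	return resources
}
```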
**Testing:**
- Unit test: Verify streaming produces same results as batch mode
- Performance test: Compare memory usage on large cluster
- Integration test: Test with various batch sizes
- End-to-end test: Verify scan results match existing behavior
**Expected Savings:** 30-50% memory reduction for large clusters
**Risk:** Medium - significant behavior change, needs thorough testing
**Dependencies:** Phase 2 (map pre-sizing)
---
### Phase 6: Early Cleanup
**Objective:** Free memory promptly after resources are processed
**Files Modified:**
- `core/pkg/opaprocessor/processorhandler.go`
- `core/pkg/opaprocessor/processorhandlerutils.go`
**Changes:**
```go
func (opap *OPAProcessor) Process(ctx context.Context, policies *cautils.Policies, progressListener IJobProgressNotificationClient) error {
	resourcesRemaining := make(map[string]bool)
	for id := range opap.AllResources {
		resourcesRemaining[id] = true
	}

	for _, toPin := range policies.Controls {
		control := toPin
		resourcesAssociatedControl, err := opap.processControl(ctx, &control)
		if err != nil {
			logger.L().Ctx(ctx).Warning(err.Error())
		}

		// Clean up processed resources if not needed for future controls
		if len(policies.Controls) > 10 && !isLargeCluster(len(opap.AllResources)) {
			for id := range resourcesAssociatedControl {
				if resourcesRemaining[id] {
					delete(resourcesRemaining, id)
					// Remove from AllResources
					if resource, ok := opap.AllResources[id]; ok {
						removeData(resource)
						delete(opap.AllResources, id)
					}
				}
			}
		}
	}
	return nil
}
```
**Testing:**
- Unit test: Verify cleanup doesn't affect scan results
- Memory test: Verify memory decreases during scan
- Integration test: Test with policies that reference same resources
**Expected Savings:** 10-20% memory reduction, reduced peak memory usage
**Risk:** Medium - needs careful tracking of which resources are still needed
**Dependencies:** Phase 5 (resource streaming)
---
## Implementation Timeline
### Iteration 1 (Quick Wins)
- **Week 1:** Phase 1 - OPA Module Caching
- **Week 1:** Phase 2 - Map Pre-sizing
- **Week 2:** Phase 3 - Set-based Deduplication
### Iteration 2 (Mid-Term)
- **Week 3:** Phase 4 - Cache getKubernetesObjects
### Iteration 3 (Long-Term)
- **Weeks 4-5:** Phase 5 - Resource Streaming
- **Week 6:** Phase 6 - Early Cleanup
### Total Duration: 6 weeks
---
## Risk Assessment
| Phase | Risk Level | Mitigation Strategy |
|-------|------------|-------------------|
| 1 - OPA Caching | Low | Comprehensive unit tests, fallback to uncached mode |
| 2 - Map Pre-sizing | Low | Backward compatible, capacity hints are safe |
| 3 - Set Dedup | Low | Thread-safe implementation, comprehensive tests |
| 4 - getK8SCache | Low-Medium | Cache key validation, cache invalidation logic |
| 5 - Streaming | Medium | Feature flag (disable by default), extensive integration tests |
| 6 - Early Cleanup | Medium | Track resource dependencies, thorough validation |
---
## Performance Targets
### Memory Usage
- **Current (Large Cluster >2500 resources):** ~2-4 GB
- **Target:** ~1-2 GB (50% reduction)
### CPU Usage
- **Current:** High peaks during OPA evaluation
- **Target:** 30-50% reduction in peak CPU
### Scan Time
- **Expected:** Neutral to slight improvement (streaming may add 5-10% overhead on small clusters, while large clusters benefit from reduced GC pressure)
---
## CLI Flags (Phase 5)
```bash
# Manual streaming mode
kubescape scan framework all --stream-resources --stream-batch-size 50
# Auto-detection (default)
kubescape scan framework all # Automatically enables streaming for large clusters
# Environment variable
export KUBESCAPE_STREAM_BATCH_SIZE=100
```
---
## Backward Compatibility
All changes are backward compatible:
1. Default behavior unchanged for small clusters (<2500 resources)
2. Streaming mode requires explicit flag or auto-detection
3. Cache changes are transparent to users
4. No breaking API changes
---
## Dependencies on External Packages
- `github.com/open-policy-agent/opa/ast` - OPA compilation (Phase 1)
- `github.com/kubescape/opa-utils` - Existing dependencies maintained
No new external dependencies required.
---
## Testing Strategy
### Unit Tests
- Each phase includes comprehensive unit tests
- Mock-based testing for components without external dependencies
- Property-based testing where applicable
### Integration Tests
- End-to-end scan validation
- Test clusters of varying sizes (100, 1000, 5000 resources)
- Validate identical results with and without optimizations
### Performance Tests
- Benchmark suite before/after each phase
- Memory profiling (pprof) for memory validation
- CPU profiling for CPU validation
### Regression Tests
- Compare scan results before/after all phases
- Validate all controls produce identical findings
- Test across different Kubernetes versions
---
## Success Criteria
1. **CPU Usage:** ≥30% reduction in peak CPU during scanning (measured with profiling)
2. **Memory Usage:** ≥40% reduction in peak memory for clusters >2500 resources
3. **Functional Correctness:** 100% of control findings identical to current implementation
4. **Scan Time:** No degradation >15% on small clusters; improvement on large clusters
5. **Stability:** Zero new race conditions or panics in production-style testing
---
## Alternative Approaches Considered
### Alternative 1: Worker Pool (Original #1793 Proposal)
- **Problem:** Limits concurrency to treat the symptoms but does not address the root causes (unbounded data structures, repeated compilation)
- **Conclusion:** Rejected - would not solve memory accumulation
### Alternative 2: Offload to Managed Service
- **Problem:** Shifts problem to infrastructure, doesn't solve core architecture
- **Conclusion:** Not appropriate for CLI tool use case
### Alternative 3: External Database for State
- **Problem:** Adds complexity, requires additional dependencies
- **Conclusion:** Overkill for single-scan operations
---
## Open Questions
1. **Cache Eviction Policy:** Should OPA module cache expire after N scans? (Current: process-scoped)
2. **Batch Size Tuning:** What default batch size balances memory vs. performance? (Proposed: 100)
3. **Early Cleanup Threshold:** What minimum control count enables early cleanup? (Proposed: 10)
4. **Large Cluster Threshold:** Keep existing 2500 or adjust based on optimization results?
---
## Recommendations
1. **Start with Phases 1-4** (low risk, good ROI) for immediate improvement
2. **Evaluate Phase 5-6** based on actual memory gains from earlier phases
3. **Add monitoring** to track real-world resource usage after deployment
4. **Consider making streaming opt-in** initially, then opt-out after validation
---
## Appendix: Key Code Locations
| Component | File | Line | Notes |
|-----------|------|------|-------|
| AllResources initialization | `core/cautils/datastructures.go` | 80-81 | Map pre-sizing target |
| OPA compilation | `core/pkg/opaprocessor/processorhandler.go` | 324-330 | Most CPU-intensive operation |
| getKubernetesObjects | `core/pkg/opaprocessor/processorhandlerutils.go` | 136-167 | 6-level nested loops |
| Resource collection | `core/pkg/resourcehandler/k8sresources.go` | 313-355 | Loads all resources |
| Image deduplication | `core/core/scan.go` | 249 | O(n) slice.Contains |
| Throttle package (unused) | `core/pkg/throttle/throttle.go` | - | Could be repurposed |
---
**Document Version:** 1.0
**Prepared by:** Code Investigation Team
**Review Status:** Awaiting stakeholder approval