goldpinger/pkg
Cooper Ry Lees 5e625bbd40 Prune stale Prometheus metrics for defunct peer pod IPs on teardown
After a DaemonSet rolling update, goldpinger retained response-time
histogram and error counter series for old pod IPs that no longer exist.
These stale single-sample series skewed P95/P99 latency calculations and
made transient rollout errors appear permanent. (Fixes #167)

The existing destroyPingers path cleaned up only the UDP-specific per-peer
metrics, and only when UDP was enabled. This change adds:

- DeletePeerMetrics(): removes goldpinger_peers_response_time_s histogram
  label sets for destroyed peers, called unconditionally on pinger teardown
- goldpinger_udp_errors_total cleanup in DeletePeerUDPMetrics(), which was
  previously missed
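
The teardown behaviour described above can be sketched with a small
stdlib-only model. This is an illustration under stated assumptions, not the
code from this commit: `seriesKey`, `peerVec`, `registry`, `destroyPinger` and
the label fields are hypothetical stand-ins. The real implementation would
presumably call the Prometheus client's vector deletion methods (for example
`DeleteLabelValues`) on the actual collectors.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey identifies one label set. The label fields are hypothetical;
// the real goldpinger label names may differ.
type seriesKey struct {
	HostIP string
	PodIP  string
}

// peerVec is a minimal stand-in for a labelled metric vector.
type peerVec struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newPeerVec() *peerVec {
	return &peerVec{series: make(map[seriesKey]float64)}
}

// Observe adds a value to the series for k, creating the series if needed.
func (v *peerVec) Observe(k seriesKey, val float64) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.series[k] += val
}

// DeletePeer removes every series whose PodIP matches and returns the
// number of series removed.
func (v *peerVec) DeletePeer(podIP string) int {
	v.mu.Lock()
	defer v.mu.Unlock()
	removed := 0
	for k := range v.series {
		if k.PodIP == podIP {
			delete(v.series, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// Len reports how many series the vector currently holds.
func (v *peerVec) Len() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.series)
}

// registry groups the two per-peer vectors this sketch models.
type registry struct {
	responseTime *peerVec
	udpErrors    *peerVec
	udpEnabled   bool
}

func newRegistry(udpEnabled bool) *registry {
	return &registry{
		responseTime: newPeerVec(),
		udpErrors:    newPeerVec(),
		udpEnabled:   udpEnabled,
	}
}

// destroyPinger prunes the response-time series unconditionally and the
// UDP error series only when UDP is enabled.
func (r *registry) destroyPinger(podIP string) {
	r.responseTime.DeletePeer(podIP)
	if r.udpEnabled {
		r.udpErrors.DeletePeer(podIP)
	}
}

func main() {
	r := newRegistry(false) // UDP disabled: response-time cleanup must still run
	oldPeer := seriesKey{HostIP: "fd00:1::10", PodIP: "fd00:2::a"}
	newPeer := seriesKey{HostIP: "fd00:1::10", PodIP: "fd00:2::b"}
	r.responseTime.Observe(oldPeer, 0.012)
	r.responseTime.Observe(newPeer, 0.009)
	r.destroyPinger(oldPeer.PodIP)
	fmt.Println("response-time series remaining:", r.responseTime.Len())
}
```

The point of the sketch is the shape of `destroyPinger`: the response-time
cleanup sits outside the UDP check, so it runs even when UDP probing is off.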

Testing:
- TestDeletePeerMetrics_CleansResponseTimeHistogram: verifies the
  response-time histogram label set is removed after DeletePeerMetrics()
- TestDeletePeerMetrics_LeavesOtherPeersIntact: verifies pruning one
  peer does not affect another peer's metric series
- TestDeletePeerUDPMetrics_CleansAllPerPeerMetrics: extended to also
  verify goldpinger_udp_errors_total cleanup
- All 11 tests pass (go test ./pkg/goldpinger/ -v)

Validated on a 6-node IPv6 kubeadm cluster by upgrading goldpinger with
a rolling update and confirming /metrics only contains current pod IPs
after the rollout completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
2026-04-20 14:02:40 -05:00