mirror of
https://github.com/bloomberg/goldpinger.git
synced 2026-05-25 10:02:45 +00:00
After a DaemonSet rolling update, goldpinger retained response-time
histogram and error counter series for old pod IPs that no longer exist.
These stale single-sample series skewed P95/P99 latency calculations and
made transient rollout errors appear permanent. (Fixes #167)

The existing destroyPingers path only cleaned UDP-specific per-peer
metrics (and only when UDP was enabled). This adds:

- DeletePeerMetrics(): removes goldpinger_peers_response_time_s
  histogram label sets for destroyed peers, called unconditionally on
  pinger teardown
- goldpinger_udp_errors_total cleanup in DeletePeerUDPMetrics(), which
  was previously missed

Testing:

- TestDeletePeerMetrics_CleansResponseTimeHistogram: verifies the
  response-time histogram label set is removed after DeletePeerMetrics()
- TestDeletePeerMetrics_LeavesOtherPeersIntact: verifies pruning one
  peer does not affect another peer's metric series
- TestDeletePeerUDPMetrics_CleansAllPerPeerMetrics: extended to also
  verify goldpinger_udp_errors_total cleanup
- All 11 tests pass (go test ./pkg/goldpinger/ -v)

Validated on a 6-node IPv6 kubeadm cluster by upgrading goldpinger with a
rolling update and confirming /metrics only contains current pod IPs
after the rollout completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
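
Illustrative note: the per-peer pruning described above can be sketched
with the standard library alone. This is not the code from this commit.
The peerSeries type, its methods, and the pod IPs are hypothetical
stand-ins for the Prometheus metric vectors; only the two metric names
come from the message above. In client_golang, the analogous operation on
a metric vector is DeleteLabelValues; which call the commit itself uses is
not stated here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// seriesKey is a hypothetical label set identifying one per-peer series.
type seriesKey struct {
	metric string
	podIP  string
}

// peerSeries is a stand-in for a group of labelled metric vectors.
type peerSeries struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newPeerSeries() *peerSeries {
	return &peerSeries{series: make(map[seriesKey]float64)}
}

// observe adds a sample to the series for one metric and one peer.
func (p *peerSeries) observe(metric, podIP string, v float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.series[seriesKey{metric: metric, podIP: podIP}] += v
}

// deletePeer removes every series labelled with podIP, regardless of
// metric name, and returns how many series were removed.
func (p *peerSeries) deletePeer(podIP string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	removed := 0
	for k := range p.series {
		if k.podIP == podIP {
			delete(p.series, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// peers returns the sorted, de-duplicated pod IPs that still have series.
func (p *peerSeries) peers() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	seen := make(map[string]struct{})
	for k := range p.series {
		seen[k.podIP] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for ip := range seen {
		out = append(out, ip)
	}
	sort.Strings(out)
	return out
}

func main() {
	s := newPeerSeries()
	// Old pod (replaced during the rollout) and a pod that is still running.
	s.observe("goldpinger_peers_response_time_s", "fd00::a", 0.012)
	s.observe("goldpinger_udp_errors_total", "fd00::a", 1)
	s.observe("goldpinger_peers_response_time_s", "fd00::b", 0.009)

	// Pinger teardown for the old pod prunes all of its series.
	fmt.Println("removed:", s.deletePeer("fd00::a"))
	fmt.Println("remaining peers:", s.peers())
}
```

The point of the sketch is that cleanup is keyed on the peer rather than
on a single metric, so tearing down a pinger removes every series for
that pod IP and leaves other peers untouched.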