Rename the swagger field from path-length to hop-count so the
generated Go struct field (PathLength → HopCount) and JSON key
(path-length → hop-count) align with the Prometheus metric rename
to goldpinger_peers_hop_count from the previous commit.
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrent HTTP + UDP pings:
HTTP ping and UDP probe now run in separate goroutines via
sync.WaitGroup, so a UDP timeout overlaps with the HTTP ping instead
of adding to the ping cycle latency. (skamboj on pinger.go:124)
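For illustration, a minimal sketch of the concurrency pattern described
above. runProbes and the two callbacks are hypothetical stand-ins, not
the actual pinger.go code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runProbes runs both probes concurrently and waits for both to finish.
// httpPing and udpProbe are hypothetical stand-ins for the real calls.
func runProbes(httpPing, udpProbe func()) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		httpPing()
	}()
	go func() {
		defer wg.Done()
		udpProbe()
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	elapsed := runProbes(
		func() { time.Sleep(20 * time.Millisecond) }, // simulated HTTP ping
		func() { time.Sleep(50 * time.Millisecond) }, // simulated UDP timeout
	)
	// The cycle takes about as long as the slower probe, not the sum.
	fmt.Println("cycle took", elapsed.Round(time.Millisecond))
}
```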
Remove duplicate log:
Removed the "UDP echo listener started" log from main.go since
StartUDPListener already logs it. (skamboj on main.go:191)
Prometheus base units (seconds):
Renamed goldpinger_peers_udp_rtt_ms back to goldpinger_peers_udp_rtt_s
with sub-millisecond histogram buckets (.0001s to 1s), per Prometheus
naming conventions. RTT is computed in seconds internally and only
converted to ms for the JSON API. (skamboj on stats.go:150)
Rename path_length to hop_count:
goldpinger_peers_path_length → goldpinger_peers_hop_count, and
SetPeerPathLength → SetPeerHopCount. (skamboj on stats.go:139)
UDP buffer constant and packet size clamping:
Added udpMaxPacketSize=1500 constant, documented as the standard
Ethernet MTU. Note that 1500 bytes bounds the whole IP packet: the
largest UDP payload that avoids fragmentation at that MTU is 1472
bytes over IPv4 and 1452 bytes over IPv6. The constant is used for both
listener and prober receive buffers. ProbeUDP now clamps
UDP_PACKET_SIZE to udpMaxPacketSize to prevent silent truncation if
someone configures a size larger than the receive buffer.
(skamboj on udp_probe.go:54)
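The clamp can be sketched as follows. clampPacketSize is a hypothetical
helper name; the real logic lives inside ProbeUDP and may differ in
detail:

```go
package main

import "fmt"

// udpMaxPacketSize matches the constant named in this commit.
const udpMaxPacketSize = 1500

// clampPacketSize caps a configured packet size at the receive buffer
// size so replies are never silently truncated. Hypothetical helper.
func clampPacketSize(size int) int {
	if size > udpMaxPacketSize {
		return udpMaxPacketSize
	}
	return size
}

func main() {
	fmt.Println(clampPacketSize(64), clampPacketSize(9000)) // prints "64 1500"
}
```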
Guard count=0:
ProbeUDP returns an error immediately if count <= 0 instead of
dividing by zero. (skamboj on udp_probe.go:176)
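A sketch of the guard, shown on a standalone loss calculation. lossPct
is a hypothetical function and the error text is an assumption, not the
message ProbeUDP returns:

```go
package main

import (
	"errors"
	"fmt"
)

// lossPct returns the percentage of packets that were not echoed back.
// It rejects sent <= 0 up front so the division below is always safe.
func lossPct(sent, received int) (float64, error) {
	if sent <= 0 {
		return 0, errors.New("packet count must be greater than zero")
	}
	return float64(sent-received) * 100 / float64(sent), nil
}

func main() {
	if _, err := lossPct(0, 0); err != nil {
		fmt.Println("error:", err)
	}
	pct, _ := lossPct(10, 5)
	fmt.Printf("%.1f%%\n", pct) // prints "50.0%"
}
```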
UDP error counter:
Added goldpinger_udp_errors_total counter (labels: goldpinger_instance,
host). CountUDPError is called on dial failures and send errors.
(skamboj on udp_probe.go:115)
Test: random source port for full loss:
TestProbeUDP_FullLoss now binds an ephemeral port and closes it,
instead of assuming port 19999 is free. (skamboj on udp_probe_test.go:56)
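The bind-then-close approach can be sketched like this. closedUDPAddr
is a hypothetical helper, not the code in udp_probe_test.go:

```go
package main

import (
	"fmt"
	"net"
)

// closedUDPAddr asks the OS for a free UDP port on loopback (Port: 0),
// records the assigned address, and closes the socket so that nothing
// is listening there when the probe runs.
func closedUDPAddr() (*net.UDPAddr, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0})
	if err != nil {
		return nil, err
	}
	addr := conn.LocalAddr().(*net.UDPAddr)
	if err := conn.Close(); err != nil {
		return nil, err
	}
	return addr, nil
}

func main() {
	addr, err := closedUDPAddr()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("probe target with no listener:", addr)
}
```

Another process could in principle grab the port between Close and the
probe, but that window is far narrower than relying on a fixed port.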
Test: partial loss validation:
New TestProbeUDP_PartialLoss uses a lossy echo listener that drops
every Nth packet to validate that loss calculations are exact:
drop every 2nd → 50.0%, every 3rd → 33.3%,
every 5th → 20.0%, every 10th → 10.0%
(skamboj on udp_probe_test.go:96)
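The expected percentages follow from simple counting. The sketch below
simulates the drop rule without any sockets; simulateLoss is
hypothetical, and the packet count of 30 is an assumption chosen so
that every divisor yields a whole number of drops:

```go
package main

import "fmt"

// simulateLoss sends count sequenced packets (1-based) through a
// listener that drops every nth one and returns the loss percentage.
func simulateLoss(count, n int) float64 {
	if count <= 0 || n <= 0 {
		return 0
	}
	received := 0
	for seq := 1; seq <= count; seq++ {
		if seq%n != 0 { // every nth packet is dropped, the rest are echoed
			received++
		}
	}
	return float64(count-received) * 100 / float64(count)
}

func main() {
	for _, n := range []int{2, 3, 5, 10} {
		fmt.Printf("drop every %d: %.1f%%\n", n, simulateLoss(30, n))
	}
}
```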
Test: zero count:
New TestProbeUDP_ZeroCount verifies that an error is returned for count=0.
Test results:
```
=== RUN TestProbeUDP_NoLoss
udp_probe_test.go:88: avg UDP RTT: 0.0816 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN TestProbeUDP_PartialLoss
=== RUN TestProbeUDP_PartialLoss/drop_every_2nd_(50%)
udp_probe_test.go:134: loss: 50.0% (expected 50.0%)
=== RUN TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%)
udp_probe_test.go:134: loss: 33.3% (expected 33.3%)
=== RUN TestProbeUDP_PartialLoss/drop_every_5th_(20%)
udp_probe_test.go:134: loss: 20.0% (expected 20.0%)
=== RUN TestProbeUDP_PartialLoss/drop_every_10th_(10%)
udp_probe_test.go:134: loss: 10.0% (expected 10.0%)
--- PASS: TestProbeUDP_PartialLoss (8.00s)
=== RUN TestProbeUDP_ZeroCount
--- PASS: TestProbeUDP_ZeroCount (0.00s)
=== RUN TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an opt-in UDP echo probe that runs alongside the existing HTTP
ping. Each goldpinger pod listens on a configurable UDP port (default
6969). During each ping cycle, the prober sends N sequenced packets
to the peer's listener, which echoes them back. From the replies we
compute packet loss percentage, path hop count (from IPv4 TTL / IPv6
HopLimit), and average round-trip time.
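One common way to turn a received TTL or HopLimit into a hop count is
to assume the sender started from the nearest common default at or
above the observed value (64, 128, or 255) and subtract. The sketch
below uses that heuristic as an assumption; hopsFromTTL is a
hypothetical name and the actual estimator in udp_probe.go may differ:

```go
package main

import "fmt"

// hopsFromTTL guesses how many routers decremented the TTL by assuming
// the sender used the smallest common initial value (64, 128, 255)
// that is not below the observed TTL. Heuristic only.
func hopsFromTTL(ttl int) int {
	if ttl <= 0 {
		return 0 // no usable TTL information
	}
	for _, initial := range []int{64, 128, 255} {
		if ttl <= initial {
			return initial - ttl
		}
	}
	return 0 // TTL above 255 is not valid; treat as unknown
}

func main() {
	fmt.Println(hopsFromTTL(64), hopsFromTTL(61), hopsFromTTL(125)) // prints "0 3 3"
}
```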
New Prometheus metrics:
- goldpinger_peers_loss_pct (gauge) — per-peer UDP loss %
- goldpinger_peers_path_length (gauge) — estimated hop count
- goldpinger_peers_udp_rtt_ms (histogram) — UDP RTT in milliseconds
The graph UI shows yellow edges for links with partial loss, and
displays sub-millisecond UDP RTT instead of HTTP latency when UDP
is enabled. Stale metric labels are cleaned up when a pinger is
destroyed so rolled pods don't leave ghost entries.
Configuration (all via env vars, disabled by default):
  UDP_ENABLED=true      enable UDP probing and listener
  UDP_PORT=6969         listener port
  UDP_PACKET_COUNT=10   packets per probe
  UDP_PACKET_SIZE=64    bytes per packet
  UDP_TIMEOUT=1s        probe timeout
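A sketch of reading these variables with the standard library only.
udpConfig, intOr, and loadUDPConfig are hypothetical names, treating
the listed values as defaults is an assumption, and goldpinger's real
flag handling may work differently:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// udpConfig is a hypothetical container for the settings listed above.
type udpConfig struct {
	Enabled     bool
	Port        int
	PacketCount int
	PacketSize  int
	Timeout     time.Duration
}

// intOr parses s as an integer and falls back to def when s is empty
// or invalid.
func intOr(s string, def int) int {
	if v, err := strconv.Atoi(s); err == nil {
		return v
	}
	return def
}

// loadUDPConfig reads the variables through getenv so it can be
// exercised without touching the process environment.
func loadUDPConfig(getenv func(string) string) udpConfig {
	cfg := udpConfig{
		Enabled:     getenv("UDP_ENABLED") == "true",
		Port:        intOr(getenv("UDP_PORT"), 6969),
		PacketCount: intOr(getenv("UDP_PACKET_COUNT"), 10),
		PacketSize:  intOr(getenv("UDP_PACKET_SIZE"), 64),
		Timeout:     time.Second,
	}
	if d, err := time.ParseDuration(getenv("UDP_TIMEOUT")); err == nil {
		cfg.Timeout = d
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", loadUDPConfig(os.Getenv))
}
```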
New files:
pkg/goldpinger/udp_probe.go — echo listener + probe client
pkg/goldpinger/udp_probe_test.go — unit tests
Unit tests:
```
=== RUN TestProbeUDP_NoLoss
udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```
Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0
Simulated 50% loss via ip6tables DROP in pod netns on node-0:
goldpinger_peers_loss_pct{instance="server",...} 60
goldpinger_peers_loss_pct{instance="node-1",...} 30
goldpinger_peers_loss_pct{instance="server2",...} 30
UDP RTT vs HTTP RTT (check_all API):
node-0 -> server: udp=2.18ms http=2ms
node-2 -> node-2: udp=0.40ms http=1ms
server -> node-0: udp=0.55ms http=2ms
Post-rollout stale metrics cleanup verified:
All 36 edges show 0% loss, no stale pod IPs.
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>