Track received sequence numbers in a map during each UDP probe to
detect duplicate replies (same seq seen twice) and out-of-order
delivery (seq lower than the highest previously seen).
The receive loop now uses a recvState struct with a processPacket
method that validates magic, checks for duplicate seq numbers, and
tracks ordering. Duplicates are not counted toward the received
total, and the loop allows up to 2*count iterations to handle them
without prematurely timing out.
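
The duplicate and ordering checks described above can be sketched roughly
as follows. This is an illustration only: seqTracker and observe are
hypothetical names standing in for recvState and processPacket, and the
magic validation is left out.

```go
package main

import "fmt"

// seqTracker is a hypothetical stand-in for the recvState described above.
type seqTracker struct {
	seen       map[uint32]bool
	highest    uint32
	haveAny    bool
	received   int
	duplicates int
	outOfOrder int
}

func newSeqTracker() *seqTracker {
	return &seqTracker{seen: make(map[uint32]bool)}
}

// observe records one reply and reports whether it counted as a new packet.
func (s *seqTracker) observe(seq uint32) bool {
	if s.seen[seq] {
		s.duplicates++ // same seq seen twice: not counted as received
		return false
	}
	s.seen[seq] = true
	s.received++
	if s.haveAny && seq < s.highest {
		s.outOfOrder++ // arrived after a higher seq had already been seen
	}
	if !s.haveAny || seq > s.highest {
		s.highest = seq
		s.haveAny = true
	}
	return true
}

func main() {
	tr := newSeqTracker()
	for _, seq := range []uint32{1, 0, 3, 2, 2} {
		tr.observe(seq)
	}
	// prints "4 1 2": four unique replies, one duplicate, two out of order
	fmt.Println(tr.received, tr.duplicates, tr.outOfOrder)
}
```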
New Prometheus counters:
- goldpinger_udp_duplicates_total — duplicate reply packets
- goldpinger_udp_out_of_order_total — out-of-order reply packets
Both are cumulative counters with labels (goldpinger_instance,
host_ip, pod_ip), incremented per-probe by the number of events
detected. Non-zero values suggest network-level packet duplication
or reordering (for example across multiple paths) worth investigating.
New tests:
- TestProbeUDP_Duplicates: echo listener sends every packet twice,
verifies duplicates are detected and don't inflate received count
- TestProbeUDP_OutOfOrder: echo listener buffers pairs and returns
them in reverse order, verifies out-of-order is detected
Test results:
```
=== RUN TestProbeUDP_NoLoss
udp_probe_test.go:158: avg UDP RTT: 0.1011 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN TestProbeUDP_PartialLoss
=== RUN TestProbeUDP_PartialLoss/drop_every_2nd_(50%)
udp_probe_test.go:204: loss: 50.0% (expected 50.0%)
=== RUN TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%)
udp_probe_test.go:204: loss: 33.3% (expected 33.3%)
=== RUN TestProbeUDP_PartialLoss/drop_every_5th_(20%)
udp_probe_test.go:204: loss: 20.0% (expected 20.0%)
=== RUN TestProbeUDP_PartialLoss/drop_every_10th_(10%)
udp_probe_test.go:204: loss: 10.0% (expected 10.0%)
--- PASS: TestProbeUDP_PartialLoss (8.01s)
=== RUN TestProbeUDP_ZeroCount
--- PASS: TestProbeUDP_ZeroCount (0.00s)
=== RUN TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN TestProbeUDP_Duplicates
udp_probe_test.go:246: duplicates detected: 4
--- PASS: TestProbeUDP_Duplicates (0.00s)
=== RUN TestProbeUDP_OutOfOrder
udp_probe_test.go:263: out-of-order detected: 5, duplicates: 0
--- PASS: TestProbeUDP_OutOfOrder (0.00s)
=== RUN TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrent HTTP + UDP pings:
HTTP ping and UDP probe now run in separate goroutines via
sync.WaitGroup, so UDP timeout doesn't add to the ping cycle
latency. (skamboj on pinger.go:124)
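
A minimal sketch of the concurrency pattern, assuming the two probes can
be treated as plain functions. runConcurrently is a hypothetical helper
and is not the code in pinger.go.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runConcurrently runs both probes in their own goroutines and waits for
// both, so the cycle takes as long as the slower probe, not the sum.
func runConcurrently(httpPing, udpProbe func()) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		httpPing()
	}()
	go func() {
		defer wg.Done()
		udpProbe()
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	// The sleeps stand in for an HTTP ping and a UDP probe hitting its timeout.
	elapsed := runConcurrently(
		func() { time.Sleep(50 * time.Millisecond) },
		func() { time.Sleep(100 * time.Millisecond) },
	)
	fmt.Println("cycle took", elapsed)
}
```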
Remove duplicate log:
Removed the "UDP echo listener started" log from main.go since
StartUDPListener already logs it. (skamboj on main.go:191)
Prometheus base units (seconds):
Renamed goldpinger_peers_udp_rtt_ms to goldpinger_peers_udp_rtt_s
with sub-millisecond histogram buckets (.0001s to 1s), per Prometheus
naming conventions. RTT is computed in seconds internally and only
converted to ms for the JSON API. (skamboj on stats.go:150)
Rename path_length to hop_count:
goldpinger_peers_path_length → goldpinger_peers_hop_count, and
SetPeerPathLength → SetPeerHopCount. (skamboj on stats.go:139)
UDP buffer constant and packet size clamping:
Added udpMaxPacketSize=1500 constant, documented as the standard
Ethernet MTU, an upper bound on what crosses most networks without
fragmentation (the MTU includes the IP and UDP headers, so the
unfragmented payload is slightly smaller). Used for both listener and
prober receive buffers. ProbeUDP now clamps UDP_PACKET_SIZE to
udpMaxPacketSize to prevent silent truncation if someone configures a
size larger than the receive buffer.
(skamboj on udp_probe.go:54)
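
The clamp is roughly the following. clampPacketSize is a hypothetical
helper name; in the change itself the check lives inside ProbeUDP.

```go
package main

import "fmt"

// udpMaxPacketSize mirrors the constant named in this change.
const udpMaxPacketSize = 1500

// clampPacketSize caps the configured size at the receive buffer size.
func clampPacketSize(configured int) int {
	if configured > udpMaxPacketSize {
		return udpMaxPacketSize
	}
	return configured
}

func main() {
	fmt.Println(clampPacketSize(64), clampPacketSize(9000)) // prints "64 1500"
}
```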
Guard count=0:
ProbeUDP returns an error immediately if count <= 0 instead of
dividing by zero. (skamboj on udp_probe.go:176)
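
A sketch of why the guard matters, assuming loss is computed as
(sent - received) / sent. lossPercent and errBadCount are hypothetical
names, and the error text is not the one ProbeUDP returns.

```go
package main

import (
	"errors"
	"fmt"
)

var errBadCount = errors.New("udp probe: packet count must be positive")

// lossPercent refuses a non-positive count instead of dividing by zero.
func lossPercent(sent, received int) (float64, error) {
	if sent <= 0 {
		return 0, errBadCount
	}
	return float64(sent-received) / float64(sent) * 100, nil
}

func main() {
	_, err := lossPercent(0, 0)
	fmt.Println(err) // prints the guard error
	pct, _ := lossPercent(10, 5)
	fmt.Println(pct) // prints "50"
}
```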
UDP error counter:
Added goldpinger_udp_errors_total counter (labels: goldpinger_instance,
host). CountUDPError is called on dial failures and send errors.
(skamboj on udp_probe.go:115)
Test: random source port for full loss:
TestProbeUDP_FullLoss now binds an ephemeral port and closes it,
instead of assuming port 19999 is free. (skamboj on udp_probe_test.go:56)
Test: partial loss validation:
New TestProbeUDP_PartialLoss uses a lossy echo listener that drops
every Nth packet to validate loss calculations are exact:
drop every 2nd → 50.0%, every 3rd → 33.3%,
every 5th → 20.0%, every 10th → 10.0%
(skamboj on udp_probe_test.go:96)
Test: zero count:
New TestProbeUDP_ZeroCount verifies error is returned for count=0.
Test results:
```
=== RUN TestProbeUDP_NoLoss
udp_probe_test.go:88: avg UDP RTT: 0.0816 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN TestProbeUDP_PartialLoss
=== RUN TestProbeUDP_PartialLoss/drop_every_2nd_(50%)
udp_probe_test.go:134: loss: 50.0% (expected 50.0%)
=== RUN TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%)
udp_probe_test.go:134: loss: 33.3% (expected 33.3%)
=== RUN TestProbeUDP_PartialLoss/drop_every_5th_(20%)
udp_probe_test.go:134: loss: 20.0% (expected 20.0%)
=== RUN TestProbeUDP_PartialLoss/drop_every_10th_(10%)
udp_probe_test.go:134: loss: 10.0% (expected 10.0%)
--- PASS: TestProbeUDP_PartialLoss (8.00s)
=== RUN TestProbeUDP_ZeroCount
--- PASS: TestProbeUDP_ZeroCount (0.00s)
=== RUN TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an opt-in UDP echo probe that runs alongside the existing HTTP
ping. Each goldpinger pod listens on a configurable UDP port (default
6969). During each ping cycle, the prober sends N sequenced packets
to the peer's listener, which echoes them back. From the replies we
compute packet loss percentage, path hop count (from IPv4 TTL / IPv6
HopLimit), and average round-trip time.
New Prometheus metrics:
- goldpinger_peers_loss_pct (gauge) — per-peer UDP loss %
- goldpinger_peers_path_length (gauge) — estimated hop count
- goldpinger_peers_udp_rtt_ms (histogram) — UDP RTT in milliseconds
The graph UI shows yellow edges for links with partial loss, and
displays sub-millisecond UDP RTT instead of HTTP latency when UDP
is enabled. Stale metric labels are cleaned up when a pinger is
destroyed so rolled pods don't leave ghost entries.
Configuration (all via env vars, disabled by default):
UDP_ENABLED=true enable UDP probing and listener
UDP_PORT=6969 listener port
UDP_PACKET_COUNT=10 packets per probe
UDP_PACKET_SIZE=64 bytes per packet
UDP_TIMEOUT=1s probe timeout
New files:
pkg/goldpinger/udp_probe.go — echo listener + probe client
pkg/goldpinger/udp_probe_test.go — unit tests
Unit tests:
```
=== RUN TestProbeUDP_NoLoss
udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```
Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0
Simulated 50% loss via ip6tables DROP in pod netns on node-0:
goldpinger_peers_loss_pct{instance="server",...} 60
goldpinger_peers_loss_pct{instance="node-1",...} 30
goldpinger_peers_loss_pct{instance="server2",...} 30
UDP RTT vs HTTP RTT (check_all API):
node-0 -> server: udp=2.18ms http=2ms
node-2 -> node-2: udp=0.40ms http=1ms
server -> node-0: udp=0.55ms http=2ms
Post-rollout stale metrics cleanup verified:
All 36 edges show 0% loss, no stale pod IPs.
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
In most default Kubernetes installations the master nodes are tainted, so the Goldpinger DaemonSet is not deployed to them. To run it on master nodes as well, add a toleration for that taint.
Signed-off-by: suleimanWA <suleiman-94@hotmail.com>
If you have Prometheus in your environment, adding these annotations lets Prometheus auto-discovery scrape your metrics automatically from service-name:port/metrics
Signed-off-by: suleimanWA <suleiman-94@hotmail.com>
Select a user-defined number of pods via
rendezvous hashing. This is important for larger
clusters, where the metric cardinality explosion
is too much for a single Prometheus to handle.
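
A minimal sketch of rendezvous (highest random weight) selection,
assuming each pod ranks its peers by a hash of the (self, peer) pair and
keeps the top n. The FNV hash, the names score and pickPeers, and the
pod names are illustrative assumptions, not the implementation in this
change.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score hashes the (self, peer) pair; higher scores win.
func score(self, peer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(self))
	h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
	h.Write([]byte(peer))
	return h.Sum64()
}

// pickPeers returns the n peers with the highest score for self.
// The result does not depend on the order of the input slice.
func pickPeers(self string, peers []string, n int) []string {
	ranked := append([]string(nil), peers...)
	sort.Slice(ranked, func(i, j int) bool {
		si, sj := score(self, ranked[i]), score(self, ranked[j])
		if si != sj {
			return si > sj
		}
		return ranked[i] < ranked[j] // deterministic tie-break
	})
	if n < 0 {
		n = 0
	}
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

func main() {
	peers := []string{"pod-1", "pod-2", "pod-3", "pod-4", "pod-5"}
	fmt.Println(pickPeers("pod-a", peers, 2))
}
```

Because each peer's score is independent of the others, removing a pod
that was not selected leaves the selection unchanged, which keeps the
set of pinged pods stable as the cluster changes.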
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>