20 Commits

Author SHA1 Message Date
Cooper Ry Lees
145d2bf000 Rename PathLength to HopCount in swagger model and UI
Rename the swagger field from path-length to hop-count so the
generated Go struct field (PathLength → HopCount) and JSON key
(path-length → hop-count) align with the Prometheus metric rename
to goldpinger_peers_hop_count from the previous commit.

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:45:31 +00:00
Cooper Ry Lees
832bc7b598 Add UDP probe metrics: packet loss, hop count, and RTT
Add an opt-in UDP echo probe that runs alongside the existing HTTP
ping. Each goldpinger pod listens on a configurable UDP port (default
6969). During each ping cycle, the prober sends N sequenced packets
to the peer's listener, which echoes them back. From the replies we
compute packet loss percentage, path hop count (from IPv4 TTL / IPv6
HopLimit), and average round-trip time.

New Prometheus metrics:
  - goldpinger_peers_loss_pct      (gauge)     — per-peer UDP loss %
  - goldpinger_peers_path_length   (gauge)     — estimated hop count
  - goldpinger_peers_udp_rtt_ms    (histogram) — UDP RTT in milliseconds

The graph UI shows yellow edges for links with partial loss, and
displays sub-millisecond UDP RTT instead of HTTP latency when UDP
is enabled. Stale metric labels are cleaned up when a pinger is
destroyed so rolled pods don't leave ghost entries.

Configuration (all via env vars, disabled by default):
  UDP_ENABLED=true      enable UDP probing and listener
  UDP_PORT=6969         listener port
  UDP_PACKET_COUNT=10   packets per probe
  UDP_PACKET_SIZE=64    bytes per packet
  UDP_TIMEOUT=1s        probe timeout

New files:
  pkg/goldpinger/udp_probe.go       — echo listener + probe client
  pkg/goldpinger/udp_probe_test.go  — unit tests

Unit tests:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
  goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
  goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0

Simulated 50% loss via ip6tables DROP in pod netns on node-0:
  goldpinger_peers_loss_pct{instance="server",...} 60
  goldpinger_peers_loss_pct{instance="node-1",...} 30
  goldpinger_peers_loss_pct{instance="server2",...} 30

UDP RTT vs HTTP RTT (check_all API):
  node-0 -> server:  udp=2.18ms  http=2ms
  node-2 -> node-2:  udp=0.40ms  http=1ms
  server -> node-0:  udp=0.55ms  http=2ms

Post-rollout stale metrics cleanup verified:
  All 36 edges show 0% loss, no stale pod IPs.
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
2026-03-27 16:05:32 +00:00
tgetachew
14ea96999a add external probes
Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>

make timeout flags backwards compatible

Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>
2022-05-08 22:02:09 -04:00
Mikolaj Pawlikowski
52ff43ec7d Make it return 418 on cluster health problem
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 17:06:57 +00:00
Mikolaj Pawlikowski
5d2ad6ce19 Also add a total number of nodes in a field for convenience
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:31:01 +00:00
Mikolaj Pawlikowski
b0730e88df Actually, just keep it simple
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:05:53 +00:00
Mikolaj Pawlikowski
c7a7008bf5 Separately for the pods and hosts
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 12:56:27 +00:00
Mikolaj Pawlikowski
eb3113aa7f Add the schema for the new endpoint
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 11:21:02 +00:00
Sachin Kamboj
d68d35bbab Add a ping time that gives the last time a node was pinged
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 21:04:43 -04:00
Sachin Kamboj
00cd1e3886 Auto-generated code with the changes to the swagger
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-03 23:09:05 -04:00
Sachin Kamboj
4152784d21 Add a PodIP to the results now that we are using PodName as a key
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-03 23:00:48 -04:00
Chris Green
1b12b7dc6b Version bump
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-09 08:44:49 -04:00
Chris Green
b64f5152f2 Moved DnsResults into CheckResults
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-05 14:28:40 -04:00
Chris Green
bbac97c17e Refactored swagger.yml for fewer dns requests
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-04 02:54:23 -04:00
Chris Green
6541250aa9 Updated swagger.yml for map of dnsresults
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-02 11:54:20 -04:00
Chris Green
aa66c94d47 Initial thoughts on swagger change
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-01 11:41:01 -04:00
Mikolaj Pawlikowski
2bcff8b2d1 make the response time field explicit in its usage of milliseconds
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-02-20 16:41:15 +00:00
tfinethy
871b86471d Include response time in PodResult
Signed-off-by: tfinethy <tfinethy@cogolabs.com>

Remove all swagger updates

Change response-time to match status-code formatting

Switch to float64 and use milliseconds as the unit
2019-02-17 12:38:53 -05:00
Ivan Kalita
ebbe41f4ef Add simple healthz endpoint
According to #8.

Signed-off-by: Ivan Kalita <kaduev13@gmail.com>
2018-12-20 18:12:18 +01:00
Kevin P. Fleming
fa643e9be8 Initial commit 2018-12-04 13:33:45 -05:00