Commit Graph

31 Commits

Author SHA1 Message Date
Cooper Ry Lees
145d2bf000 Rename PathLength to HopCount in swagger model and UI
Rename the swagger field from path-length to hop-count so the
generated Go struct field (PathLength → HopCount) and JSON key
(path-length → hop-count) align with the Prometheus metric rename
to goldpinger_peers_hop_count from the previous commit.

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:45:31 +00:00
Cooper Ry Lees
832bc7b598 Add UDP probe metrics: packet loss, hop count, and RTT
Add an opt-in UDP echo probe that runs alongside the existing HTTP
ping. Each goldpinger pod listens on a configurable UDP port (default
6969). During each ping cycle, the prober sends N sequenced packets
to the peer's listener, which echoes them back. From the replies we
compute packet loss percentage, path hop count (from IPv4 TTL / IPv6
HopLimit), and average round-trip time.
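
The arithmetic behind two of these numbers can be sketched as follows. This is a minimal illustration, not the code in udp_probe.go: the function names and the assumption that senders start from one of the common initial TTL values (64, 128, 255) are hypothetical.

```go
package main

import "fmt"

// estimateHops guesses how many routers decremented the TTL / HopLimit.
// Assumption: the sender started from the smallest common initial value
// (64, 128 or 255) that is not below the observed one.
func estimateHops(observedTTL int) int {
	for _, initial := range []int{64, 128, 255} {
		if observedTTL <= initial {
			return initial - observedTTL
		}
	}
	return 0 // out-of-range TTL: report no estimate
}

// lossPct returns the percentage of sent packets that were never echoed back.
func lossPct(sent, received int) float64 {
	if sent <= 0 {
		return 0
	}
	return float64(sent-received) * 100 / float64(sent)
}

func main() {
	fmt.Println(estimateHops(61)) // 3
	fmt.Println(lossPct(10, 7))   // 30
}
```

A reply arriving with TTL 64 yields an estimate of 0 hops under this heuristic, which would be consistent with pods that reach each other without an intermediate router.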

New Prometheus metrics:
  - goldpinger_peers_loss_pct      (gauge)     — per-peer UDP loss %
  - goldpinger_peers_path_length   (gauge)     — estimated hop count
  - goldpinger_peers_udp_rtt_ms    (histogram) — UDP RTT in milliseconds

The graph UI shows yellow edges for links with partial loss, and
displays sub-millisecond UDP RTT instead of HTTP latency when UDP
is enabled. Stale metric labels are cleaned up when a pinger is
destroyed so rolled pods don't leave ghost entries.

Configuration (all via env vars, disabled by default):
  UDP_ENABLED=true      enable UDP probing and listener
  UDP_PORT=6969         listener port
  UDP_PACKET_COUNT=10   packets per probe
  UDP_PACKET_SIZE=64    bytes per packet
  UDP_TIMEOUT=1s        probe timeout

New files:
  pkg/goldpinger/udp_probe.go       — echo listener + probe client
  pkg/goldpinger/udp_probe_test.go  — unit tests

Unit tests:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
  goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
  goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0

Simulated 50% loss via ip6tables DROP in pod netns on node-0:
  goldpinger_peers_loss_pct{instance="server",...} 60
  goldpinger_peers_loss_pct{instance="node-1",...} 30
  goldpinger_peers_loss_pct{instance="server2",...} 30

UDP RTT vs HTTP RTT (check_all API):
  node-0 -> server:  udp=2.18ms  http=2ms
  node-2 -> node-2:  udp=0.40ms  http=1ms
  server -> node-0:  udp=0.55ms  http=2ms

Post-rollout stale metrics cleanup verified:
  All 36 edges show 0% loss, no stale pod IPs.
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
2026-03-27 16:05:32 +00:00
tgetachew
14ea96999a add external probes
Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>

make timeout flags backwards compatible

Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>
2022-05-08 22:02:09 -04:00
Mikolaj Pawlikowski
52ff43ec7d Make it return 418 on cluster health problem
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 17:06:57 +00:00
Mikolaj Pawlikowski
5d2ad6ce19 Also add a total number of nodes in a field for convenience
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:31:01 +00:00
Mikolaj Pawlikowski
807f193b07 Regenerate
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:06:26 +00:00
Mikolaj Pawlikowski
e827a8dc67 Regenerate the code
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 12:57:20 +00:00
Mikolaj Pawlikowski
3edecea467 Implement a stub of the new endpoint
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 12:24:42 +00:00
Mikolaj Pawlikowski
3ff592b1e8 Re-generate using the latest swagger gen cli
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 11:30:53 +00:00
Sachin Kamboj
d68d35bbab Add a ping time that gives the last time a node was pinged
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 21:04:43 -04:00
Sachin Kamboj
2e1c799a25 Replace log statements with zap
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-06 22:18:04 -04:00
Sachin Kamboj
aa7eaca30e Get the context from the request and add overall timeouts
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-06 19:38:32 -04:00
Sachin Kamboj
1338f28163 Increment the major version since this is a breaking change
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-03 23:31:05 -04:00
Sachin Kamboj
86febf8295 Update the way of selecting pods
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-03 23:17:26 -04:00
Sachin Kamboj
00cd1e3886 Auto-generated code with the changes to the swagger
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-03 23:09:05 -04:00
Chris Green
1b12b7dc6b Version bump
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-09 08:44:49 -04:00
Chris Green
b64f5152f2 Moved DnsResults into CheckResults
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-05 14:28:40 -04:00
Chris Green
bbac97c17e Refactored swagger.yml for fewer dns requests
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-04 02:54:23 -04:00
Chris Green
9586e16237 Updated models from swagger
Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>
2019-06-02 13:18:05 -04:00
stuart nelson
895af850a1 Make PodSelecter a member on config struct
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2019-03-13 15:30:18 +01:00
stuart nelson
771f303062 Add rendezvous hash for selecting subset of nodes
Select a user-defined number of pods via
rendezvous hash. This is important for larger
clusters, where the metric cardinality explosion
is too much for a single prometheus to handle.
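
As a rough sketch of the technique, assuming FNV-1a as the hash and pod IPs as keys (both are assumptions, and the function names are hypothetical): each pod scores every peer by hashing the pair, then keeps the n highest-scoring peers.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score hashes the (self, peer) pair. FNV-1a is an assumption here;
// any well-distributed hash would serve.
func score(self, peer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(self))
	h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
	h.Write([]byte(peer))
	return h.Sum64()
}

// selectPeers returns the n peers with the highest score for self.
func selectPeers(self string, peers []string, n int) []string {
	sorted := append([]string(nil), peers...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool {
		si, sj := score(self, sorted[i]), score(self, sorted[j])
		if si != sj {
			return si > sj
		}
		return sorted[i] < sorted[j] // deterministic tie-break
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	if n < 0 {
		n = 0
	}
	return sorted[:n]
}

func main() {
	peers := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
	fmt.Println(selectPeers("10.0.0.9", peers, 2))
}
```

Because each peer's score is independent of the others, adding or removing a peer that was not selected leaves a pod's selection unchanged, so the set of emitted label values stays stable as the cluster changes.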

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2019-03-13 15:30:18 +01:00
Mikolaj Pawlikowski
f5c2763000 add an endpoint for generating a /heatmap.png
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-02-21 13:51:20 +00:00
Mikolaj Pawlikowski
513d8ee489 newer go-swagger has some fancier templates
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-02-20 16:44:25 +00:00
Mikolaj Pawlikowski
15b4598606 make swagger to update the response-time-ms field
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-02-20 16:43:42 +00:00
tfinethy
871b86471d Include response time in PodResult
Signed-off-by: tfinethy <tfinethy@cogolabs.com>

Remove all swagger updates

Change response-time to match status-code formatting

Switch to float64 and use milliseconds as the unit
2019-02-17 12:38:53 -05:00
Ivan Kalita
8bd1672e07 Remove license header from healthz-related autogenerated files
According to #32.

Signed-off-by: Ivan Kalita <kaduev13@gmail.com>
2018-12-21 14:45:54 +01:00
Mikolaj Pawlikowski
1dc28da389 Merge branch 'master' into 8-healthz-endpoint
2018-12-21 14:42:48 +01:00
Ivan Kalita
1e324f78b9 Remove license header from autogenerated files
According to #32.

Signed-off-by: Ivan Kalita <kaduev13@gmail.com>
2018-12-21 14:28:44 +01:00
Ivan Kalita
d40bed4d3e Fix of the healthz endpoint handler
According to #8.

Signed-off-by: Ivan Kalita <kaduev13@gmail.com>
2018-12-20 18:17:09 +01:00
Ivan Kalita
ebbe41f4ef Add simple healthz endpoint
According to #8.

Signed-off-by: Ivan Kalita <kaduev13@gmail.com>
2018-12-20 18:12:18 +01:00
Kevin P. Fleming
fa643e9be8 Initial commit
2018-12-04 13:33:45 -05:00