goldpinger

mirror of https://github.com/bloomberg/goldpinger.git synced 2026-05-24 17:42:49 +00:00

Author	SHA1	Message	Date
Cooper Ry Lees	145d2bf000	Rename PathLength to HopCount in swagger model and UI

Rename the swagger field from path-length to hop-count so the generated Go struct field (PathLength → HopCount) and JSON key (path-length → hop-count) align with the Prometheus metric rename to goldpinger_peers_hop_count from the previous commit.

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:45:31 +00:00
Cooper Ry Lees	832bc7b598	Add UDP probe metrics: packet loss, hop count, and RTT

Add an opt-in UDP echo probe that runs alongside the existing HTTP ping. Each goldpinger pod listens on a configurable UDP port (default 6969). During each ping cycle, the prober sends N sequenced packets to the peer's listener, which echoes them back. From the replies we compute packet loss percentage, path hop count (from IPv4 TTL / IPv6 HopLimit), and average round-trip time.

New Prometheus metrics:
- goldpinger_peers_loss_pct (gauge): per-peer UDP loss %
- goldpinger_peers_path_length (gauge): estimated hop count
- goldpinger_peers_udp_rtt_ms (histogram): UDP RTT in milliseconds

The graph UI shows yellow edges for links with partial loss, and displays sub-millisecond UDP RTT instead of HTTP latency when UDP is enabled. Stale metric labels are cleaned up when a pinger is destroyed so rolled pods don't leave ghost entries.

Configuration (all via env vars, disabled by default):
  UDP_ENABLED=true      enable UDP probing and listener
  UDP_PORT=6969         listener port
  UDP_PACKET_COUNT=10   packets per probe
  UDP_PACKET_SIZE=64    bytes per packet
  UDP_TIMEOUT=1s        probe timeout

New files:
  pkg/goldpinger/udp_probe.go        echo listener + probe client
  pkg/goldpinger/udp_probe_test.go   unit tests

Unit tests:
```
=== RUN TestProbeUDP_NoLoss
    udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
  goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
  goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0

Simulated 50% loss via ip6tables DROP in pod netns on node-0:
  goldpinger_peers_loss_pct{instance="server",...} 60
  goldpinger_peers_loss_pct{instance="node-1",...} 30
  goldpinger_peers_loss_pct{instance="server2",...} 30

UDP RTT vs HTTP RTT (check_all API):
  node-0 -> server: udp=2.18ms http=2ms
  node-2 -> node-2: udp=0.40ms http=1ms
  server -> node-0: udp=0.55ms http=2ms

Post-rollout stale metrics cleanup verified:
  All 36 edges show 0% loss, no stale pod IPs.
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>	2026-03-27 16:05:32 +00:00
tgetachew	14ea96999a	add external probes Signed-off-by: kitfoman <thaddeusgetachew@gmail.com> make timeout flags backwards compatible Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>	2022-05-08 22:02:09 -04:00
Mikolaj Pawlikowski	52ff43ec7d	Make it return 418 on cluster health problem Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 17:06:57 +00:00
Mikolaj Pawlikowski	5d2ad6ce19	Also add a total number of nodes in a field for convenience Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:31:01 +00:00
Mikolaj Pawlikowski	b0730e88df	Actually, just keep it simple Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:05:53 +00:00
Mikolaj Pawlikowski	c7a7008bf5	Separately for the pods and hosts Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 12:56:27 +00:00
Mikolaj Pawlikowski	eb3113aa7f	Add the schema for the new endpoint Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 11:21:02 +00:00
Sachin Kamboj	d68d35bbab	Add a ping time that gives the last time a node was pinged Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 21:04:43 -04:00
Sachin Kamboj	00cd1e3886	Auto-generated code with the changes to the swagger Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-03 23:09:05 -04:00
Sachin Kamboj	4152784d21	Add a PodIP to the results now that we are using PodName as a key Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-03 23:00:48 -04:00
Chris Green	1b12b7dc6b	Version bump Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>	2019-06-09 08:44:49 -04:00
Chris Green	b64f5152f2	Moved DnsResults into CheckResults Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>	2019-06-05 14:28:40 -04:00
Chris Green	bbac97c17e	Refactored swagger.yml for fewer dns requests Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>	2019-06-04 02:54:23 -04:00
Chris Green	6541250aa9	Updated swagger.yml for map of dnsresults Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>	2019-06-02 11:54:20 -04:00
Chris Green	aa66c94d47	Initial thoughts on swagger change Signed-off-by: Chris Green <34572557+cgreen12@users.noreply.github.com>	2019-06-01 11:41:01 -04:00
Mikolaj Pawlikowski	2bcff8b2d1	make the response time field explicit in its usage of milliseconds Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2019-02-20 16:41:15 +00:00
tfinethy	871b86471d	Include response time in PodResult Signed-off-by: tfinethy <tfinethy@cogolabs.com> Remove all swagger updates Change response-time to match status-code formatting Switch to float64 and use milliseconds as the unit	2019-02-17 12:38:53 -05:00
Ivan Kalita	ebbe41f4ef	Add simple healthz endpoint According to #8. Signed-off-by: Ivan Kalita <kaduev13@gmail.com>	2018-12-20 18:12:18 +01:00
Kevin P. Fleming	fa643e9be8	Initial commit	2018-12-04 13:33:45 -05:00

20 Commits
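
The UDP probe commit (832bc7b598) derives a loss percentage and a hop count from the echoed packets but the log does not show how. The sketch below is one plausible way to compute both and is not taken from udp_probe.go. The names estimateHops and lossPct, and the heuristic of assuming the sender started from the smallest common initial TTL (64, 128, or 255) at or above the observed value, are assumptions made for illustration.

```go
package main

import "fmt"

// estimateHops guesses how many routers a packet crossed from the TTL
// (IPv4) or HopLimit (IPv6) seen on arrival. ASSUMPTION: the sender used
// the smallest common initial value (64, 128, 255) that is >= the
// observed one. Returns -1 when the observed value is out of range.
func estimateHops(observedTTL int) int {
	if observedTTL <= 0 || observedTTL > 255 {
		return -1
	}
	for _, initial := range []int{64, 128, 255} {
		if observedTTL <= initial {
			return initial - observedTTL
		}
	}
	return -1
}

// lossPct returns the percentage of probe packets that were not echoed
// back. The received count is clamped to [0, sent] so duplicates or bad
// input cannot produce a value outside 0..100.
func lossPct(sent, received int) float64 {
	if sent <= 0 {
		return 0
	}
	if received < 0 {
		received = 0
	}
	if received > sent {
		received = sent
	}
	return float64(sent-received) * 100 / float64(sent)
}

func main() {
	fmt.Println(estimateHops(61)) // prints 3
	fmt.Println(lossPct(10, 7))   // prints 30
}
```

With the default UDP_PACKET_COUNT=10, each lost packet moves the loss gauge by 10 points under this formula, which matches the granularity of the 30 and 60 values in the cluster test output.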