goldpinger

mirror of https://github.com/bloomberg/goldpinger.git synced 2026-04-12 13:26:51 +00:00

Author	SHA1	Message	Date
Cooper Ry Lees	145d2bf000	Rename PathLength to HopCount in swagger model and UI Rename the swagger field from path-length to hop-count so the generated Go struct field (PathLength → HopCount) and JSON key (path-length → hop-count) align with the Prometheus metric rename to goldpinger_peers_hop_count from the previous commit. Signed-off-by: Cooper Ry Lees <me@cooperlees.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:45:31 +00:00
Cooper Ry Lees	641b658f23	Address PR #164 review feedback Concurrent HTTP + UDP pings: HTTP ping and UDP probe now run in separate goroutines via sync.WaitGroup, so UDP timeout doesn't add to the ping cycle latency. (skamboj on pinger.go:124) Remove duplicate log: Removed the "UDP echo listener started" log from main.go since StartUDPListener already logs it. (skamboj on main.go:191) Prometheus base units (seconds): Renamed goldpinger_peers_udp_rtt_ms back to goldpinger_peers_udp_rtt_s with sub-millisecond histogram buckets (.0001s to 1s), per Prometheus naming conventions. RTT is computed in seconds internally and only converted to ms for the JSON API. (skamboj on stats.go:150) Rename path_length to hop_count: goldpinger_peers_path_length → goldpinger_peers_hop_count, and SetPeerPathLength → SetPeerHopCount. (skamboj on stats.go:139) UDP buffer constant and packet size clamping: Added udpMaxPacketSize=1500 constant, documented as standard Ethernet MTU — the largest UDP payload that survives most networks without fragmentation. Used for both listener and prober receive buffers. ProbeUDP now clamps UDP_PACKET_SIZE to udpMaxPacketSize to prevent silent truncation if someone configures a size > MTU. (skamboj on udp_probe.go:54) Guard count=0: ProbeUDP returns an error immediately if count <= 0 instead of dividing by zero. (skamboj on udp_probe.go:176) UDP error counter: Added goldpinger_udp_errors_total counter (labels: goldpinger_instance, host). CountUDPError is called on dial failures and send errors. (skamboj on udp_probe.go:115) Test: random source port for full loss: TestProbeUDP_FullLoss now binds an ephemeral port and closes it, instead of assuming port 19999 is free. 
(skamboj on udp_probe_test.go:56) Test: partial loss validation: New TestProbeUDP_PartialLoss uses a lossy echo listener that drops every Nth packet to validate loss calculations are exact: drop every 2nd → 50.0%, every 3rd → 33.3%, every 5th → 20.0%, every 10th → 10.0% (skamboj on udp_probe_test.go:96) Test: zero count: New TestProbeUDP_ZeroCount verifies error is returned for count=0. Test results: ``` === RUN TestProbeUDP_NoLoss udp_probe_test.go:88: avg UDP RTT: 0.0816 ms --- PASS: TestProbeUDP_NoLoss (0.00s) === RUN TestProbeUDP_FullLoss --- PASS: TestProbeUDP_FullLoss (0.00s) === RUN TestProbeUDP_PartialLoss === RUN TestProbeUDP_PartialLoss/drop_every_2nd_(50%) udp_probe_test.go:134: loss: 50.0% (expected 50.0%) === RUN TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%) udp_probe_test.go:134: loss: 33.3% (expected 33.3%) === RUN TestProbeUDP_PartialLoss/drop_every_5th_(20%) udp_probe_test.go:134: loss: 20.0% (expected 20.0%) === RUN TestProbeUDP_PartialLoss/drop_every_10th_(10%) udp_probe_test.go:134: loss: 10.0% (expected 10.0%) --- PASS: TestProbeUDP_PartialLoss (8.00s) === RUN TestProbeUDP_ZeroCount --- PASS: TestProbeUDP_ZeroCount (0.00s) === RUN TestProbeUDP_PacketFormat --- PASS: TestProbeUDP_PacketFormat (0.00s) === RUN TestEstimateHops --- PASS: TestEstimateHops (0.00s) PASS ``` Signed-off-by: Cooper Ry Lees <me@cooperlees.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:37:52 +00:00
Cooper Ry Lees	832bc7b598	Add UDP probe metrics: packet loss, hop count, and RTT Add an opt-in UDP echo probe that runs alongside the existing HTTP ping. Each goldpinger pod listens on a configurable UDP port (default 6969). During each ping cycle, the prober sends N sequenced packets to the peer's listener, which echoes them back. From the replies we compute packet loss percentage, path hop count (from IPv4 TTL / IPv6 HopLimit), and average round-trip time. New Prometheus metrics: - goldpinger_peers_loss_pct (gauge) — per-peer UDP loss % - goldpinger_peers_path_length (gauge) — estimated hop count - goldpinger_peers_udp_rtt_ms (histogram) — UDP RTT in milliseconds The graph UI shows yellow edges for links with partial loss, and displays sub-millisecond UDP RTT instead of HTTP latency when UDP is enabled. Stale metric labels are cleaned up when a pinger is destroyed so rolled pods don't leave ghost entries. Configuration (all via env vars, disabled by default): UDP_ENABLED=true enable UDP probing and listener UDP_PORT=6969 listener port UDP_PACKET_COUNT=10 packets per probe UDP_PACKET_SIZE=64 bytes per packet UDP_TIMEOUT=1s probe timeout New files: pkg/goldpinger/udp_probe.go — echo listener + probe client pkg/goldpinger/udp_probe_test.go — unit tests Unit tests: ``` === RUN TestProbeUDP_NoLoss udp_probe_test.go:51: avg UDP RTT: 0.0823 ms --- PASS: TestProbeUDP_NoLoss (0.00s) === RUN TestProbeUDP_FullLoss --- PASS: TestProbeUDP_FullLoss (0.00s) === RUN TestProbeUDP_PacketFormat --- PASS: TestProbeUDP_PacketFormat (0.00s) === RUN TestEstimateHops --- PASS: TestEstimateHops (0.00s) PASS ``` Cluster test (6-node IPv6 k8s, UDP_ENABLED=true): ``` Prometheus metrics (healthy cluster, 0% loss): goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0 goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0 Simulated 50% loss via ip6tables DROP in pod netns on node-0: goldpinger_peers_loss_pct{instance="server",...} 60 
goldpinger_peers_loss_pct{instance="node-1",...} 30 goldpinger_peers_loss_pct{instance="server2",...} 30 UDP RTT vs HTTP RTT (check_all API): node-0 -> server: udp=2.18ms http=2ms node-2 -> node-2: udp=0.40ms http=1ms server -> node-0: udp=0.55ms http=2ms Post-rollout stale metrics cleanup verified: All 36 edges show 0% loss, no stale pod IPs. ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Cooper Ry Lees <me@cooperlees.com>	2026-03-27 16:05:32 +00:00
j4ckstraw	acac9dee8b	Use protobuf and add resourceVersion to listOption 1. Communicate with kube-apiserver using protobuf. 2. Set resourceVersion=0 in listOption; without a resourceVersion, a list call forces kube-apiserver to retrieve data from etcd. In a Kubernetes cluster with 100+ nodes and 7500+ pods, this patch reduces kube-apiserver CPU utilization by 5-10%. Signed-off-by: j4ckstraw <j4ckstraw@foxmail.com>	2023-06-17 18:42:37 +08:00
Maxime Leroy	a913318ae3	feat: support advanced zap configuration Signed-off-by: Maxime Leroy <19607336+maxime1907@users.noreply.github.com>	2022-10-25 12:30:35 +02:00
tgetachew	14ea96999a	add external probes Signed-off-by: kitfoman <thaddeusgetachew@gmail.com> make timeout flags backwards compatible Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>	2022-05-08 22:02:09 -04:00
Tyler Lloyd	05ab610f10	cleanup IPv6 references only check node IPs when determining hostIP IP_VERSIONS not IP_FAMILIES Signed-off-by: Tyler Lloyd <tyler.lloyd@microsoft.com>	2022-05-08 22:02:03 -04:00
wanglijie6	72832bcbc4	Fix pingers being removed as not found. The heatmap broke on every refreshPeriod because pingers were deleted when the exists check failed. updatePingers checks on every refreshPeriod whether each pod still exists or is new, and updates the pingers accordingly. The exists function failed to detect that a pod exists, so fix it. Signed-off-by: wanglijie6 <wanglijie6@xiaomi.com>	2022-04-15 19:14:53 +08:00
wanglijie6	7609a3ab3f	Add --display-nodename option to control UI display Add GoldpingerConfig.DisplayNodeName to control UI display, default is `false`, which means to display podName Signed-off-by: wanglijie6 <wanglijie6@xiaomi.com>	2022-04-13 10:54:59 +08:00
wanglijie6	588c1a0173	Show hostName rather than podName Signed-off-by: wanglijie6 <wanglijie6@xiaomi.com>	2022-04-12 20:06:48 +08:00
Mikolaj Pawlikowski	842dfadeea	Merge branch 'master' into list-running-pods	2021-11-08 15:11:18 +00:00
Tyler Lloyd	34b78537c9	changed return type to IPFamily Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>	2021-11-08 09:38:38 -05:00
Tyler Lloyd	cfd26c8d26	changing to IP_VERSIONS Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>	2021-11-05 10:40:37 -04:00
Evan Baker	980a85b04d	only list running pods Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>	2021-11-03 18:06:29 -05:00
Tyler Lloyd	b88d0f3ec5	don't use Sprintf for creating host address Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>	2021-11-03 16:58:38 -04:00
Tyler Lloyd	03dd6706b8	add node IP cache and change to get node Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>	2021-11-03 16:58:38 -04:00
Tyler Lloyd	5d2070fad1	get pod and host IPv6 IPs when USE_IPV6 is set Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>	2021-11-03 16:58:38 -04:00
Mikolaj Pawlikowski	ed40304dd8	1 when healthy, 0 when unhealthy (bool, not return code) Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-19 14:45:58 +00:00
Mikolaj Pawlikowski	e6aa196232	Compare the actual host ips Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-19 14:26:07 +00:00
Mikolaj Pawlikowski	948b67a09b	Break here, continue there Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-19 13:46:37 +00:00
Mikolaj Pawlikowski	d0e2e25ad2	Make it a bit more obvious Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-19 13:43:40 +00:00
Mikolaj Pawlikowski	8d5262d316	Make the naming a little less bad, remove the break statements Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-19 13:25:39 +00:00
Mikolaj Pawlikowski	407d201591	Add an overall metric goldpinger_cluster_health_total (pings + DNS check) Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-19 12:31:49 +00:00
Mikolaj Pawlikowski	1f5589db8c	Simplify Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-16 17:24:56 +00:00
Mikolaj Pawlikowski	bc94f4e058	Forgot to set the default to true, if nothing bad happens Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-16 17:21:46 +00:00
Mikolaj Pawlikowski	634e04ec44	Handle the situation when one of the nodes returns an error Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 17:51:12 +00:00
Mikolaj Pawlikowski	13ae09d93e	Compare all nodes return the expected nodes that Kubernetes returns Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 17:16:58 +00:00
Mikolaj Pawlikowski	52ff43ec7d	Make it return 418 on cluster health problem Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 17:06:57 +00:00
Mikolaj Pawlikowski	bfc4603e45	Always return the DurationNs Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:38:32 +00:00
Mikolaj Pawlikowski	ad828cf5a3	And implement it Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:31:39 +00:00
Mikolaj Pawlikowski	5d2ad6ce19	Also add a total number of nodes in a field for convenience Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:31:01 +00:00
Mikolaj Pawlikowski	1310f9b12b	Add the generated at field Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:27:19 +00:00
Mikolaj Pawlikowski	f93526c58f	Lint Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:27:07 +00:00
Mikolaj Pawlikowski	1f2f00ba35	First draft of implementing it Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:24:58 +00:00
Mikolaj Pawlikowski	807f193b07	Regenerate Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 13:06:26 +00:00
Mikolaj Pawlikowski	e827a8dc67	Regenerate the code Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 12:57:20 +00:00
Mikolaj Pawlikowski	3edecea467	Implement a stub of the new endpoint Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 12:24:42 +00:00
Mikolaj Pawlikowski	3ff592b1e8	Re-generate using the latest swagger gen cli Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2021-03-12 11:30:53 +00:00
Seth Pellegrino	07ef524aed	feat: configurable namespace for pod discovery Adds a configuration option to allow for cross-namespace pings. Signed-off-by: Seth Pellegrino <seth@verica.io>	2020-11-23 14:01:25 -08:00
Sachin Kamboj	7e60ee675a	Simplify updateCounters, don't try to maintain a running count Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-08 07:56:04 -04:00
Sachin Kamboj	d68d35bbab	Add a ping time that gives the last time a node was pinged Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 21:04:43 -04:00
Sachin Kamboj	2a78a9cec5	Don't ping all pods on call, return existing data Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 20:49:01 -04:00
Sachin Kamboj	9db241d67d	Update the set of pingers at regular intervals from the k8s API server Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 19:54:19 -04:00
Sachin Kamboj	40f57b1a4e	Get rid of the lock and keep a running count of healthy/unhealthy nodes Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 08:45:14 -04:00
Sachin Kamboj	8a40aee927	Have the updater continuously ping pods and collate the results Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 08:02:36 -04:00
Sachin Kamboj	0690ac21a2	Command line options for adding a jitter-factor Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 07:55:48 -04:00
Sachin Kamboj	d0dfd3e493	Add code to continuously ping the pods and send the results over a channel Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-07 07:54:10 -04:00
Sachin Kamboj	9ae3e78035	Better structured logging Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-06 22:39:57 -04:00
Sachin Kamboj	2e1c799a25	Replace log statements with zap Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-06 22:18:04 -04:00
Sachin Kamboj	1c6362b2a9	Add a context to the ping results Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-06 19:43:22 -04:00
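
The UDP probe commits above (832bc7b598 and 641b658f23) describe computing a loss percentage from sequenced echo replies, rejecting count <= 0, clamping UDP_PACKET_SIZE to udpMaxPacketSize, and estimating hop count from the received TTL / HopLimit. The sketch below illustrates those four ideas only. It is not the code in pkg/goldpinger/udp_probe.go: the helper names clampPacketSize, lossPercent, and estimateHops are hypothetical, and the TTL heuristic (assume the sender started from the nearest common initial value of 64, 128, or 255) is an assumption, since the log does not say how the real estimate is made.

```go
package main

import (
	"errors"
	"fmt"
)

// udpMaxPacketSize uses the value named in commit 641b658f23 (standard Ethernet MTU).
const udpMaxPacketSize = 1500

// clampPacketSize is a hypothetical helper: it caps a configured packet size
// at udpMaxPacketSize so a reply cannot be silently truncated by the receive buffer.
func clampPacketSize(size int) int {
	if size > udpMaxPacketSize {
		return udpMaxPacketSize
	}
	return size
}

// lossPercent is a hypothetical helper: it returns the percentage of probes
// that got no echo reply. It returns an error for sent <= 0 instead of
// dividing by zero.
func lossPercent(sent, received int) (float64, error) {
	if sent <= 0 {
		return 0, errors.New("packet count must be positive")
	}
	// Keep received within [0, sent] so the result stays within 0..100.
	if received < 0 {
		received = 0
	}
	if received > sent {
		received = sent
	}
	return float64(sent-received) * 100 / float64(sent), nil
}

// estimateHops is an assumed heuristic, not the project's confirmed logic:
// pick the smallest common initial TTL (64, 128, 255) that is >= the observed
// value and report the difference as the number of hops traversed.
func estimateHops(ttl int) int {
	for _, initial := range []int{64, 128, 255} {
		if ttl <= initial {
			return initial - ttl
		}
	}
	return 0 // observed value is outside the valid TTL range
}

func main() {
	loss, err := lossPercent(10, 5)
	fmt.Println("loss:", loss, "err:", err)

	_, err = lossPercent(0, 0)
	fmt.Println("zero count err:", err)

	fmt.Println("clamped size:", clampPacketSize(9000))
	fmt.Println("estimated hops:", estimateHops(61))
}
```

Returning an error from lossPercent, rather than reporting 0% or 100% for an empty probe, keeps a misconfigured UDP_PACKET_COUNT visible to the caller, which matches the intent stated for the count guard in 641b658f23.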


107 Commits