goldpinger

mirror of https://github.com/bloomberg/goldpinger.git synced 2026-05-26 18:42:48 +00:00

Author	SHA1	Message	Date
Cooper Ry Lees	943609d9a5	Add UDP sequence number tracking for duplicate and out-of-order detection Track received sequence numbers in a map during each UDP probe to detect duplicate replies (same seq seen twice) and out-of-order delivery (seq lower than the highest previously seen). The receive loop now uses a recvState struct with a processPacket method that validates magic, checks for duplicate seq numbers, and tracks ordering. Duplicates are not counted toward the received total, and the loop allows up to 2*count iterations to handle them without prematurely timing out. New Prometheus counters: - goldpinger_udp_duplicates_total — duplicate reply packets - goldpinger_udp_out_of_order_total — out-of-order reply packets Both are cumulative counters with labels (goldpinger_instance, host_ip, pod_ip), incremented per-probe by the number of events detected. Non-zero values indicate network-level packet duplication or path asymmetry worth investigating. New tests: - TestProbeUDP_Duplicates: echo listener sends every packet twice, verifies duplicates are detected and don't inflate received count - TestProbeUDP_OutOfOrder: echo listener buffers pairs and returns them in reverse order, verifies out-of-order is detected Test results: ``` === RUN TestProbeUDP_NoLoss udp_probe_test.go:158: avg UDP RTT: 0.1011 ms --- PASS: TestProbeUDP_NoLoss (0.00s) === RUN TestProbeUDP_FullLoss --- PASS: TestProbeUDP_FullLoss (0.00s) === RUN TestProbeUDP_PartialLoss === RUN TestProbeUDP_PartialLoss/drop_every_2nd_(50%) udp_probe_test.go:204: loss: 50.0% (expected 50.0%) === RUN TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%) udp_probe_test.go:204: loss: 33.3% (expected 33.3%) === RUN TestProbeUDP_PartialLoss/drop_every_5th_(20%) udp_probe_test.go:204: loss: 20.0% (expected 20.0%) === RUN TestProbeUDP_PartialLoss/drop_every_10th_(10%) udp_probe_test.go:204: loss: 10.0% (expected 10.0%) --- PASS: TestProbeUDP_PartialLoss (8.01s) === RUN TestProbeUDP_ZeroCount --- PASS: TestProbeUDP_ZeroCount (0.00s) === RUN TestProbeUDP_PacketFormat --- PASS: TestProbeUDP_PacketFormat (0.00s) === RUN TestProbeUDP_Duplicates udp_probe_test.go:246: duplicates detected: 4 --- PASS: TestProbeUDP_Duplicates (0.00s) === RUN TestProbeUDP_OutOfOrder udp_probe_test.go:263: out-of-order detected: 5, duplicates: 0 --- PASS: TestProbeUDP_OutOfOrder (0.00s) === RUN TestEstimateHops --- PASS: TestEstimateHops (0.00s) PASS ``` Signed-off-by: Cooper Ry Lees <me@cooperlees.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 17:24:45 +00:00
Cooper Ry Lees	641b658f23	Address PR #164 review feedback Concurrent HTTP + UDP pings: HTTP ping and UDP probe now run in separate goroutines via sync.WaitGroup, so UDP timeout doesn't add to the ping cycle latency. (skamboj on pinger.go:124) Remove duplicate log: Removed the "UDP echo listener started" log from main.go since StartUDPListener already logs it. (skamboj on main.go:191) Prometheus base units (seconds): Renamed goldpinger_peers_udp_rtt_ms back to goldpinger_peers_udp_rtt_s with sub-millisecond histogram buckets (.0001s to 1s), per Prometheus naming conventions. RTT is computed in seconds internally and only converted to ms for the JSON API. (skamboj on stats.go:150) Rename path_length to hop_count: goldpinger_peers_path_length → goldpinger_peers_hop_count, and SetPeerPathLength → SetPeerHopCount. (skamboj on stats.go:139) UDP buffer constant and packet size clamping: Added udpMaxPacketSize=1500 constant, documented as standard Ethernet MTU — the largest UDP payload that survives most networks without fragmentation. Used for both listener and prober receive buffers. ProbeUDP now clamps UDP_PACKET_SIZE to udpMaxPacketSize to prevent silent truncation if someone configures a size > MTU. (skamboj on udp_probe.go:54) Guard count=0: ProbeUDP returns an error immediately if count <= 0 instead of dividing by zero. (skamboj on udp_probe.go:176) UDP error counter: Added goldpinger_udp_errors_total counter (labels: goldpinger_instance, host). CountUDPError is called on dial failures and send errors. (skamboj on udp_probe.go:115) Test: random source port for full loss: TestProbeUDP_FullLoss now binds an ephemeral port and closes it, instead of assuming port 19999 is free. (skamboj on udp_probe_test.go:56) Test: partial loss validation: New TestProbeUDP_PartialLoss uses a lossy echo listener that drops every Nth packet to validate loss calculations are exact: drop every 2nd → 50.0%, every 3rd → 33.3%, every 5th → 20.0%, every 10th → 10.0% (skamboj on udp_probe_test.go:96) Test: zero count: New TestProbeUDP_ZeroCount verifies error is returned for count=0. Test results: ``` === RUN TestProbeUDP_NoLoss udp_probe_test.go:88: avg UDP RTT: 0.0816 ms --- PASS: TestProbeUDP_NoLoss (0.00s) === RUN TestProbeUDP_FullLoss --- PASS: TestProbeUDP_FullLoss (0.00s) === RUN TestProbeUDP_PartialLoss === RUN TestProbeUDP_PartialLoss/drop_every_2nd_(50%) udp_probe_test.go:134: loss: 50.0% (expected 50.0%) === RUN TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%) udp_probe_test.go:134: loss: 33.3% (expected 33.3%) === RUN TestProbeUDP_PartialLoss/drop_every_5th_(20%) udp_probe_test.go:134: loss: 20.0% (expected 20.0%) === RUN TestProbeUDP_PartialLoss/drop_every_10th_(10%) udp_probe_test.go:134: loss: 10.0% (expected 10.0%) --- PASS: TestProbeUDP_PartialLoss (8.00s) === RUN TestProbeUDP_ZeroCount --- PASS: TestProbeUDP_ZeroCount (0.00s) === RUN TestProbeUDP_PacketFormat --- PASS: TestProbeUDP_PacketFormat (0.00s) === RUN TestEstimateHops --- PASS: TestEstimateHops (0.00s) PASS ``` Signed-off-by: Cooper Ry Lees <me@cooperlees.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:37:52 +00:00
Cooper Ry Lees	832bc7b598	Add UDP probe metrics: packet loss, hop count, and RTT Add an opt-in UDP echo probe that runs alongside the existing HTTP ping. Each goldpinger pod listens on a configurable UDP port (default 6969). During each ping cycle, the prober sends N sequenced packets to the peer's listener, which echoes them back. From the replies we compute packet loss percentage, path hop count (from IPv4 TTL / IPv6 HopLimit), and average round-trip time. New Prometheus metrics: - goldpinger_peers_loss_pct (gauge) — per-peer UDP loss % - goldpinger_peers_path_length (gauge) — estimated hop count - goldpinger_peers_udp_rtt_ms (histogram) — UDP RTT in milliseconds The graph UI shows yellow edges for links with partial loss, and displays sub-millisecond UDP RTT instead of HTTP latency when UDP is enabled. Stale metric labels are cleaned up when a pinger is destroyed so rolled pods don't leave ghost entries. Configuration (all via env vars, disabled by default): UDP_ENABLED=true enable UDP probing and listener UDP_PORT=6969 listener port UDP_PACKET_COUNT=10 packets per probe UDP_PACKET_SIZE=64 bytes per packet UDP_TIMEOUT=1s probe timeout New files: pkg/goldpinger/udp_probe.go — echo listener + probe client pkg/goldpinger/udp_probe_test.go — unit tests Unit tests: ``` === RUN TestProbeUDP_NoLoss udp_probe_test.go:51: avg UDP RTT: 0.0823 ms --- PASS: TestProbeUDP_NoLoss (0.00s) === RUN TestProbeUDP_FullLoss --- PASS: TestProbeUDP_FullLoss (0.00s) === RUN TestProbeUDP_PacketFormat --- PASS: TestProbeUDP_PacketFormat (0.00s) === RUN TestEstimateHops --- PASS: TestEstimateHops (0.00s) PASS ``` Cluster test (6-node IPv6 k8s, UDP_ENABLED=true): ``` Prometheus metrics (healthy cluster, 0% loss): goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0 goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0 Simulated 50% loss via ip6tables DROP in pod netns on node-0: goldpinger_peers_loss_pct{instance="server",...} 60 goldpinger_peers_loss_pct{instance="node-1",...} 30 goldpinger_peers_loss_pct{instance="server2",...} 30 UDP RTT vs HTTP RTT (check_all API): node-0 -> server: udp=2.18ms http=2ms node-2 -> node-2: udp=0.40ms http=1ms server -> node-0: udp=0.55ms http=2ms Post-rollout stale metrics cleanup verified: All 36 edges show 0% loss, no stale pod IPs. ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Cooper Ry Lees <me@cooperlees.com>	2026-03-27 16:05:32 +00:00
leundai	ba779f50e7	feat: Add deepwiki badge Small enhancement to improve quick onboarding for the curious Signed-off-by: leundai <leogalindofrias@gmail.com>	2025-07-12 14:48:02 -04:00
skamboj	dbd1f5f295	Merge branch 'master' into add-helm-chart	2024-05-13 15:41:59 -04:00
ABC Taylor	562df92c3a	Add default namespace `default` to ServiceAccount definition, to catch case where users find-replace `default` with another namespace but don't change it for the ServiceAccount Signed-Off-By: ABC Taylor <abc@abctaylor.com>	2024-04-11 08:37:09 +01:00
Derek Brown	4af6666853	feat: add helm chart Signed-off-by: Derek Brown <derektbrown@users.noreply.github.com>	2023-09-25 15:51:14 -07:00
Will Daly	1f3ad0acc9	Remove deprecated rbac.authorization.k8s.io/v1beta1 This commit updates the README and examples to use rbac.authorization.k8s.io/v1 instead, which has been available since K8s 1.8 rbac.authorization.k8s.io/v1beta1 was deprecated in K8s 1.17 and removed in K8s 1.22. Reference: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#rbac-resources-v122 Signed-off-by: Will Daly <widaly@microsoft.com>	2023-05-03 11:29:42 -07:00
tgetachew	14ea96999a	add external probes Signed-off-by: kitfoman <thaddeusgetachew@gmail.com> make timeout flags backwards compatible Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>	2022-05-08 22:02:09 -04:00
Tyler Lloyd	05ab610f10	cleanup IPv6 references only check node IPs when determining hostIP IP_VERSIONS not IP_FAMILIES Signed-off-by: Tyler Lloyd <tyler.lloyd@microsoft.com>	2022-05-08 22:02:03 -04:00
Mikolaj Pawlikowski	76f054ba50	Fix the build badge Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2022-02-04 10:51:43 +00:00
Mike Tougeron	6aee150cd0	Multi-arch builds for goldpinger Signed-off-by: Mike Tougeron <tougeron@adobe.com>	2022-01-15 17:12:42 -08:00
Tyler Lloyd	5b080c7087	update readme for IPv6 example Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>	2021-11-03 16:58:38 -04:00
Johannes M. Scheuermann	97ec159852	Correct example in the readme Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>	2020-07-22 14:00:45 +02:00
Mikolaj Pawlikowski	6e29c16148	Specify the size Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:24:17 +01:00
Mikolaj Pawlikowski	f83c1de387	MOAR hyperlinks Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:22:44 +01:00
Mikolaj Pawlikowski	e24f789b68	Hyperlink all the things Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:22:05 +01:00
Mikolaj Pawlikowski	3a922f4278	Some more polish Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:20:16 +01:00
Mikolaj Pawlikowski	24d74544e0	Well, I clearly don't know my emoticons Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:19:11 +01:00
Mikolaj Pawlikowski	28f7655170	Add a note about Chaos Engineering and the authors Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:17:28 +01:00
Mikolaj Pawlikowski	e9d3f8cd2b	Refresh the readme a little bit Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-06-10 22:17:09 +01:00
Mikolaj Pawlikowski	8790d3e7c4	Update README to use v3.0.0 of Goldpinger Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-05-08 11:44:30 +01:00
Mikolaj Pawlikowski	34d84b233c	Merge branch 'master' into migrate-to-go-modules	2020-04-02 17:56:00 +01:00
Mikolaj Pawlikowski	6844a8d2b4	Merge branch 'master' into master	2020-04-02 17:36:43 +01:00
Sachin Kamboj	436c1a7243	Update the README to remove references to dep Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>	2020-04-01 21:00:31 -04:00
Sandeep Mendiratta	0870833cf5	minor typo correction in heat map page. Also updated version and Readme with new version Signed-off-by: Sandeep Mendiratta <smendiratta@yahoo.com>	2020-03-22 16:28:22 -05:00
Mikolaj Pawlikowski	94dc18c9c2	Update the version in README Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2020-03-11 13:05:22 +00:00
Joe Lei	60436daf18	fix prometheus ruler expr Signed-off-by: Joe Lei <thezero12@hotmail.com>	2020-02-18 10:59:43 +08:00
suleiman abualrob	6315c71a2c	Update README.md Kubernetes master node by most of default installation are taint, so goldPinger DaemonSet will not deployed to master, in order to make it run on master nodes also, you have to tolerate the taint Signed-off-by: suleimanWA <suleiman-94@hotmail.com>	2019-11-26 20:42:27 +02:00
suleiman abualrob	ec2155878a	Update README.md If you have Prometheus in your environment, adding these annotation will let Prometheus auto-discovery fetch your metrics automatically from service-name:port/metrics Signed-off-by: suleimanWA <suleiman-94@hotmail.com>	2019-11-26 19:03:14 +02:00
Ángel Barrera Sánchez	e552b236a0	Change documentation example Signed-off-by: Ángel Barrera Sánchez <angel@sighup.io>	2019-11-14 17:19:42 +01:00
Mikolaj Pawlikowski	75315a872a	Better wording in the README Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2019-09-06 14:58:10 +01:00
Mikolaj Pawlikowski	433a6b8b88	Add a note about the DNS usage Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2019-09-06 14:51:22 +01:00
Danny Kulchinsky	477ba69a72	Add livenessProbe and readinessProbe to README Signed-off-by: Danny Kulchinsky <danny.kul@gmail.com> Signed-off-by: Danny Kulchinsky <dannyk@tuenti.com>	2019-03-17 21:01:15 -04:00
Mikolaj Pawlikowski	c006eede86	Merge branch 'master' into stn/rendezvous-hashing	2019-03-13 17:11:23 +00:00
stuart nelson	771f303062	Add rendezvous hash for selecting subset of nodes Select a user-defined number of pods via rendezvous hash. This is important for larger clusters, where the metric cardinality explosion is too much for a single prometheus to handle. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2019-03-13 15:30:18 +01:00
Mikolaj Pawlikowski	b7c1d2dfb4	add extra info in README, about the `-vendor` image tag Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2019-03-13 14:29:54 +00:00
Mikolaj Pawlikowski	82a0d6ae8c	Merge branch 'master' into docker-push-updates	2019-03-12 23:05:32 +00:00
Mikolaj Pawlikowski	057e360c5b	Update README.md Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2019-03-12 22:59:32 +00:00
Otto Yiu	2c4e069c08	fix README.md to point to right metric for AlertManager example Signed-Off-By: Otto Yiu <otto@live.ca>	2019-01-31 09:23:31 -08:00
Mikolaj Pawlikowski	c6b22741d7	Update README.md Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-21 10:18:23 +01:00
Mikolaj Pawlikowski	9b30561ccb	Add a note about sudo usage for beginners Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-20 21:34:14 +01:00
Mikolaj Pawlikowski	97808d9365	Update README.md Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-20 18:51:11 +01:00
0xflotus	cbee29b97e	did you mean 'compiling'? Signed-off-by: 0xflotus <0xflotus@gmail.com>	2018-12-20 11:59:20 +01:00
Mikolaj Pawlikowski	8f738e70a7	fix the reference to the Dockerfile in the buld folder Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-19 12:05:10 +00:00
Mikolaj Pawlikowski	6640f0d5c8	more typos Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-19 12:00:14 +00:00
Mikolaj Pawlikowski	53c1b0e78e	add link to michiel's profile in readme Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-19 11:58:33 +00:00
Mikolaj Pawlikowski	28cc2894eb	added the menu, fixed few typos Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-19 11:57:09 +00:00
Mikolaj Pawlikowski	bdcf179d5f	rename the base Dockerfile to Dockerfile-simple Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>	2018-12-19 11:55:13 +00:00
Mikolaj Pawlikowski	0ddc7f0fe6	Merge branch 'master' into update-readme	2018-12-18 15:53:16 +00:00

58 Commits