107 Commits

Author SHA1 Message Date
Cooper Ry Lees
145d2bf000 Rename PathLength to HopCount in swagger model and UI
Rename the swagger field from path-length to hop-count so the
generated Go struct field (PathLength → HopCount) and JSON key
(path-length → hop-count) align with the Prometheus metric rename
to goldpinger_peers_hop_count from the previous commit.

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:45:31 +00:00
Cooper Ry Lees
641b658f23 Address PR #164 review feedback
Concurrent HTTP + UDP pings:
  HTTP ping and UDP probe now run in separate goroutines via
  sync.WaitGroup, so UDP timeout doesn't add to the ping cycle
  latency. (skamboj on pinger.go:124)
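
  A minimal sketch of the pattern described above, not the actual pinger.go
  code: each probe runs in its own goroutine and a sync.WaitGroup waits for
  both, so the cycle takes roughly as long as the slower probe. The names
  probeResult and runConcurrently, and the sleep-based stand-ins, are
  hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// probeResult holds the outcome of one probe. The type and its fields are
// hypothetical and exist only for this sketch.
type probeResult struct {
	name    string
	elapsed time.Duration
}

// runConcurrently starts every probe in its own goroutine and waits for all
// of them, so the wall time is close to the slowest probe rather than the
// sum of all probes.
func runConcurrently(probes map[string]func()) []probeResult {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []probeResult
	)
	for name, probe := range probes {
		wg.Add(1)
		go func(name string, probe func()) {
			defer wg.Done()
			start := time.Now()
			probe()
			mu.Lock() // results is shared between goroutines
			results = append(results, probeResult{name: name, elapsed: time.Since(start)})
			mu.Unlock()
		}(name, probe)
	}
	wg.Wait()
	return results
}

func main() {
	start := time.Now()
	results := runConcurrently(map[string]func(){
		"http": func() { time.Sleep(20 * time.Millisecond) }, // stand-in for the HTTP ping
		"udp":  func() { time.Sleep(50 * time.Millisecond) }, // stand-in for a slow UDP probe
	})
	fmt.Printf("%d probes finished in about %v\n",
		len(results), time.Since(start).Round(10*time.Millisecond))
}
```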

Remove duplicate log:
  Removed the "UDP echo listener started" log from main.go since
  StartUDPListener already logs it. (skamboj on main.go:191)

Prometheus base units (seconds):
  Renamed goldpinger_peers_udp_rtt_ms back to goldpinger_peers_udp_rtt_s
  with sub-millisecond histogram buckets (.0001s to 1s), per Prometheus
  naming conventions. RTT is computed in seconds internally and only
  converted to ms for the JSON API. (skamboj on stats.go:150)

Rename path_length to hop_count:
  goldpinger_peers_path_length → goldpinger_peers_hop_count, and
  SetPeerPathLength → SetPeerHopCount. (skamboj on stats.go:139)

UDP buffer constant and packet size clamping:
  Added udpMaxPacketSize=1500 constant, documented as standard Ethernet
  MTU — the largest UDP payload that survives most networks without
  fragmentation. Used for both listener and prober receive buffers.
  ProbeUDP now clamps UDP_PACKET_SIZE to udpMaxPacketSize to prevent
  silent truncation if someone configures a size > MTU.
  (skamboj on udp_probe.go:54)
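
  The clamping can be sketched as follows. Only the constant name and its
  value come from this commit message; the helper clampPacketSize is
  hypothetical and may not match how ProbeUDP does it.

```go
package main

import "fmt"

// udpMaxPacketSize mirrors the constant named in the commit message.
const udpMaxPacketSize = 1500

// clampPacketSize is a hypothetical helper that caps the configured packet
// size at udpMaxPacketSize, so the receive buffer is never smaller than a
// packet that was sent.
func clampPacketSize(requested int) int {
	if requested > udpMaxPacketSize {
		return udpMaxPacketSize
	}
	return requested
}

func main() {
	for _, size := range []int{64, 1500, 9000} {
		fmt.Printf("requested=%d used=%d\n", size, clampPacketSize(size))
	}
}
```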

Guard count=0:
  ProbeUDP returns an error immediately if count <= 0 instead of
  dividing by zero. (skamboj on udp_probe.go:176)
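
  A sketch of the guard, assuming a standalone helper (lossPercent is a
  hypothetical name, not a function from udp_probe.go): the count is checked
  before it is used as a divisor.

```go
package main

import (
	"errors"
	"fmt"
)

// lossPercent is a hypothetical helper. It rejects a non-positive packet
// count up front instead of using it as a divisor.
func lossPercent(sent, received int) (float64, error) {
	if sent <= 0 {
		return 0, errors.New("packet count must be greater than zero")
	}
	return float64(sent-received) / float64(sent) * 100, nil
}

func main() {
	if _, err := lossPercent(0, 0); err != nil {
		fmt.Println("count=0 rejected:", err)
	}
	pct, _ := lossPercent(10, 5)
	fmt.Printf("10 sent, 5 received: %.1f%% loss\n", pct)
}
```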

UDP error counter:
  Added goldpinger_udp_errors_total counter (labels: goldpinger_instance,
  host). CountUDPError is called on dial failures and send errors.
  (skamboj on udp_probe.go:115)

Test: random source port for full loss:
  TestProbeUDP_FullLoss now binds an ephemeral port and closes it,
  instead of assuming port 19999 is free. (skamboj on udp_probe_test.go:56)
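
  One way to get such a port, sketched with the standard net package. The
  helper name unusedUDPAddr is an assumption, and the test in this commit may
  be structured differently.

```go
package main

import (
	"fmt"
	"net"
)

// unusedUDPAddr binds an ephemeral UDP port on loopback, records the address,
// and closes the socket again. Nothing listens on the returned address
// afterwards, although another process could in principle reuse the port.
func unusedUDPAddr() (*net.UDPAddr, error) {
	// Port 0 asks the kernel to pick a free port.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0})
	if err != nil {
		return nil, err
	}
	addr := conn.LocalAddr().(*net.UDPAddr)
	if err := conn.Close(); err != nil {
		return nil, err
	}
	return addr, nil
}

func main() {
	addr, err := unusedUDPAddr()
	if err != nil {
		fmt.Println("could not reserve a port:", err)
		return
	}
	fmt.Println("probe target with no listener:", addr)
}
```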

Test: partial loss validation:
  New TestProbeUDP_PartialLoss uses a lossy echo listener that drops
  every Nth packet to validate loss calculations are exact:
    drop every 2nd → 50.0%, every 3rd → 33.3%,
    every 5th → 20.0%, every 10th → 10.0%
  (skamboj on udp_probe_test.go:96)
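
  The expected percentages follow from simple counting, sketched below
  without any networking. The packet count of 30 is an assumption chosen so
  that every case divides evenly; the count used by the real test is not
  stated here. simulateLoss is a hypothetical helper.

```go
package main

import "fmt"

// simulateLoss counts how many of `sent` sequenced packets get a reply when
// the echo side drops every dropEvery-th packet, and returns the resulting
// loss percentage. It returns 0 for non-positive arguments.
func simulateLoss(sent, dropEvery int) float64 {
	if sent <= 0 || dropEvery <= 0 {
		return 0
	}
	received := 0
	for seq := 1; seq <= sent; seq++ {
		if seq%dropEvery == 0 {
			continue // dropped by the lossy listener
		}
		received++
	}
	return float64(sent-received) / float64(sent) * 100
}

func main() {
	const sent = 30 // assumed packet count
	for _, n := range []int{2, 3, 5, 10} {
		fmt.Printf("drop every %d of %d packets: %.1f%% loss\n", n, sent, simulateLoss(sent, n))
	}
}
```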

Test: zero count:
  New TestProbeUDP_ZeroCount verifies error is returned for count=0.

Test results:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:88: avg UDP RTT: 0.0816 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PartialLoss
=== RUN   TestProbeUDP_PartialLoss/drop_every_2nd_(50%)
    udp_probe_test.go:134: loss: 50.0% (expected 50.0%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%)
    udp_probe_test.go:134: loss: 33.3% (expected 33.3%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_5th_(20%)
    udp_probe_test.go:134: loss: 20.0% (expected 20.0%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_10th_(10%)
    udp_probe_test.go:134: loss: 10.0% (expected 10.0%)
--- PASS: TestProbeUDP_PartialLoss (8.00s)
=== RUN   TestProbeUDP_ZeroCount
--- PASS: TestProbeUDP_ZeroCount (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:37:52 +00:00
Cooper Ry Lees
832bc7b598 Add UDP probe metrics: packet loss, hop count, and RTT
Add an opt-in UDP echo probe that runs alongside the existing HTTP
ping. Each goldpinger pod listens on a configurable UDP port (default
6969). During each ping cycle, the prober sends N sequenced packets
to the peer's listener, which echoes them back. From the replies we
compute packet loss percentage, path hop count (from IPv4 TTL / IPv6
HopLimit), and average round-trip time.
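
A sketch of how a hop count can be estimated from a received TTL or
HopLimit. This is a common heuristic and an assumption for illustration, not
necessarily what this commit implements: guess the sender's initial value as
the smallest of 64, 128, or 255 that is not below the received value, then
subtract.

```go
package main

import "fmt"

// estimateHops guesses the number of hops from the TTL (IPv4) or HopLimit
// (IPv6) seen on a received packet. It assumes the sender started from one
// of the common initial values 64, 128, or 255.
func estimateHops(receivedTTL int) int {
	for _, initial := range []int{64, 128, 255} {
		if receivedTTL <= initial {
			return initial - receivedTTL
		}
	}
	return 0 // value above 255 is invalid; report no estimate
}

func main() {
	for _, ttl := range []int{64, 61, 120, 250} {
		fmt.Printf("received TTL %d: about %d hops\n", ttl, estimateHops(ttl))
	}
}
```

Under this heuristic, a reply that arrives with an unchanged TTL of 64 gives
an estimate of 0 hops.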

New Prometheus metrics:
  - goldpinger_peers_loss_pct      (gauge)     — per-peer UDP loss %
  - goldpinger_peers_path_length   (gauge)     — estimated hop count
  - goldpinger_peers_udp_rtt_ms    (histogram) — UDP RTT in milliseconds

The graph UI shows yellow edges for links with partial loss, and
displays sub-millisecond UDP RTT instead of HTTP latency when UDP
is enabled. Stale metric labels are cleaned up when a pinger is
destroyed so rolled pods don't leave ghost entries.

Configuration (all via env vars, disabled by default):
  UDP_ENABLED=true      enable UDP probing and listener
  UDP_PORT=6969         listener port
  UDP_PACKET_COUNT=10   packets per probe
  UDP_PACKET_SIZE=64    bytes per packet
  UDP_TIMEOUT=1s        probe timeout

New files:
  pkg/goldpinger/udp_probe.go       — echo listener + probe client
  pkg/goldpinger/udp_probe_test.go  — unit tests

Unit tests:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
  goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
  goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0

Simulated 50% loss via ip6tables DROP in pod netns on node-0:
  goldpinger_peers_loss_pct{instance="server",...} 60
  goldpinger_peers_loss_pct{instance="node-1",...} 30
  goldpinger_peers_loss_pct{instance="server2",...} 30

UDP RTT vs HTTP RTT (check_all API):
  node-0 -> server:  udp=2.18ms  http=2ms
  node-2 -> node-2:  udp=0.40ms  http=1ms
  server -> node-0:  udp=0.55ms  http=2ms

Post-rollout stale metrics cleanup verified:
  All 36 edges show 0% loss, no stale pod IPs.
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
2026-03-27 16:05:32 +00:00
j4ckstraw
acac9dee8b use protobuf and add resourceVersion in listOption
1. Communicate with kube-apiserver using protobuf.
2. Set resourceVersion=0 in the list options. Without a resourceVersion,
a list request forces kube-apiserver to retrieve the data from etcd.

In a Kubernetes cluster with 100+ nodes and 7500+ pods, this patch
reduces kube-apiserver CPU utilization by 5-10%.

Signed-off-by: j4ckstraw <j4ckstraw@foxmail.com>
2023-06-17 18:42:37 +08:00
Maxime Leroy
a913318ae3 feat: support advanced zap configuration
Signed-off-by: Maxime Leroy <19607336+maxime1907@users.noreply.github.com>
2022-10-25 12:30:35 +02:00
tgetachew
14ea96999a add external probes
Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>

make timeout flags backwards compatible

Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>
2022-05-08 22:02:09 -04:00
Tyler Lloyd
05ab610f10 cleanup IPv6 references
only check node IPs when determining hostIP

IP_VERSIONS not IP_FAMILIES

Signed-off-by: Tyler Lloyd <tyler.lloyd@microsoft.com>
2022-05-08 22:02:03 -04:00
wanglijie6
72832bcbc4 Fix pinger being removed as not found
The heatmap broke on every refreshPeriod because pingers were
deleted when the exists check failed.

updatePingers checks on every refreshPeriod whether each pod still
exists or is new, and updates the pingers accordingly.

The exists function failed to detect that a pod exists, so fix it.

Signed-off-by: wanglijie6 <wanglijie6@xiaomi.com>
2022-04-15 19:14:53 +08:00
wanglijie6
7609a3ab3f Add --display-nodename option to control UI display
Add GoldpingerConfig.DisplayNodeName to control what the UI displays.
The default is `false`, which displays the podName.

Signed-off-by: wanglijie6 <wanglijie6@xiaomi.com>
2022-04-13 10:54:59 +08:00
wanglijie6
588c1a0173 Show hostName rather than podName
Signed-off-by: wanglijie6 <wanglijie6@xiaomi.com>
2022-04-12 20:06:48 +08:00
Mikolaj Pawlikowski
842dfadeea Merge branch 'master' into list-running-pods
2021-11-08 15:11:18 +00:00
Tyler Lloyd
34b78537c9 changed return type to IPFamily
Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>
2021-11-08 09:38:38 -05:00
Tyler Lloyd
cfd26c8d26 changing to IP_VERSIONS
Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>
2021-11-05 10:40:37 -04:00
Evan Baker
980a85b04d only list running pods
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
2021-11-03 18:06:29 -05:00
Tyler Lloyd
b88d0f3ec5 don't use Sprintf for creating host address
Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>
2021-11-03 16:58:38 -04:00
Tyler Lloyd
03dd6706b8 add node IP cache and change to get node
Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>
2021-11-03 16:58:38 -04:00
Tyler Lloyd
5d2070fad1 get pod and host IPv6 IPs when USE_IPV6 is set
Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>
2021-11-03 16:58:38 -04:00
Mikolaj Pawlikowski
ed40304dd8 1 when healthy, 0 when unhealthy (bool, not return code)
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-19 14:45:58 +00:00
Mikolaj Pawlikowski
e6aa196232 Compare the actual host ips
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-19 14:26:07 +00:00
Mikolaj Pawlikowski
948b67a09b Break here, continue there
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-19 13:46:37 +00:00
Mikolaj Pawlikowski
d0e2e25ad2 Make it a bit more obvious
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-19 13:43:40 +00:00
Mikolaj Pawlikowski
8d5262d316 Make the naming a little less bad, remove the break statements
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-19 13:25:39 +00:00
Mikolaj Pawlikowski
407d201591 Add an overall metric goldpinger_cluster_health_total (pings + DNS check)
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-19 12:31:49 +00:00
Mikolaj Pawlikowski
1f5589db8c Simplify
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-16 17:24:56 +00:00
Mikolaj Pawlikowski
bc94f4e058 Forgot to set the default to true, if nothing bad happens
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-16 17:21:46 +00:00
Mikolaj Pawlikowski
634e04ec44 Handle the situation when one of the nodes returns an error
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 17:51:12 +00:00
Mikolaj Pawlikowski
13ae09d93e Check that all nodes return the expected nodes that Kubernetes returns
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 17:16:58 +00:00
Mikolaj Pawlikowski
52ff43ec7d Make it return 418 on cluster health problem
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 17:06:57 +00:00
Mikolaj Pawlikowski
bfc4603e45 Always return the DurationNs
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:38:32 +00:00
Mikolaj Pawlikowski
ad828cf5a3 And implement it
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:31:39 +00:00
Mikolaj Pawlikowski
5d2ad6ce19 Also add a total number of nodes in a field for convenience
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:31:01 +00:00
Mikolaj Pawlikowski
1310f9b12b Add the generated at field
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:27:19 +00:00
Mikolaj Pawlikowski
f93526c58f Lint
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:27:07 +00:00
Mikolaj Pawlikowski
1f2f00ba35 First draft of implementing it
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:24:58 +00:00
Mikolaj Pawlikowski
807f193b07 Regenerate
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 13:06:26 +00:00
Mikolaj Pawlikowski
e827a8dc67 Regenerate the code
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 12:57:20 +00:00
Mikolaj Pawlikowski
3edecea467 Implement a stub of the new endpoint
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 12:24:42 +00:00
Mikolaj Pawlikowski
3ff592b1e8 Re-generate using the latest swagger gen cli
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2021-03-12 11:30:53 +00:00
Seth Pellegrino
07ef524aed feat: configurable namespace for pod discovery
Adds a configuration option to allow for cross-namespace pings.

Signed-off-by: Seth Pellegrino <seth@verica.io>
2020-11-23 14:01:25 -08:00
Sachin Kamboj
7e60ee675a Simplify updateCounters, don't try to maintain a running count
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-08 07:56:04 -04:00
Sachin Kamboj
d68d35bbab Add a ping time that gives the last time a node was pinged
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 21:04:43 -04:00
Sachin Kamboj
2a78a9cec5 Don't ping all pods on call, return existing data
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 20:49:01 -04:00
Sachin Kamboj
9db241d67d Update the set of pingers at regular intervals from the k8s API server
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 19:54:19 -04:00
Sachin Kamboj
40f57b1a4e Get rid of the lock and keep a running count of healthy/unhealthy nodes
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 08:45:14 -04:00
Sachin Kamboj
8a40aee927 Have the updater continuously ping pods and collate the results
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 08:02:36 -04:00
Sachin Kamboj
0690ac21a2 Command line options for adding a jitter-factor
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 07:55:48 -04:00
Sachin Kamboj
d0dfd3e493 Add code to continuously ping the pods and send the results over a channel
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-07 07:54:10 -04:00
Sachin Kamboj
9ae3e78035 Better structured logging
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-06 22:39:57 -04:00
Sachin Kamboj
2e1c799a25 Replace log statements with zap
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-06 22:18:04 -04:00
Sachin Kamboj
1c6362b2a9 Add a context to the ping results
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-06 19:43:22 -04:00