58 Commits

Author SHA1 Message Date
Cooper Ry Lees
943609d9a5 Add UDP sequence number tracking for duplicate and out-of-order detection
Track received sequence numbers in a map during each UDP probe to
detect duplicate replies (same seq seen twice) and out-of-order
delivery (seq lower than the highest previously seen).

The receive loop now uses a recvState struct with a processPacket
method that validates magic, checks for duplicate seq numbers, and
tracks ordering. Duplicates are not counted toward the received
total, and the loop allows up to 2*count iterations to handle them
without prematurely timing out.
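
The tracking described above can be sketched roughly as follows. This is a minimal illustration, not the actual goldpinger code: the field names, the `processSeq` method, and the use of `uint32` sequence numbers are assumptions, and magic validation is omitted.

```go
package main

import "fmt"

// recvState is a hypothetical per-probe receive state; the field names
// are assumptions made for this sketch.
type recvState struct {
	seen       map[uint32]bool // sequence numbers already received
	highest    uint32          // highest sequence number accepted so far
	haveAny    bool            // true once at least one reply was accepted
	received   int             // unique replies only
	duplicates int
	outOfOrder int
}

func newRecvState() *recvState {
	return &recvState{seen: make(map[uint32]bool)}
}

// processSeq records one reply. A repeated seq counts as a duplicate and
// does not increase received; a new seq lower than the highest one seen
// so far counts as out-of-order.
func (s *recvState) processSeq(seq uint32) {
	if s.seen[seq] {
		s.duplicates++
		return
	}
	s.seen[seq] = true
	s.received++
	if s.haveAny && seq < s.highest {
		s.outOfOrder++
		return
	}
	s.highest = seq
	s.haveAny = true
}

func main() {
	s := newRecvState()
	for _, seq := range []uint32{0, 1, 1, 3, 2, 4} {
		s.processSeq(seq)
	}
	// prints "received=5 duplicates=1 outOfOrder=1"
	fmt.Printf("received=%d duplicates=%d outOfOrder=%d\n",
		s.received, s.duplicates, s.outOfOrder)
}
```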

New Prometheus counters:
  - goldpinger_udp_duplicates_total    — duplicate reply packets
  - goldpinger_udp_out_of_order_total  — out-of-order reply packets

Both are cumulative counters with labels (goldpinger_instance,
host_ip, pod_ip), incremented per-probe by the number of events
detected. Non-zero values indicate network-level packet duplication
or path asymmetry worth investigating.

New tests:
  - TestProbeUDP_Duplicates: echo listener sends every packet twice,
    verifies duplicates are detected and don't inflate received count
  - TestProbeUDP_OutOfOrder: echo listener buffers pairs and returns
    them in reverse order, verifies out-of-order is detected

Test results:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:158: avg UDP RTT: 0.1011 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PartialLoss
=== RUN   TestProbeUDP_PartialLoss/drop_every_2nd_(50%)
    udp_probe_test.go:204: loss: 50.0% (expected 50.0%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%)
    udp_probe_test.go:204: loss: 33.3% (expected 33.3%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_5th_(20%)
    udp_probe_test.go:204: loss: 20.0% (expected 20.0%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_10th_(10%)
    udp_probe_test.go:204: loss: 10.0% (expected 10.0%)
--- PASS: TestProbeUDP_PartialLoss (8.01s)
=== RUN   TestProbeUDP_ZeroCount
--- PASS: TestProbeUDP_ZeroCount (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestProbeUDP_Duplicates
    udp_probe_test.go:246: duplicates detected: 4
--- PASS: TestProbeUDP_Duplicates (0.00s)
=== RUN   TestProbeUDP_OutOfOrder
    udp_probe_test.go:263: out-of-order detected: 5, duplicates: 0
--- PASS: TestProbeUDP_OutOfOrder (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:24:45 +00:00
Cooper Ry Lees
641b658f23 Address PR #164 review feedback
Concurrent HTTP + UDP pings:
  HTTP ping and UDP probe now run in separate goroutines via
  sync.WaitGroup, so UDP timeout doesn't add to the ping cycle
  latency. (skamboj on pinger.go:124)
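
A minimal sketch of this pattern, assuming two hypothetical stand-in functions (`httpPing` and `udpProbe`) in place of the real calls in pinger.go. Each goroutine writes to its own variable, and `wg.Wait()` ensures both writes are visible before the results are read.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// httpPing and udpProbe are hypothetical stand-ins for the real probes.
func httpPing() time.Duration {
	time.Sleep(20 * time.Millisecond)
	return 20 * time.Millisecond
}

func udpProbe() float64 {
	time.Sleep(50 * time.Millisecond) // e.g. waiting out a UDP timeout
	return 10.0                       // loss percentage
}

// pingOnce runs both probes in parallel, so the cycle takes roughly as
// long as the slower probe rather than the sum of the two.
func pingOnce() (httpLatency time.Duration, lossPct float64) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		httpLatency = httpPing()
	}()
	go func() {
		defer wg.Done()
		lossPct = udpProbe()
	}()
	wg.Wait()
	return httpLatency, lossPct
}

func main() {
	start := time.Now()
	latency, loss := pingOnce()
	fmt.Printf("http=%v loss=%.1f%% elapsed=%v\n", latency, loss, time.Since(start))
}
```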

Remove duplicate log:
  Removed the "UDP echo listener started" log from main.go since
  StartUDPListener already logs it. (skamboj on main.go:191)

Prometheus base units (seconds):
  Renamed goldpinger_peers_udp_rtt_ms back to goldpinger_peers_udp_rtt_s
  with sub-millisecond histogram buckets (.0001s to 1s), per Prometheus
  naming conventions. RTT is computed in seconds internally and only
  converted to ms for the JSON API. (skamboj on stats.go:150)

Rename path_length to hop_count:
  goldpinger_peers_path_length → goldpinger_peers_hop_count, and
  SetPeerPathLength → SetPeerHopCount. (skamboj on stats.go:139)

UDP buffer constant and packet size clamping:
  Added udpMaxPacketSize=1500 constant, documented as standard Ethernet
  MTU — the largest UDP payload that survives most networks without
  fragmentation. Used for both listener and prober receive buffers.
  ProbeUDP now clamps UDP_PACKET_SIZE to udpMaxPacketSize to prevent
  silent truncation if someone configures a size > MTU.
  (skamboj on udp_probe.go:54)

Guard count=0:
  ProbeUDP returns an error immediately if count <= 0 instead of
  dividing by zero. (skamboj on udp_probe.go:176)
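
The two guards above (size clamping and the count check) might look like the sketch below. The helper name `validateProbeParams` and the error text are hypothetical; only `udpMaxPacketSize` and its value come from the commit message. Note that 1500 is the Ethernet MTU for the whole IP packet, so a payload of exactly that size can still be fragmented once IP and UDP headers are added.

```go
package main

import (
	"errors"
	"fmt"
)

// udpMaxPacketSize mirrors the constant named in the commit message.
const udpMaxPacketSize = 1500

// validateProbeParams is a hypothetical helper: it rejects a non-positive
// packet count (which would otherwise lead to a division by zero when
// computing loss) and clamps the packet size to udpMaxPacketSize.
func validateProbeParams(count, size int) (int, error) {
	if count <= 0 {
		return 0, errors.New("udp probe: packet count must be greater than zero")
	}
	if size > udpMaxPacketSize {
		size = udpMaxPacketSize
	}
	return size, nil
}

func main() {
	size, err := validateProbeParams(10, 9000)
	fmt.Println(size, err) // prints "1500 <nil>"

	_, err = validateProbeParams(0, 64)
	fmt.Println(err)
}
```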

UDP error counter:
  Added goldpinger_udp_errors_total counter (labels: goldpinger_instance,
  host). CountUDPError is called on dial failures and send errors.
  (skamboj on udp_probe.go:115)

Test: random source port for full loss:
  TestProbeUDP_FullLoss now binds an ephemeral port and closes it,
  instead of assuming port 19999 is free. (skamboj on udp_probe_test.go:56)

Test: partial loss validation:
  New TestProbeUDP_PartialLoss uses a lossy echo listener that drops
  every Nth packet to validate loss calculations are exact:
    drop every 2nd → 50.0%, every 3rd → 33.3%,
    every 5th → 20.0%, every 10th → 10.0%
  (skamboj on udp_probe_test.go:96)
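
The arithmetic behind these expected values can be checked with a small simulation. This is a sketch only: the packet count of 30 is an assumption chosen so that all four drop patterns divide evenly (the commit does not state the count the test uses), and `simulateReceived` and `lossPct` are hypothetical names.

```go
package main

import "fmt"

// simulateReceived counts how many of sent packets survive a listener
// that drops every dropEvery-th packet (1-based).
func simulateReceived(sent, dropEvery int) int {
	received := 0
	for i := 1; i <= sent; i++ {
		if i%dropEvery == 0 {
			continue // dropped
		}
		received++
	}
	return received
}

// lossPct returns the percentage of packets that were not received.
// The caller must ensure sent > 0.
func lossPct(sent, received int) float64 {
	return float64(sent-received) / float64(sent) * 100
}

func main() {
	const sent = 30 // assumed count, see note above
	for _, n := range []int{2, 3, 5, 10} {
		got := simulateReceived(sent, n)
		fmt.Printf("drop every %d: received %d/%d, loss %.1f%%\n",
			n, got, sent, lossPct(sent, got))
	}
}
```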

Test: zero count:
  New TestProbeUDP_ZeroCount verifies error is returned for count=0.

Test results:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:88: avg UDP RTT: 0.0816 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PartialLoss
=== RUN   TestProbeUDP_PartialLoss/drop_every_2nd_(50%)
    udp_probe_test.go:134: loss: 50.0% (expected 50.0%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_3rd_(33.3%)
    udp_probe_test.go:134: loss: 33.3% (expected 33.3%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_5th_(20%)
    udp_probe_test.go:134: loss: 20.0% (expected 20.0%)
=== RUN   TestProbeUDP_PartialLoss/drop_every_10th_(10%)
    udp_probe_test.go:134: loss: 10.0% (expected 10.0%)
--- PASS: TestProbeUDP_PartialLoss (8.00s)
=== RUN   TestProbeUDP_ZeroCount
--- PASS: TestProbeUDP_ZeroCount (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:37:52 +00:00
Cooper Ry Lees
832bc7b598 Add UDP probe metrics: packet loss, hop count, and RTT
Add an opt-in UDP echo probe that runs alongside the existing HTTP
ping. Each goldpinger pod listens on a configurable UDP port (default
6969). During each ping cycle, the prober sends N sequenced packets
to the peer's listener, which echoes them back. From the replies we
compute packet loss percentage, path hop count (from IPv4 TTL / IPv6
HopLimit), and average round-trip time.
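
One common way to estimate hop count from a received TTL or HopLimit is sketched below. The commit does not say how the real `estimateHops` works, so this heuristic is an assumption: it guesses that the sender started from the smallest common initial value (64, 128, or 255) that is not below the observed one, and reports the difference.

```go
package main

import "fmt"

// estimateHops guesses the number of hops from an observed TTL/HopLimit.
// Assumption: the sender used one of the common initial values 64, 128,
// or 255. Returns -1 when the observed value is out of range.
func estimateHops(observed int) int {
	if observed < 1 || observed > 255 {
		return -1
	}
	for _, initial := range []int{64, 128, 255} {
		if observed <= initial {
			return initial - observed
		}
	}
	return -1 // unreachable for values in 1..255
}

func main() {
	for _, ttl := range []int{64, 60, 120, 250} {
		fmt.Printf("ttl=%d -> hops=%d\n", ttl, estimateHops(ttl))
	}
}
```

Under this heuristic a reply arriving with an unchanged value of 64 yields 0 hops, which is consistent with the 0 reported for `goldpinger_peers_path_length` in the healthy-cluster output below.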

New Prometheus metrics:
  - goldpinger_peers_loss_pct      (gauge)     — per-peer UDP loss %
  - goldpinger_peers_path_length   (gauge)     — estimated hop count
  - goldpinger_peers_udp_rtt_ms    (histogram) — UDP RTT in milliseconds

The graph UI shows yellow edges for links with partial loss, and
displays sub-millisecond UDP RTT instead of HTTP latency when UDP
is enabled. Stale metric labels are cleaned up when a pinger is
destroyed so rolled pods don't leave ghost entries.

Configuration (all via env vars, disabled by default):
  UDP_ENABLED=true      enable UDP probing and listener
  UDP_PORT=6969         listener port
  UDP_PACKET_COUNT=10   packets per probe
  UDP_PACKET_SIZE=64    bytes per packet
  UDP_TIMEOUT=1s        probe timeout
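
As an illustration of how these variables and defaults fit together, here is a minimal sketch that reads them with the standard library. The `udpConfig` struct, `loadUDPConfig`, and the fall-back-to-default behaviour on unparsable values are assumptions for this example; goldpinger's actual option parsing may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// udpConfig is a hypothetical container for the settings listed above.
type udpConfig struct {
	Enabled     bool
	Port        int
	PacketCount int
	PacketSize  int
	Timeout     time.Duration
}

// lookupFunc matches the signature of os.LookupEnv so tests can inject
// their own environment.
type lookupFunc func(key string) (string, bool)

// intOr returns the integer value of key, or def if unset or invalid.
func intOr(lookup lookupFunc, key string, def int) int {
	if v, ok := lookup(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// loadUDPConfig applies the documented defaults and overrides them with
// any values found through lookup.
func loadUDPConfig(lookup lookupFunc) udpConfig {
	cfg := udpConfig{
		Enabled:     false, // disabled by default
		Port:        intOr(lookup, "UDP_PORT", 6969),
		PacketCount: intOr(lookup, "UDP_PACKET_COUNT", 10),
		PacketSize:  intOr(lookup, "UDP_PACKET_SIZE", 64),
		Timeout:     time.Second,
	}
	if v, ok := lookup("UDP_ENABLED"); ok {
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.Enabled = b
		}
	}
	if v, ok := lookup("UDP_TIMEOUT"); ok {
		if d, err := time.ParseDuration(v); err == nil {
			cfg.Timeout = d
		}
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", loadUDPConfig(os.LookupEnv))
}
```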

New files:
  pkg/goldpinger/udp_probe.go       — echo listener + probe client
  pkg/goldpinger/udp_probe_test.go  — unit tests

Unit tests:
```
=== RUN   TestProbeUDP_NoLoss
    udp_probe_test.go:51: avg UDP RTT: 0.0823 ms
--- PASS: TestProbeUDP_NoLoss (0.00s)
=== RUN   TestProbeUDP_FullLoss
--- PASS: TestProbeUDP_FullLoss (0.00s)
=== RUN   TestProbeUDP_PacketFormat
--- PASS: TestProbeUDP_PacketFormat (0.00s)
=== RUN   TestEstimateHops
--- PASS: TestEstimateHops (0.00s)
PASS
```

Cluster test (6-node IPv6 k8s, UDP_ENABLED=true):
```
Prometheus metrics (healthy cluster, 0% loss):
  goldpinger_peers_loss_pct{...,pod_ip="fd00:4:69:3::3746"} 0
  goldpinger_peers_path_length{...,pod_ip="fd00:4:69:3::3746"} 0

Simulated 50% loss via ip6tables DROP in pod netns on node-0:
  goldpinger_peers_loss_pct{instance="server",...} 60
  goldpinger_peers_loss_pct{instance="node-1",...} 30
  goldpinger_peers_loss_pct{instance="server2",...} 30

UDP RTT vs HTTP RTT (check_all API):
  node-0 -> server:  udp=2.18ms  http=2ms
  node-2 -> node-2:  udp=0.40ms  http=1ms
  server -> node-0:  udp=0.55ms  http=2ms

Post-rollout stale metrics cleanup verified:
  All 36 edges show 0% loss, no stale pod IPs.
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cooper Ry Lees <me@cooperlees.com>
2026-03-27 16:05:32 +00:00
leundai
ba779f50e7 feat: Add deepwiki badge
Small enhancement to improve quick onboarding for the curious

Signed-off-by: leundai <leogalindofrias@gmail.com>
2025-07-12 14:48:02 -04:00
skamboj
dbd1f5f295 Merge branch 'master' into add-helm-chart 2024-05-13 15:41:59 -04:00
ABC Taylor
562df92c3a Add an explicit default namespace to the ServiceAccount definition, to catch the case where users find-replace default with another namespace but don't change it for the ServiceAccount
Signed-Off-By: ABC Taylor <abc@abctaylor.com>
2024-04-11 08:37:09 +01:00
Derek Brown
4af6666853 feat: add helm chart
Signed-off-by: Derek Brown <derektbrown@users.noreply.github.com>
2023-09-25 15:51:14 -07:00
Will Daly
1f3ad0acc9 Remove deprecated rbac.authorization.k8s.io/v1beta1
This commit updates the README and examples to use
rbac.authorization.k8s.io/v1 instead, which has been available
since K8s 1.8

rbac.authorization.k8s.io/v1beta1 was deprecated in K8s 1.17
and removed in K8s 1.22.

Reference:
https://kubernetes.io/docs/reference/using-api/deprecation-guide/#rbac-resources-v122

Signed-off-by: Will Daly <widaly@microsoft.com>
2023-05-03 11:29:42 -07:00
tgetachew
14ea96999a add external probes
Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>

make timeout flags backwards compatible

Signed-off-by: kitfoman <thaddeusgetachew@gmail.com>
2022-05-08 22:02:09 -04:00
Tyler Lloyd
05ab610f10 cleanup IPv6 references
only check node IPs when determining hostIP

IP_VERSIONS not IP_FAMILIES

Signed-off-by: Tyler Lloyd <tyler.lloyd@microsoft.com>
2022-05-08 22:02:03 -04:00
Mikolaj Pawlikowski
76f054ba50 Fix the build badge
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2022-02-04 10:51:43 +00:00
Mike Tougeron
6aee150cd0 Multi-arch builds for goldpinger
Signed-off-by: Mike Tougeron <tougeron@adobe.com>
2022-01-15 17:12:42 -08:00
Tyler Lloyd
5b080c7087 update readme for IPv6 example
Signed-off-by: Tyler Lloyd <Tyler.Lloyd@microsoft.com>
2021-11-03 16:58:38 -04:00
Johannes M. Scheuermann
97ec159852 Correct example in the readme
Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
2020-07-22 14:00:45 +02:00
Mikolaj Pawlikowski
6e29c16148 Specify the size
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:24:17 +01:00
Mikolaj Pawlikowski
f83c1de387 MOAR hyperlinks
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:22:44 +01:00
Mikolaj Pawlikowski
e24f789b68 Hyperlink all the things
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:22:05 +01:00
Mikolaj Pawlikowski
3a922f4278 Some more polish
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:20:16 +01:00
Mikolaj Pawlikowski
24d74544e0 Well, I clearly don't know my emoticons
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:19:11 +01:00
Mikolaj Pawlikowski
28f7655170 Add a note about Chaos Engineering and the authors
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:17:28 +01:00
Mikolaj Pawlikowski
e9d3f8cd2b Refresh the readme a little bit
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-06-10 22:17:09 +01:00
Mikolaj Pawlikowski
8790d3e7c4 Update README to use v3.0.0 of Goldpinger
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-05-08 11:44:30 +01:00
Mikolaj Pawlikowski
34d84b233c Merge branch 'master' into migrate-to-go-modules 2020-04-02 17:56:00 +01:00
Mikolaj Pawlikowski
6844a8d2b4 Merge branch 'master' into master 2020-04-02 17:36:43 +01:00
Sachin Kamboj
436c1a7243 Update the README to remove references to dep
Signed-off-by: Sachin Kamboj <skamboj1@bloomberg.net>
2020-04-01 21:00:31 -04:00
Sandeep Mendiratta
0870833cf5 minor typo correction in heat map page. Also updated version and Readme with new version
Signed-off-by: Sandeep Mendiratta <smendiratta@yahoo.com>
2020-03-22 16:28:22 -05:00
Mikolaj Pawlikowski
94dc18c9c2 Update the version in README
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2020-03-11 13:05:22 +00:00
Joe Lei
60436daf18 fix prometheus ruler expr
Signed-off-by: Joe Lei <thezero12@hotmail.com>
2020-02-18 10:59:43 +08:00
suleiman abualrob
6315c71a2c Update README.md
On most default installations, Kubernetes master nodes are tainted, so the Goldpinger DaemonSet will not be deployed to them. To make it run on master nodes as well, you have to tolerate the taint.

Signed-off-by: suleimanWA <suleiman-94@hotmail.com>
2019-11-26 20:42:27 +02:00
suleiman abualrob
ec2155878a Update README.md
If you have Prometheus in your environment, adding these annotations lets Prometheus auto-discovery fetch your metrics automatically from service-name:port/metrics

Signed-off-by: suleimanWA <suleiman-94@hotmail.com>
2019-11-26 19:03:14 +02:00
Ángel Barrera Sánchez
e552b236a0 Change documentation example
Signed-off-by: Ángel Barrera Sánchez <angel@sighup.io>
2019-11-14 17:19:42 +01:00
Mikolaj Pawlikowski
75315a872a Better wording in the README
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-09-06 14:58:10 +01:00
Mikolaj Pawlikowski
433a6b8b88 Add a note about the DNS usage
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-09-06 14:51:22 +01:00
Danny Kulchinsky
477ba69a72 Add livenessProbe and readinessProbe to README
Signed-off-by: Danny Kulchinsky <danny.kul@gmail.com>
Signed-off-by: Danny Kulchinsky <dannyk@tuenti.com>
2019-03-17 21:01:15 -04:00
Mikolaj Pawlikowski
c006eede86 Merge branch 'master' into stn/rendezvous-hashing 2019-03-13 17:11:23 +00:00
stuart nelson
771f303062 Add rendezvous hash for selecting subset of nodes
Select a user-defined number of pods via
rendezvous hash. This is important for larger
clusters, where the metric cardinality explosion
is too much for a single prometheus to handle.
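
Rendezvous (highest-random-weight) hashing can be sketched as below: each pod scores every peer with a hash of its own identity combined with the peer's, then keeps the top n. The function names, the FNV-1a hash, and the use of IP strings as identities are assumptions for illustration and may not match goldpinger's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score hashes the (key, node) pair; a zero byte separates the two so
// that ("ab", "c") and ("a", "bc") hash differently.
func score(key, node string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h.Write([]byte{0})
	h.Write([]byte(node))
	return h.Sum64()
}

// pickPeers returns the n peers with the highest score for self.
// The input slice is not modified.
func pickPeers(self string, peers []string, n int) []string {
	sorted := append([]string(nil), peers...)
	sort.Slice(sorted, func(i, j int) bool {
		si, sj := score(self, sorted[i]), score(self, sorted[j])
		if si != sj {
			return si > sj
		}
		return sorted[i] < sorted[j] // deterministic tie-break
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	peers := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"}
	fmt.Println(pickPeers("10.0.0.9", peers, 2))
}
```

The useful property is stability: removing a peer that was not selected leaves the selection unchanged, so metric label sets stay mostly constant as pods come and go.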

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2019-03-13 15:30:18 +01:00
Mikolaj Pawlikowski
b7c1d2dfb4 add extra info in README, about the -vendor image tag
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-03-13 14:29:54 +00:00
Mikolaj Pawlikowski
82a0d6ae8c Merge branch 'master' into docker-push-updates 2019-03-12 23:05:32 +00:00
Mikolaj Pawlikowski
057e360c5b Update README.md
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2019-03-12 22:59:32 +00:00
Otto Yiu
2c4e069c08 fix README.md to point to right metric for AlertManager example
Signed-Off-By: Otto Yiu <otto@live.ca>
2019-01-31 09:23:31 -08:00
Mikolaj Pawlikowski
c6b22741d7 Update README.md
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-21 10:18:23 +01:00
Mikolaj Pawlikowski
9b30561ccb Add a note about sudo usage for beginners
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-20 21:34:14 +01:00
Mikolaj Pawlikowski
97808d9365 Update README.md
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-20 18:51:11 +01:00
0xflotus
cbee29b97e did you mean 'compiling'?
Signed-off-by: 0xflotus <0xflotus@gmail.com>
2018-12-20 11:59:20 +01:00
Mikolaj Pawlikowski
8f738e70a7 fix the reference to the Dockerfile in the build folder
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-19 12:05:10 +00:00
Mikolaj Pawlikowski
6640f0d5c8 more typos
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-19 12:00:14 +00:00
Mikolaj Pawlikowski
53c1b0e78e add link to michiel's profile in readme
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-19 11:58:33 +00:00
Mikolaj Pawlikowski
28cc2894eb added the menu, fixed few typos
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-19 11:57:09 +00:00
Mikolaj Pawlikowski
bdcf179d5f rename the base Dockerfile to Dockerfile-simple
Signed-off-by: Mikolaj Pawlikowski <mikolaj@pawlikowski.pl>
2018-12-19 11:55:13 +00:00
Mikolaj Pawlikowski
0ddc7f0fe6 Merge branch 'master' into update-readme 2018-12-18 15:53:16 +00:00