Bug? Gateway monitoring shows gateway in a group is up when it's down

Started by bringbacklanparties, July 21, 2025, 11:07:49 PM

Hi all,

I think I'm experiencing a bug in the way gateway monitoring is working for me. I'm running:

OPNsense 25.1.11-amd64
FreeBSD 14.2-RELEASE-p4
OpenSSL 3.0.17

There's been recent activity around bug fixes in gateway monitoring, but nothing I've seen reports this specific issue. Failover and failback states for gateways were added as features in May, so I suspect I've hit a bug related to that recent change, since I'm using those features.

Here's what I'm doing:

1. Create two WireGuard VPN peers and instances.
2. Create interfaces and gateways for the instances. Confirm that they're up.
  For both gateways, enable:
  - upstream gateway
  - far gateway
  - disable host route
  - failover states
  - failback states
  Also, set up gateway monitoring for these gateways, using the public IP addresses of the gateway endpoints as the monitor IPs. Establish routes to those public IP addresses so dpinger works correctly (see the route check sketch after this list).
3. Establish firewall and NAT rules. (In my case I'm also passing DNS traffic through the tunnels using Unbound in resolver mode and am using a kill switch for the WAN interface, for DNS and web traffic.)
4. Pass DNS, ICMP, and web traffic through the tunnels one at a time and confirm the connection works as expected.
5. Enable default gateway switching.
6. Establish a gateway group and interface group for the two VPN interfaces, with one Tier 1 and the second Tier 2. Establish firewall rules involving the gateway group.
7. Confirm that with both interfaces active, DNS and web traffic passes through the Tier 1 VPN tunnel.
8. Disable the interface for the Tier 1 instance. DNS and web traffic should pass through the Tier 2 tunnel, but instead the connection hangs for all traffic.
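
For reference, here's roughly how I verify those monitor routes from a shell before trusting dpinger. The addresses are placeholders (203.0.113.10 stands in for a peer's public endpoint, 198.51.100.1 for the WAN gateway); substitute your own:

# Confirm the monitor IP is routed via the WAN, not the tunnel itself:
route -n get 203.0.113.10
# Confirm the endpoint actually answers pings, since dpinger relies on that:
ping -c 3 203.0.113.10
# If the host route is missing, add it manually via the WAN gateway:
route add -host 203.0.113.10 198.51.100.1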

At this point I've observed that in System -> Diagnostics -> Services, the dpinger service for the Tier 1 gateway monitor is green. Restarting that dpinger service causes it to switch to red, and now traffic passes out the Tier 2 gateway correctly.
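
(If you'd rather check this from a shell than the GUI: each gateway monitor is its own dpinger process, and the gateway name and monitor address appear in its arguments, so this shows which monitors are actually running:

ps ax -o command | grep '[d]pinger'

The bracketed first letter just keeps grep from matching its own process.)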

Additionally, getting back to the state in Step 7:
pfctl -s states -vv | grep [VPN tunnel A endpoint IP address] -A3 shows ICMP and UDP states still set to go out the Tier 1 gateway. (Which is why the connection hangs, since the interface for Tunnel A is disabled. TCP traffic should also be going out, and I believe it is; I just haven't managed to capture it with these commands, which is probably down to me being a newbie with them.)
pluginctl -r return_gateways_status | head -n 28 shows the two VPN gateways still active.
(Expected behavior: at this point that command should show that the Tunnel A / Tier 1 gateway is down.)
pluginctl -c monitor also stops the dpinger service for the Tier 1 gateway.
And now pluginctl -r return_gateways_status | head -n 28 shows the Tier 1 gateway is down, and
pfctl -s states -vv | grep [VPN tunnel A endpoint IP address] shows no more states going out the Tier 1 gateway. Traffic is now going out the Tier 2 gateway.
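
Incidentally, for anyone reproducing this: rather than waiting on the monitor, the stale states pinned to Tunnel A can also be flushed by hand. A minimal sketch, with 203.0.113.10 standing in for the Tunnel A endpoint IP:

# Kill states whose source address matches the endpoint:
pfctl -k 203.0.113.10
# Kill states in the other direction (any source, endpoint as destination):
pfctl -k 0.0.0.0/0 -k 203.0.113.10

New connections then re-evaluate the rules and should pick the Tier 2 gateway.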

This looks like a bug, right? Is anyone else experiencing this behavior?

Here's my system log:

Date Severity Process Line
2025-07-21T16:38:43-04:00 Warning opnsense /usr/local/sbin/pluginctl: The required [VPN tunnel A gateway] IPv4 interface address could not be found, skipping.
2025-07-21T16:38:43-04:00 Notice opnsense /usr/local/sbin/pluginctl: plugins_configure monitor (execute task : dpinger_configure_do(1))
2025-07-21T16:38:43-04:00 Notice opnsense /usr/local/sbin/pluginctl: plugins_configure monitor (1)

[executed `pluginctl -c monitor`]

2025-07-21T16:38:06-04:00 Notice configctl event @ 1753130285.75 exec: system event config_changed response: OK
2025-07-21T16:38:06-04:00 Notice kernel <6>wg19: link state changed to DOWN
2025-07-21T16:38:06-04:00 Notice configctl event @ 1753130285.75 msg: Jul 21 16:38:05 OPNsense.corp.example.com config[90489]: config-event: new_config /conf/backup/config-1753130285.7446.xml
2025-07-21T16:38:04-04:00 Notice configctl event @ 1753130283.48 exec: system event config_changed response: OK
2025-07-21T16:38:04-04:00 Notice configctl event @ 1753130283.48 msg: Jul 21 16:38:03 OPNsense.corp.example.com config[25225]: config-event: new_config /conf/backup/config-1753130283.4679.xml

[disabled the WireGuard instance for VPN tunnel A]

Disabling both gateways does lead to hanging, as expected, and re-enabling them one at a time (in any order) then works without issue.

What's the workaround for this? Do I need to run a cron job in the background that restarts the monitors?

I'm happy to provide more detail or logs as needed.

Thanks,

bringbacklanparties

Here's a link related to that new failover / failback feature addition. And here's another related bug report.

As a workaround, I was able to create a cron job that restarts the gateway monitors every minute.

I created a file, /usr/local/opnsense/service/conf/actions.d/actions_restart_monitors.conf:

[run]
command:/usr/local/sbin/restart_monitors.sh
parameters:
type:script
message:Restarting gateway monitors
description:Restart gateway monitors

And the script in /usr/local/sbin/restart_monitors.sh is just:

#!/bin/sh
/usr/local/sbin/pluginctl -c monitor

(Make sure the script has execute permissions: chmod 755 /usr/local/sbin/restart_monitors.sh)

Then run service configd restart and test with configctl restart_monitors run.
With that working, restart the web GUI (or just reboot), then go to System -> Settings -> Cron and add a new cron job that runs the "Restart gateway monitors" action every minute.

I tried having the script run /usr/local/sbin/pluginctl -c monitor twice with a 30-second sleep in between, but that seemed to delay the web GUI during startup, so I suspect something important calls the script synchronously at boot and blocks on the sleep.
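
If the hang really is something waiting synchronously on the script, detaching the body should sidestep it. A sketch of what I mean, untested through a full boot cycle:

#!/bin/sh
# Fire the monitor reload twice, 30 seconds apart, inside a detached
# subshell so the caller (configd) gets its exit status immediately
# and never blocks on the sleep.
(
  /usr/local/sbin/pluginctl -c monitor
  sleep 30
  /usr/local/sbin/pluginctl -c monitor
) > /dev/null 2>&1 &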

Anyone know a nice way to get OPNsense to run cron jobs more than once per minute? I'd prefer an average wait of less than 30 seconds for my connection to re-establish during failover.

Thanks!

July 22, 2025, 04:14:26 PM #3 Last Edit: July 22, 2025, 04:15:57 PM by bringbacklanparties
Hang on, I wasn't thorough with my testing after putting that cron job in place. The monitors start fine, but I can now see that when I disable the Tier 1 gateway, the WireGuard tunnel designated Tier 2 does not show up as the default gateway (via netstat -rn | grep wg), despite default gateway switching being enabled, the Tier 2 gateway being designated an upstream gateway, and both gateways having "Failover" and "Failback" checked in their settings. No traffic goes through the Tier 2 gateway either. In fact, there is now no default gateway at all.
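
(Besides grepping netstat, this also shows the chosen gateway and interface explicitly on FreeBSD:

route -n get default

When the bug hits, it just reports that the route isn't in the table, which matches the hanging traffic.)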

July 22, 2025, 05:07:53 PM #4 Last Edit: July 22, 2025, 06:57:35 PM by bringbacklanparties
I did some more digging around and can reproduce this glitch without designating a gateway group and just using ordinary default gateway switching.

I also restored an earlier configuration that did not use a gateway group. In this configuration, Gateway A for VPN Interface A is what would normally be Tier 1: it has a priority of 225 and is preferred. Gateway B for Interface B is what would normally be Tier 2 (but there's no gateway group): it has a priority of 227 and is next best. Default gateway switching is enabled, both are upstream and far gateways, and failover and failback are enabled for both.

The firewall rules are not particularly relevant (I'm confident nothing is going wrong with them), but briefly: I have a rule passing web traffic from my local network out Gateway A (the priority), and a second rule passing LAN traffic out Gateway B. I also have two floating rules for outbound traffic from the firewall itself: Gateway A first, and Gateway B under that.

So, when testing what I think is a bug here without a gateway group, I disable the instance for VPN Interface A under my WireGuard settings, click "Apply", disable and re-enable WireGuard, then disable the LAN firewall rules for Gateway A and the floating rule for passing firewall-outbound traffic through Gateway A. (The NAT rules are separate for the VPN Tunnel A and VPN Tunnel B interfaces, and both stay enabled for the duration of this test, so they're uninteresting here.)

Starting with both gateways and VPN instances enabled, DNS and web traffic passes properly through VPN Tunnel A, as seen in the live log, and netstat -rn | grep wg shows the Tunnel A gateway address as the default gateway. Next, I disable Interface A under my WireGuard VPN settings. Just as before with a gateway group, the Tunnel A gateway stays green, the monitor shows no change in activity, and the Tunnel A gateway still shows as active in System -> Gateways -> Configuration.

So, under System -> Diagnostics -> Services, I restart the dpinger for Gateway A and it now shows as stopped. In System -> Gateways -> Configuration, Gateway A's status is now red (though the gateway remains enabled) and Gateway B is listed as "active". However, netstat -rn | grep wg shows that Gateway B is not the default gateway; in fact there is no default gateway, and no traffic goes through Gateway A or Gateway B (my connection hangs, for both web traffic and DNS). Manually disabling Gateway A in System -> Gateways -> Configuration then causes Gateway B to show up as the default gateway in netstat -rn | grep wg, and DNS and web traffic start flowing through VPN Tunnel B.

I'm not clear whether, when not explicitly using a gateway group with failover, the intended behavior is for Gateway A to have to be manually disabled before Gateway B takes over as the default gateway. I believe I checked this on OPNsense 25.1.7_2 before upgrading to 25.1.11, and I did not have to disable Gateway A explicitly for Gateway B to take over as the default gateway when Interface A went down. Regardless, with a gateway group established there should definitely be no need to disable Gateway A (and indeed that's impossible, at least through the GUI). Looks like it's time for me to revert to a previous version of OPNsense, as restarting the monitors isn't enough to get the default gateway to switch.

Is anyone else able to reproduce this problem?

Update: the behavior persists in version 25.1.12. It's actually worse: I restored my configuration that uses the gateway group (from the initial post), then tested disabling the VPN instance for Tunnel A. The monitors did not detect the gateway as down and there was no default gateway. Restarting the dpinger for Gateway A marked it as down, but Gateway B was not designated the default gateway. So far that's the same behavior I saw in 25.1.11. Previously, disabling Gateway A then caused the default gateway to be assigned to Gateway B; this time I disabled Gateway A and there was still no default gateway. Running pluginctl -c monitor did not change that, and not even reloading all services from the CLI brought the gateway group back. I had to reboot the firewall, after which Tunnel B was picked up as the default gateway and traffic passed through it normally. Yuck!