Gateway failover group and default route often not reverting to primary gateway

Started by Maurice, April 01, 2025, 07:56:59 PM

I'm currently experiencing issues with a straightforward dual WAN system: Two gateways, both marked upstream, the primary has priority 1, the secondary priority 2. Gateway monitoring is enabled for both. Default gateway switching is enabled globally. There's also a gateway group where the primary gateway is in tier 1 and the secondary is in tier 2. Trigger level is member down.
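
To make the expected behavior concrete, here is a minimal sketch of the selection logic this setup implies. It is not OPNsense's actual implementation: the `Gateway` class, both function names and the gateway names are made up for illustration, and the tier logic assumes that the "member down" trigger simply skips members which monitoring reports as offline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gateway:
    name: str        # hypothetical gateway name
    priority: int    # lower value = preferred for default gateway switching
    tier: int        # tier inside the gateway group
    upstream: bool   # marked as upstream gateway
    online: bool     # state as reported by gateway monitoring


def expected_default(gateways):
    """Return the online upstream gateway with the lowest priority value."""
    candidates = [g for g in gateways if g.upstream and g.online]
    return min(candidates, key=lambda g: g.priority, default=None)


def expected_group_members(gateways):
    """Return the online members of the lowest tier that has any online member."""
    online = [g for g in gateways if g.online]
    if not online:
        return []
    best_tier = min(g.tier for g in online)
    return [g for g in online if g.tier == best_tier]


if __name__ == "__main__":
    # Primary is back online: both selections should return to it.
    gateways = [
        Gateway("WAN1_DHCP6", priority=1, tier=1, upstream=True, online=True),
        Gateway("WAN2_SLAAC", priority=2, tier=2, upstream=True, online=True),
    ]
    print(expected_default(gateways).name)
    print([g.name for g in expected_group_members(gateways)])
```

With both gateways online, both functions should pick the primary; that is the state the system fails to return to in the cases described below.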

When the primary gateway fails, failover to the secondary works fine, both for the default route as well as for the policy rules using the gateway group.

Not so when the primary gateway comes back online. It reliably gets marked as active in System: Gateways: Configuration, but the default route (System: Routes: Status) as well as the policy rules (Firewall: Diagnostics: Statistics: rules) frequently (but not always) stick to the secondary gateway indefinitely.

I remember having had similar issues in the past and that significant improvements have been made in this context. But apparently this hasn't been truly resolved for all scenarios. Are there known open issues in this area? Can't find anything obvious on GitHub at the moment.

The reason why I'm currently noticing this is the primary WAN having more frequent outages, so this might not be a recently introduced issue.

Primary WAN: DHCPv6, request prefix only, interface address configured via optional prefix ID / interface ID setting
Secondary WAN: SLAAC

Cheers
Maurice
OPNsense virtual machine images
OPNsense aarch64 firmware repository

Commercial support & engineering available. PM for details (en / de).

New observation:
Sometimes the system's default route reverts to the primary gateway correctly.
But the policy rules (pass in quick on .. route-to (..) inet6 ..) stick to the secondary gateway.
Clicking Apply in System: Gateways: Configuration fixes the policy rules.
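
To spot this state without clicking through the GUI, a rough sketch like the following could scan the loaded rules (e.g. the text printed by `pfctl -sr`) for `route-to` targets that don't match the expected gateway. The function name, interface names and addresses are hypothetical, and the regex assumes the `route-to (interface gateway)` form quoted above.

```python
import re

# Matches "route-to (<interface> <gateway>)" as it appears in pf rule listings.
ROUTE_TO = re.compile(r"route-to\s+\(([^\s)]+)\s+([^\s)]+)\)")


def stale_route_to_rules(rules_text, expected_gateway):
    """Return rule lines whose route-to gateway differs from expected_gateway."""
    stale = []
    for line in rules_text.splitlines():
        targets = ROUTE_TO.findall(line)
        if any(gateway != expected_gateway for _iface, gateway in targets):
            stale.append(line.strip())
    return stale


if __name__ == "__main__":
    # Hypothetical rule listing; interfaces and addresses are placeholders.
    sample = (
        "pass in quick on igb2 route-to (igb1 fe80::2%igb1) inet6 all\n"
        "pass in quick on igb3 route-to (igb0 fe80::1%igb0) inet6 all\n"
        "block drop in log inet6 all\n"
    )
    for rule in stale_route_to_rules(sample, "fe80::1%igb0"):
        print("still on secondary:", rule)
```

Rules without `route-to` are ignored, so the output only lists policy rules that still point at a gateway other than the expected one.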

Hi Maurice,

Have you tried selecting "Failover States" and "Failback States" in System -> Gateways -> Configuration -> [edit gateway]? Those may be new features as of May (at least the documentation was committed in May). I have also been experiencing similar problems with gateway failover, but in my case the default route doesn't switch to the secondary gateway when the primary goes down. Here is the post, along with some bug reports related to gateway monitors and default gateway switching. It could be that our problems have the same cause.

Yes, I've enabled Failover States (on both gateways) and Failback States (on the secondary gateway) when these features became available.

I've since stopped using gateway groups and now rely on default gateway switching exclusively. This is somewhat more reliable: the system default route correctly fails back to the primary gateway most of the time. But sometimes it still sticks to the secondary gateway (which can be fixed by clicking Apply in System: Routes: Configuration).

Unfortunately, the issue is very intermittent and hard to reproduce.
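
Since this is hard to catch in the act, a small periodic check might help record when the default route is stuck. The sketch below only parses text; it assumes the key/value layout printed by FreeBSD's `route -n get -inet6 default`, and the function names and sample addresses are hypothetical.

```python
def default_gateway(route_get_output):
    """Extract the 'gateway:' value from route-get style output, or None."""
    for line in route_get_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "gateway":
            return value.strip()
    return None


def default_route_stuck(route_get_output, primary_gateway):
    """True if a default gateway is present but is not the primary one."""
    current = default_gateway(route_get_output)
    return current is not None and current != primary_gateway


if __name__ == "__main__":
    # Hypothetical output; interface names and addresses are placeholders.
    sample = (
        "   route to: default\n"
        "destination: default\n"
        "       mask: default\n"
        "    gateway: fe80::2%igb1\n"
        "  interface: igb1\n"
    )
    print(default_route_stuck(sample, "fe80::1%igb0"))
```

Run from cron while the primary gateway is known to be up, and logged with a timestamp, this could help correlate the stuck state with gateway monitoring events.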