OPNsense - Multi-WAN - established VPN connection still using Tier 2 Gateway

Started by schtebo, August 30, 2023, 09:47:26 PM

Hi everyone,
I have successfully set up an OPNsense Multi-WAN configuration. **yeahh** Thank you for the great documentation.
The tests were also successful; only with established VPN connections do I see strange behavior.

I have 2 gateways in a gateway group

Tier 1 100Mbps
Tier 2 5Mbps

When I boot OPNsense and all gateways work as expected, the VPN connections are fast and, judging by Reporting -> Traffic, they go through the Tier 1 gateway.
However, if a failure occurs on Tier 1, the Tier 2 gateway in the gateway group takes over as expected.
Everything as expected so far.

However, when the Tier 1 gateway becomes available again, the established VPN connection keeps using the Tier 2 gateway.
New connections are established via Tier 1.
Is there a way to force existing connections to use the Tier 1 gateway as well?

Thank you


I'm sorry for that. We run on:

Version:
OPNsense 23.7.2-amd64
FreeBSD 13.2-RELEASE-p2

More details:
Trigger level in gateway group is set to "packet loss"
All other values/options are set to default.

It's a Zero Trust Tunnel by Cloudflare:
https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/install-and-setup/tunnel-guide/remote/

Ok, thanks. 23.7.3 should not change that picture then.

FWIW, if both gateways are online, the sessions might stick to the secondary simply because of stateful firewalling: the existing states remain valid, so there is no reason to force-close them. The problem eventually sorts itself out.

We could optionally add some sort of "swing back" state killing here, but most likely all it will do is disrupt existing, working connections.


Cheers,
Franco

Great news :) I just upgraded to OPNsense 23.7.3-amd64.
I will check and report whether the behaviour is better now.
Thank you very much!

I would very much welcome this option, as on my side the Tier 2 gateway is limited (100GB / month) and after that, all connections are slowed down to 64kbit/s.
In my case, a short interruption is much better than reaching the monthly limit.

Thank you very much, I really appreciate your work!


Same problem on multiple systems.
OPNsense 23.7.10_1-amd64
I think this is because gateway groups are not selectable under OpenVPN settings -> Interface. On pfSense it is possible to set a gateway group there, and the VPN switches back from the failover WAN to the main WAN as expected.
I've read multiple forum posts about similar issues; the common solution seems to be to create a separate client instance for every WAN and fail over between them. Unfortunately this is not possible with my setup.
This could be solved with a cron job that pings via the WAN and restarts the VPN instance if necessary, but in my book that is an ugly hack.
Is there a reason why OPNsense does not allow selecting a gateway group as the OpenVPN interface?
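
For reference, the cron job idea could look roughly like the Python sketch below. It is only an illustration under assumptions: the function names, the state file and the `restart` callback are hypothetical, and the actual command for restarting the VPN instance is deliberately left out because it depends on the setup. The `-S` source-address option is the one from FreeBSD's ping. The restart only fires once, on the transition of the primary WAN from down to up, so a healthy tunnel is not restarted on every cron run.

```python
import subprocess
import tempfile
from pathlib import Path


def primary_wan_up(source_ip, target, run=subprocess.run):
    """Ping `target` from the primary WAN address; True if the ping succeeds."""
    # -S selects the source address in FreeBSD's ping (Linux uses -I instead).
    result = run(
        ["ping", "-c", "3", "-S", source_ip, target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_restart(was_up, is_up):
    """Restart only when the primary WAN has just come back."""
    return (not was_up) and is_up


def run_once(state_file, is_up, restart):
    """One cron invocation: compare with the previous run, then remember the result."""
    # A missing state file is treated as "was up" so the first run never restarts.
    was_up = state_file.read_text().strip() != "down" if state_file.exists() else True
    state_file.write_text("up" if is_up else "down")
    if should_restart(was_up, is_up):
        restart()  # hypothetical callback; plug in the real VPN restart here
        return True
    return False


# Dry run without touching the network: up, down, down, up, up.
restarts = []
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "primary_wan.state"
    outcomes = [
        run_once(state, up, lambda: restarts.append("restart"))
        for up in (True, False, False, True, True)
    ]
print(outcomes, restarts)
```

In a real cron job, `is_up` would come from `primary_wan_up(...)` with the primary WAN's own address as `source_ip`.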

Regards,
Igor

Thank you so much :) The new update (OPNsense 25.1.6) contains the requested feature.

https://forum.opnsense.org/index.php?topic=47125.msg0;boardseen#new
o system: kill gateways states for failback scenario when a higher priority gateway goes back online

We had this request so often that we decided to find a solution for it.

There are also docs online now explaining the setup:

https://docs.opnsense.org/manual/how-tos/multiwan.html#failover-and-failback-states

Happy we could help :)
Hardware:
DEC740


Unfortunately, this doesn't work for me with WireGuard. It still sticks to the secondary WAN even after the primary WAN comes back up. Only restarting WireGuard forces it to fail back to the primary.

Probably not a firewall issue, but a WireGuard issue?

Not using gateway groups, but default gateway switching.
OPNsense virtual machine images
OPNsense aarch64 firmware repository

Commercial support & engineering available. PM for details (en / de).

Did you verify that other states were indeed killed on the failback, e.g., clients with sessions towards the internet?

Essentially, I assume WireGuard follows the system's default route to initiate the connection.

Though if the other side thinks the socket is still WAN2:51820 and not WAN1:51820, it will probably send its packets there and initiate another handshake.

I think it depends on whether there are firewall rules that allow connections to WAN1 and WAN2 on the WireGuard ports, or whether only outgoing connections are allowed.

I did not test this with WireGuard specifically, though.

Other states do indeed get killed on failback. I tested this with an SSH session which gets reset when the primary WAN goes up again.

The default route successfully fails back to the primary WAN. WireGuard doesn't seem to be bothered though.

There is no firewall rule that allows incoming connections to the affected wg instance. Also, the secondary WAN is an LTE connection and the ISP blocks all inbound connections anyway.

As far as I know, it only kills states that have a gateway attached in pf. Maybe WireGuard's states do not, and thus they are not killed.
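
To make that guess concrete, here is a toy Python model of the behaviour I mean. It is not pf's real data structure or the actual OPNsense code: the `State` class, its fields and the sample entries are made up for illustration, and whether WireGuard's states really lack a gateway is exactly the open question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class State:
    label: str
    interface: str
    gateway: Optional[str]  # None models a state with no gateway attached


def kill_states_for_gateway(states, gateway):
    """Split states into (kept, killed); only states tagged with `gateway` are killed."""
    killed = [s for s in states if s.gateway == gateway]
    kept = [s for s in states if s.gateway != gateway]
    return kept, killed


# Hypothetical state table while traffic runs over the secondary WAN.
states = [
    State("SSH session from a LAN client", "opt1", "WAN_SECONDARY_DHCP"),
    State("WireGuard tunnel from the firewall itself", "opt1", None),
]

kept, killed = kill_states_for_gateway(states, "WAN_SECONDARY_DHCP")
print("killed:", [s.label for s in killed])
print("kept:  ", [s.label for s in kept])
```

In this model the SSH session is reset on failback while the WireGuard state survives, which would match what was reported above.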

Hi,
Version: OPNsense 25.1.8_1-amd64
Problem: I think some states get stuck associated with the wrong gateway, so state killing either misses them or kills the wrong ones.

I've got 2 gateways failing over successfully with the following configured:

Primary has "Failover States" checked.
Secondary has "Failover States" checked and "Failback States" checked.
The gateways are in a group together.
"System -> Settings -> General -> Allow default gateway switching" is checked.
"Firewall -> Settings -> Advanced -> Bind states to interface" is checked.
opt1/vlan0.99 is the Secondary
opt2/vlan0.100 is the Primary

Test case:

I'm running 4 continuous pings to 8.8.8.8, 8.8.4.4, 1.1.1.1, 1.0.0.1.
I disconnect the cable to the Primary, failover triggers, and all pings move to the Secondary:

Failover:

<13>1 2025-06-13T15:52:24+00:00 OPNsense.localdomain opnsense 2465 - [meta sequenceId="1"] /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
<13>1 2025-06-13T15:52:25+00:00 OPNsense.localdomain opnsense 2465 - [meta sequenceId="2"] /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt2
<13>1 2025-06-13T15:52:25+00:00 OPNsense.localdomain opnsense 2465 - [meta sequenceId="3"] /usr/local/etc/rc.routing_configure: ROUTING: keeping inet default route to 85.*.*.*
<13>1 2025-06-13T15:52:25+00:00 OPNsense.localdomain opnsense 6518 - [meta sequenceId="4"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for deferred gateway WAN_SECONDARY_DHCP [cb8015ec-602b-4474-8844-032c38713239]
<13>1 2025-06-13T15:52:25+00:00 OPNsense.localdomain opnsense 6518 - [meta sequenceId="5"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (1,[])
<13>1 2025-06-13T15:52:25+00:00 OPNsense.localdomain opnsense 6518 - [meta sequenceId="6"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (execute task : dpinger_configure_do(1,[]))
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 24568 - [meta sequenceId="7"] /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 24568 - [meta sequenceId="8"] /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_PRIMARY_DHCP
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 24568 - [meta sequenceId="9"] /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt1
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 24568 - [meta sequenceId="10"] /usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to 10.0.0.1
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 28798 - [meta sequenceId="11"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for unreachable gateway WAN_PRIMARY_DHCP [8b028840-6150-4c90-bf79-1c562f2f0109]
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 28798 - [meta sequenceId="12"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (1,[WAN_PRIMARY_DHCP])
<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 28798 - [meta sequenceId="13"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (execute task : dpinger_configure_do(1,[WAN_PRIMARY_DHCP]))

Then, I reconnect the cable but only half the pings move back to the Primary (1.1.1.1 and 8.8.4.4 move back whereas 8.8.8.8 and 1.0.0.1 stick to the Secondary).

Failback:

<13>1 2025-06-13T15:54:01+00:00 OPNsense.localdomain opnsense 21512 - [meta sequenceId="1"] /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
<13>1 2025-06-13T15:54:01+00:00 OPNsense.localdomain opnsense 21512 - [meta sequenceId="2"] /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt2
<13>1 2025-06-13T15:54:01+00:00 OPNsense.localdomain opnsense 21512 - [meta sequenceId="3"] /usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to 85.*.*.*
<13>1 2025-06-13T15:54:01+00:00 OPNsense.localdomain opnsense 27262 - [meta sequenceId="4"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for deferred gateway WAN_SECONDARY_DHCP [57ca2e8b-5579-4256-83ed-0c6a641a2226]
<13>1 2025-06-13T15:54:01+00:00 OPNsense.localdomain opnsense 27262 - [meta sequenceId="5"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (1,[])
<13>1 2025-06-13T15:54:01+00:00 OPNsense.localdomain opnsense 27262 - [meta sequenceId="6"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (execute task : dpinger_configure_do(1,[]))
<13>1 2025-06-13T15:54:12+00:00 OPNsense.localdomain opnsense 58537 - [meta sequenceId="7"] /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
<13>1 2025-06-13T15:54:13+00:00 OPNsense.localdomain opnsense 58537 - [meta sequenceId="8"] /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt2
<13>1 2025-06-13T15:54:13+00:00 OPNsense.localdomain opnsense 58537 - [meta sequenceId="9"] /usr/local/etc/rc.routing_configure: ROUTING: keeping inet default route to 85.*.*.*
<13>1 2025-06-13T15:54:13+00:00 OPNsense.localdomain opnsense 61983 - [meta sequenceId="10"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for deferred gateway WAN_SECONDARY_DHCP [50a1bf4b-de29-4eae-8336-2be347d4b8d4]
<13>1 2025-06-13T15:54:13+00:00 OPNsense.localdomain opnsense 61983 - [meta sequenceId="11"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (1,[])
<13>1 2025-06-13T15:54:13+00:00 OPNsense.localdomain opnsense 61983 - [meta sequenceId="12"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (execute task : dpinger_configure_do(1,[]))

Then, I disconnect the Primary again. Failover occurs:

<13>1 2025-06-13T16:05:22+00:00 OPNsense.localdomain opnsense 62144 - [meta sequenceId="1"] /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
<13>1 2025-06-13T16:05:22+00:00 OPNsense.localdomain opnsense 62144 - [meta sequenceId="2"] /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt2
<13>1 2025-06-13T16:05:22+00:00 OPNsense.localdomain opnsense 62144 - [meta sequenceId="3"] /usr/local/etc/rc.routing_configure: ROUTING: keeping inet default route to 85.*.*.*
<13>1 2025-06-13T16:05:22+00:00 OPNsense.localdomain opnsense 64584 - [meta sequenceId="4"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for deferred gateway WAN_SECONDARY_DHCP [64c0f9eb-2187-4d6f-a76d-137ab7d4b98c]
<13>1 2025-06-13T16:05:22+00:00 OPNsense.localdomain opnsense 64584 - [meta sequenceId="5"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (1,[])
<13>1 2025-06-13T16:05:22+00:00 OPNsense.localdomain opnsense 64584 - [meta sequenceId="6"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (execute task : dpinger_configure_do(1,[]))
<13>1 2025-06-13T16:05:33+00:00 OPNsense.localdomain opnsense 84039 - [meta sequenceId="7"] /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
<13>1 2025-06-13T16:05:34+00:00 OPNsense.localdomain opnsense 84039 - [meta sequenceId="8"] /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_PRIMARY_DHCP
<13>1 2025-06-13T16:05:34+00:00 OPNsense.localdomain opnsense 84039 - [meta sequenceId="9"] /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt1
<13>1 2025-06-13T16:05:34+00:00 OPNsense.localdomain opnsense 84039 - [meta sequenceId="10"] /usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to 10.0.0.1
<13>1 2025-06-13T16:05:34+00:00 OPNsense.localdomain opnsense 88040 - [meta sequenceId="11"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for unreachable gateway WAN_PRIMARY_DHCP [e4b28c89-dc67-4b97-be4e-08b9dabc9b1e]
<13>1 2025-06-13T16:05:34+00:00 OPNsense.localdomain opnsense 88040 - [meta sequenceId="12"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (1,[WAN_PRIMARY_DHCP])
<13>1 2025-06-13T16:05:34+00:00 OPNsense.localdomain opnsense 88040 - [meta sequenceId="13"] /usr/local/etc/rc.syshook.d/monitor/20-recover: plugins_configure monitor (execute task : dpinger_configure_do(1,[WAN_PRIMARY_DHCP]))


The 2 pings that were flowing through the Secondary are killed anyway, I think because their states are still associated with the Primary.
All pings now flow through the Secondary.

I reconnect the Primary and failback occurs. 3 pings still flow through the Secondary and 1 ping flows through the Primary.
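
To compare the failover and failback runs more easily, the relevant ROUTING events can be pulled out of the pasted log lines. Below is a minimal Python sketch, assuming the lines look exactly like the ones above; the function name `routing_events` and the event labels are my own, not anything OPNsense provides.

```python
import re

# Matches the syslog lines as pasted above and captures timestamp, script and message.
LINE_RE = re.compile(
    r'^<\d+>1 (?P<ts>\S+) \S+ \S+ \d+ - \[meta sequenceId="\d+"\] '
    r'(?P<script>\S+): (?P<msg>.*)$'
)

# Event labels are arbitrary names chosen for this sketch.
EVENT_PATTERNS = [
    ("default_route", re.compile(r"ROUTING: setting inet default route to (\S+)")),
    ("kill_deferred", re.compile(r"ROUTING: killing states for deferred gateway (\S+)")),
    ("kill_unreachable", re.compile(r"ROUTING: killing states for unreachable gateway (\S+)")),
    ("gateways_down", re.compile(r"ROUTING: ignoring down gateways: (.+)")),
]


def routing_events(lines):
    """Return (timestamp, event, detail) tuples for the lines that matter."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        for kind, pattern in EVENT_PATTERNS:
            hit = pattern.search(match.group("msg"))
            if hit:
                events.append((match.group("ts"), kind, hit.group(1)))
                break
    return events


# Three lines copied from the first failover log above.
sample = [
    '<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 24568 - [meta sequenceId="8"] /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_PRIMARY_DHCP',
    '<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 24568 - [meta sequenceId="10"] /usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to 10.0.0.1',
    '<13>1 2025-06-13T15:52:36+00:00 OPNsense.localdomain opnsense 28798 - [meta sequenceId="11"] /usr/local/etc/rc.syshook.d/monitor/20-recover: ROUTING: killing states for unreachable gateway WAN_PRIMARY_DHCP [8b028840-6150-4c90-bf79-1c562f2f0109]',
]

events = routing_events(sample)
for event in events:
    print(event)
```

Run over the full logs, this makes it easy to see which gateway's states were killed at each step and to line that up with which pings moved.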