Difference between default gateway order and gateway groups?

Started by CJ, June 08, 2024, 06:16:21 PM

I'm working to configure OPNsense to fail over to my backup WAN when the primary WAN has connection issues.

The gateway documentation page implies that all I need to do is set the priorities and upstream correctly and it will automatically happen with nothing further if the connection goes down.  If I want it to happen due to latency or packet loss, I need to enable Gateway Switching.  https://docs.opnsense.org/manual/gateways.html

The multiWAN page implies that in order to get failover to work then I need to create gateway groups and then edit all of my firewall rules to point to the new groups.  It makes no mention of gateway priorities or upstream.  https://docs.opnsense.org/manual/how-tos/multiwan.html

I configured a gateway group and set it on one of my network segments, but when I tested by marking the main gateway as down, the default gateway switched over to the backup gateway and everything continued working.  Removing the marked-down setting from the gateway caused the default to switch back to it, and all traffic moved back with it.

What is the benefit of using gateway groups for a failover configuration?  From what I can tell, if I wanted to do load balancing then the groups would be required but I'm not sure what I'm missing for a simple failover.

With gateway priority, the preferred gateway is written to the routing table as the default route.
With gateway groups, you need policy-based routing, since the default route will still rely on gateway priority.
I don't know if there are other differences...
I am not an expert... just trying to help...
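
To illustrate the default-route side of this, here is a minimal Python sketch. It is not OPNsense's actual code. The `Gateway` class, `pick_default_gateway`, and the priority values are hypothetical, and it assumes that a lower priority number is preferred and that gateways marked down are skipped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gateway:
    name: str
    priority: int   # assumption: lower number = preferred
    up: bool = True


def pick_default_gateway(gateways: list[Gateway]) -> Optional[Gateway]:
    """Return the preferred gateway that is up, or None if all are down."""
    candidates = [gw for gw in gateways if gw.up]
    return min(candidates, key=lambda gw: gw.priority, default=None)


wan = Gateway("WAN_DHCP", priority=1)
wan2 = Gateway("WAN2_DHCP", priority=2)

print(pick_default_gateway([wan, wan2]).name)   # WAN_DHCP
wan.up = False                                  # primary marked down
print(pick_default_gateway([wan, wan2]).name)   # WAN2_DHCP
wan.up = True                                   # primary restored
print(pick_default_gateway([wan, wan2]).name)   # WAN_DHCP
```

A gateway group used in a firewall rule would sit on top of this as policy-based routing for matching traffic only; the sketch covers just the system default route.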

One particular issue I've had with default gateway switching was that VPN tunnels would fail over to the secondary gateway when the primary went down, but they would stay there even when the primary gateway came back.

There are some other quirks around default gateway switching, and for me the question is whether or not I care about the firewall itself losing internet connectivity while the primary line is down. If the answer is no, then using gateway groups and policy routing works better in my experience.

So is this a bug? Because when using gateway switching, it technically should switch back to the primary gateway when it comes back up. Hopefully, someone from the OPNsense team can confirm this.

If "bug" here means "all established connections continue to work uninterrupted" then yes.

I can understand that the behaviour can be undesirable, but you also have to be realistic.


Cheers,
Franco

June 12, 2024, 09:28:33 AM #5 Last Edit: June 12, 2024, 09:32:44 AM by kevindd992002
I get it. But why isn't the gateway priority kicking in when the primary connection comes back online? In pfSense, you can use a failover gateway group as a default gateway so switching happens with that. Is there some disadvantage with that implementation that made that feature not available in OPNsense?

Talking about realistic, if a 5G connection is set as a secondary WAN in OPNsense and a fiber optic connection as primary, wouldn't you want the default gateway to switch back to the fiber optic connection when it comes back online (assuming it switched over to the 5G connection)?

> In pfsense, you can use a failover gateway group as a default gateway so switching happens with that.

[citation needed]

> But why isn't the gateway priority kicking in when the primary connection goes back online

Why should it? The connection is established and working.

> if a 5G connection is set as a secondary WAN

You may be confusing your understanding of your network and how it should behave with the basic approach taken for failover. A dead link will make it switch. Switching away from a healthy link requires additional metric(s) and a design in code and user configuration. Other firewalls may have done this. Whether the expectation of the approach taken matches the implementation done is another question. :)


Cheers,
Franco

Quote from: CJ on June 08, 2024, 06:16:21 PM
I configured a gateway group and set it one of my network segments, but then I tested by marking the main gateway as down, the default gateway switched over to the backup gateway and everything continued working.  Removing the marked down setting from the gateway caused the default to switch back to it and everything transferred back to it.

What is the benefit of using gateway groups for a failover configuration?  From what I can tell, if I wanted to do load balancing then the groups would be required but I'm not sure what I'm missing for a simple failover.

Maybe exactly this

Quote
WAN Failover
WAN failover automatically switches between WAN connections in case of connectivity loss (or high latency) of your primary ISP. As long as the connection is not good all traffic will be routed of the next available ISP/WAN connection and when connectivity is fully restored so will the routing switch back to the primary ISP.

For me it means you have "preemption" when using GW groups vs when you have not.

Regards,
S.

> and when connectivity is fully restored so will the routing switch back to the primary ISP.

The docs are correct, but lack finesse in the wording WRT established connections and perhaps the use of sticky-address option.


Cheers,
Franco

Quote from: Seimus on June 12, 2024, 10:32:30 AM
Maybe exactly this

Quote
WAN Failover
WAN failover automatically switches between WAN connections in case of connectivity loss (or high latency) of your primary ISP. As long as the connection is not good all traffic will be routed of the next available ISP/WAN connection and when connectivity is fully restored so will the routing switch back to the primary ISP.

For me it means you have "preemption" when using GW groups vs when you have not.

This is the part I'm confused about.  In the limited testing I've done, when the original gateway was restored all of the routing switched back to it.  Or are you saying that groups are the only way to have the switch happen due to latency or packet loss and not complete failure?  Because the docs imply that that's determined by the "Allow default gateway switching" setting.  My connection has been remarkably stable since I set up the second link so I can't confirm that it's working correctly.

Quote from: franco on June 12, 2024, 11:00:11 AM
> and when connectivity is fully restored so will the routing switch back to the primary ISP.

The docs are correct, but lack finesse in the wording WRT established connections and perhaps the use of sticky-address option.

I would assume established connections would continue on the backup gateway but any new connections would be on the primary.  Or do you mean something else?

I'm not too concerned about established connections as they will eventually end and things will migrate to the primary then.
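
The behaviour assumed here (existing connections stay where they are, new ones follow the current default) can be pictured with a toy state table. This is only a sketch of the assumption in Python, not a description of how OPNsense or pf actually tracks states. `ToyFirewall` and the connection names are hypothetical, and it deliberately ignores what happens to states on a gateway that has just failed.

```python
class ToyFirewall:
    """Toy model: a connection is pinned to the gateway it was created on."""

    def __init__(self, primary: str, backup: str) -> None:
        self.primary = primary
        self.backup = backup
        self.primary_up = True
        self.states: dict[str, str] = {}   # connection id -> gateway

    def current_gateway(self) -> str:
        """Gateway that a brand-new connection would be routed through."""
        return self.primary if self.primary_up else self.backup

    def send(self, conn_id: str) -> str:
        """Return the gateway used for conn_id, creating a state if needed."""
        # setdefault leaves an existing state untouched; only new
        # connections pick up the current gateway.
        return self.states.setdefault(conn_id, self.current_gateway())


fw = ToyFirewall("WAN", "WAN2")
fw.primary_up = False             # primary fails
print(fw.send("vpn-tunnel"))      # WAN2
fw.primary_up = True              # primary recovers
print(fw.send("vpn-tunnel"))      # WAN2 (existing state is kept)
print(fw.send("new-download"))    # WAN  (new connection uses the primary)
```

In this model a long-lived flow such as a VPN tunnel never moves back on its own, which matches the earlier report of tunnels staying on the secondary gateway.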

Quote from: franco on June 12, 2024, 10:09:20 AM
> In pfsense, you can use a failover gateway group as a default gateway so switching happens with that.

[citation needed]

I'm speaking from experience but here's the citation:

https://docs.netgate.com/pfsense/en/latest/routing/gateways.html#managing-the-default-gateway

You can use gateway groups (https://docs.netgate.com/pfsense/en/latest/routing/gateway-groups.html) for default gateway switching. And by default, gateway groups "keep states on gateway recovery" which means that it will keep the existing states on the backup gateway "until they reconnect".

Is this what you're trying to say as well? The existing connections don't get affected by the primary gateway recovery but new connections will go through the recovered primary gateway? If so, then that is totally fine by me and I don't see any issues there.

Quote from: franco on June 12, 2024, 10:09:20 AM
> But why isn't the gateway priority kicking in when the primary connection goes back online

Why should it? The connection is established and working.

Because it is the "primary" connection. The established connections shouldn't get affected but new connections should go through the primary connection again.

Quote from: franco on June 12, 2024, 10:09:20 AM
Whether the expectation of the approach taken matches the implementation done is another question. :)


Cheers,
Franco


Not really sure what you mean by this?

Just experienced my first connection issue on the primary WAN.  The user experience was that everything hung or showed offline for a bit but then reconnected fine.  Recovery was faster than waiting for the primary WAN to recover on its own usually takes, but still noticeable.

The logs tell an interesting story.  It only takes five seconds to go from packet loss to being marked as down, but the routing reconfiguration happens at the same time as the gateways are marked down, so it's hard to confirm whether the gateway switching setting worked.  I'm not sure why it says that it's keeping IPv6 on WAN2, as WAN is higher priority.

A little over two minutes later WAN switches to delay for IPv4 with no change on IPv6, but the routing for both changes back to WAN.  Then about ten seconds later WAN goes completely clean and the routing configuration is run again but with no changes.

0s
MONITOR: WAN_DHCP (Alarm: none -> loss RTT: 11.4 ms RTTd: 1.9 ms Loss: 12.0 %)
MONITOR: WAN_DHCP6 (Alarm: none -> loss RTT: 35.7 ms RTTd: 54.1 ms Loss: 12.0 %)

5s
ALERT: WAN_DHCP (Alarm: loss -> down RTT: 11.5 ms RTTd: 1.9 ms Loss: 21.0 %)
ALERT: WAN_DHCP6 (Alarm: loss -> down RTT: 34.4 ms RTTd: 53.5 ms Loss: 21.0 %)
reconfiguriging routing due to gateway alarm
/usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
/usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
/usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on WAN2
/usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to WAN2_DHCP_GW
/usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on WAN2
/usr/local/etc/rc.routing_configure: ROUTING: keeping inet6 default route to WAN2_DHCP6_GW

130s
ALERT: WAN_DHCP (Alarm: down -> delay RTT: 485.4 ms RTTd: 1473.6 ms Loss: 0.0 %)
reconfiguriging routing due to gateway alarm
/usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
/usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on wan
/usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to WAN_DHCP_GW
/usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on wan
/usr/local/etc/rc.routing_configure: ROUTING: setting inet6 default route to WAN_DHCP6_GW

141s
MONITOR: WAN_DHCP (Alarm: delay -> none RTT: 10.8 ms RTTd: 0.8 ms Loss: 0.0 %)
ALERT: WAN_DHCP6 (Alarm: down -> none RTT: 34.9 ms RTTd: 71.8 ms Loss: 1.0 %)
reconfiguriging routing due to gateway alarm
/usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
/usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on wan
/usr/local/etc/rc.routing_configure: ROUTING: keeping inet default route to WAN_DHCP_GW
/usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on wan
/usr/local/etc/rc.routing_configure: ROUTING: keeping inet6 default route to WAN_DHCP6_GW


Also, there's a typo in the logs.  "reconfiguriging"
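
The MONITOR/ALERT lines above follow a regular pattern, so the alarm transitions can be pulled out with a short script when there are more of them to read through. This is a rough Python sketch based only on the line format pasted in this post; `parse_alarm` and `Transition` are made-up names, and other versions may log a different format.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the log lines quoted in this post.
ALARM_RE = re.compile(
    r"(?:MONITOR|ALERT): (?P<gateway>\S+) "
    r"\(Alarm: (?P<old>\w+) -> (?P<new>\w+) "
    r"RTT: (?P<rtt>[\d.]+) ms RTTd: (?P<rttd>[\d.]+) ms "
    r"Loss: (?P<loss>[\d.]+) %\)"
)


class Transition(NamedTuple):
    gateway: str
    old: str
    new: str
    rtt_ms: float
    loss_pct: float


def parse_alarm(line: str) -> Optional[Transition]:
    """Parse one monitor line; return None for any other log line."""
    m = ALARM_RE.search(line)
    if m is None:
        return None
    return Transition(m["gateway"], m["old"], m["new"],
                      float(m["rtt"]), float(m["loss"]))


sample = [
    "ALERT: WAN_DHCP (Alarm: loss -> down RTT: 11.5 ms RTTd: 1.9 ms Loss: 21.0 %)",
    "/usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to WAN2_DHCP_GW",
    "ALERT: WAN_DHCP (Alarm: down -> delay RTT: 485.4 ms RTTd: 1473.6 ms Loss: 0.0 %)",
]

for line in sample:
    t = parse_alarm(line)
    if t is not None:
        print(f"{t.gateway}: {t.old} -> {t.new} (loss {t.loss_pct}%)")
```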

So just to be sure, when the primary WAN comes back, all new connections will go through that link and all established/old connections stay in the backup link, correct?