OPNsense Forum

Archive => 20.1 Legacy Series => Topic started by: drivera on March 08, 2020, 05:28:02 pm

Title: Default gateway not being assigned properly on failover
Post by: drivera on March 08, 2020, 05:28:02 pm
Hi!

I've been having problems with failover for some time, and I think I've more clearly figured out the circumstances (if not the root cause). These problems have carried over to 20.1.2 which is why I'm bringing this thread back up to this forum.

I have two circuits - MAIN and BKUP - which are the only upstream circuits available, and are clearly marked as such, and properly given priority among them (MAIN has a priority of 1, BKUP has a priority of 2). All other gateways - including the ones generated from some OpenVPN clients I have configured - are not marked as upstream, and have a priority of 255 (the default value).

I've disabled any and all routing configuration or customization on those VPN links (i.e. Don't pull routes is checked, and pull-filter ignore "redirect-gateway" is added into the Advanced configuration section), instead opting for policy-based routing rules within the firewall to forward traffic as appropriate. So far so good, and everything works as intended.

So... on to the scenario...

Whenever the MAIN circuit fails, BKUP immediately takes over and the system's default gateway is selected to route over the BKUP Gateway. However, the main circuit crash also causes the OpenVPN clients' connections to die, which means they need to be brought back up. When they are, their interfaces are also taken down and brought back up (which would make sense as this is how OpenVPN works), and I suspect that this triggers a recalculation of the default gateway "somewhere, by someone" (not sure what part of what code does that yet).

This will result in either the default gateway being left blank, or the gateway being erroneously assigned to the one from one of those VPN links. This makes no sense for several reasons, the biggest one being that none of those gateways is marked as upstream, and thus should not be eligible for selection as the default gateway.
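
To make the expectation above concrete, here is a minimal Python sketch of the selection rule as described in this post: only gateways that are marked upstream and currently up are eligible, and the lowest priority number wins. It is a toy model of the expected behaviour, not OPNsense's actual code. The `Gateway` class, `pick_default_gateway`, and the `VPN1` name are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gateway:
    name: str
    priority: int = 255     # lower number = preferred; 255 is the default mentioned above
    upstream: bool = False  # only upstream gateways should be default-route candidates
    up: bool = True


def pick_default_gateway(gateways: list[Gateway]) -> Optional[Gateway]:
    """Return the eligible gateway with the lowest priority number, or None."""
    eligible = [g for g in gateways if g.upstream and g.up]
    return min(eligible, key=lambda g: g.priority, default=None)


gateways = [
    Gateway("MAIN", priority=1, upstream=True, up=False),  # MAIN circuit has failed
    Gateway("BKUP", priority=2, upstream=True),
    Gateway("VPN1"),  # hypothetical OpenVPN client gateway: not upstream, priority 255
]

chosen = pick_default_gateway(gateways)
print(chosen.name if chosen else "no default gateway")
```

Under this model the VPN gateway can never be chosen, and the default gateway is blank only when both upstream circuits are down, which is the behaviour I would expect after a failover.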

Needless to say that when the default gateway is incorrectly configured, traffic will not be forwarded properly and internet service all but grinds to a halt.

There is a fairly simple - albeit manual - solution: log into the firewall's UI, open (edit) any gateway (most commonly the BKUP circuit's gateway), and save it without making any changes. When I apply the changes, this apparently triggers the default gateway computation code and causes the correct default gateway to be selected and configured.

However: the whole point of having failover is so that the system itself can automatically switch between circuits correctly, without human intervention.

I've been struggling with this one for months.

The biggest questions I have are:

So... any ideas? Also: let me know if you think this is more appropriate to be reported as an issue in GitHub.

Thanks!
Title: Re: Default gateway not being assigned properly on failover
Post by: russella on March 09, 2020, 05:55:45 pm
I don't know if this will help. (Please Note: I don't use VPNs). But my failover setup has worked perfectly for at least a couple of years with all versions of OPNsense from 17 on up and I noticed a difference between your setup and mine. I notice you have set the Single gateway priorities to different values. In my setup I have them both set to exactly the same value (255). I set the priority in the Gateway Group settings differently with the primary set to Tier 1 and the backup set to Tier 2.
Title: Re: Default gateway not being assigned properly on failover
Post by: drivera on March 10, 2020, 03:14:55 pm
Right...the different priorities are because I want one service to be preferred over the other. The MAIN circuit is 200/10 while the BKUP circuit is 10/2 (i.e. only for emergencies). Thus, what I want is for the MAIN circuit to be used for internet access whenever it's available, and to fall back to BKUP only if there's no other choice (i.e. some access is better than none).

I've come up with scripts that can monitor the default gateway configuration and I could definitely add code to trigger the gateway (re-)calculation code...if I only knew how to do that (documentation is scarce on this).
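
For reference, the monitoring half of such a script could look roughly like the Python sketch below. It parses the output of FreeBSD's `route -n get default` and checks whether the default route sits on one of the expected WAN interfaces. The interface names (`em0`, `em1`), the sample output, and the function names are assumptions for illustration. The step that would trigger the gateway recalculation is deliberately left out, since that is the part I don't know how to do.

```python
import re
import subprocess


def parse_default_route(route_output: str) -> dict:
    """Extract the 'gateway' and 'interface' fields from `route -n get default` output."""
    fields = {}
    for line in route_output.splitlines():
        match = re.match(r"\s*(gateway|interface):\s*(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def default_route_ok(route_output: str, allowed_interfaces: set) -> bool:
    """True only if a default route exists and uses one of the allowed interfaces."""
    route = parse_default_route(route_output)
    return route.get("interface") in allowed_interfaces


def read_default_route() -> str:
    """Run `route -n get default` and return its stdout (no fields parse if the route is missing)."""
    result = subprocess.run(
        ["route", "-n", "get", "default"], capture_output=True, text=True
    )
    return result.stdout


# Illustrative sample only; addresses and interface names are made up.
SAMPLE = """\
   route to: default
destination: default
       mask: default
    gateway: 203.0.113.1
        fib: 0
  interface: em1
"""

WAN_INTERFACES = {"em0", "em1"}  # assumed names for the MAIN and BKUP interfaces
print(default_route_ok(SAMPLE, WAN_INTERFACES))
```

On the firewall itself the check would be `default_route_ok(read_default_route(), WAN_INTERFACES)`. A blank default gateway and a default gateway on a VPN interface both come back as not OK.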

I'll keep poking around to see if I can make it work... I'm sure the problem has to do with how routes are recalculated when interfaces are added/removed (i.e. when OpenVPN clients go up/down).

Cheers...
Title: Re: Default gateway not being assigned properly on failover
Post by: drivera on March 13, 2020, 02:16:57 am
It seems the fix from this ticket did the trick: https://github.com/opnsense/core/issues/3961

Gateways are now always assigned, and I've worked around the problem of "ineligible" gateways being selected for default gateway duty by marking them as down (i.e. disable monitoring). This isn't ideal, but it does the trick.

Cheers!
Title: Re: Default gateway not being assigned properly on failover
Post by: russella on March 14, 2020, 07:14:04 pm
"Right...the different priorities are because I want one service to be preferred over the other. The MAIN circuit 200/10 while the BKUP circuit is 10/2 (i.e. only for emergencies). Thus, what I want to have happen is to have the MAIN circuit be used for internet access whenever it's available, and only fall back to BKUP if there's no other choice (i.e. some access is better than none)."

Yeah, I thought that was why you were doing it. But it's not the right way to achieve it. For both load balancing and failover you should use Gateway Groups.

After setting up your gateways (System->Gateways->Single) you should then create a gateway group (System->Gateways->Group).

For a failover group set your primary (MAIN) to Tier 1 and the backup (BKUP) to Tier 2. In your case I would set the Trigger Level to 'Member Down' (Supposedly triggers with 100% packet loss).

You may wish to consider the other Trigger Level options (e.g. High Latency, Packet Loss or Both). Although I can't find any documentation to confirm it, I believe the Trigger Levels for High Latency or Packet Loss are the higher 'To' values you set on the Gateways->Single page (i.e. if you accepted the defaults, a Latency above 500 milliseconds OR a Packet Loss above 20%).

Once a gateway is marked as down, if there are no other gateways in the same tier, traffic will fail over to the next tier.
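
As a rough illustration of how I understand the Trigger Level options, here is a small Python sketch. It is only a model of my reading above, not OPNsense code. The trigger names, the function name, and the assumption that a fully down member is excluded under every trigger level are mine. The 20% and 500 ms limits are the defaults I mentioned.

```python
def is_excluded(trigger: str, loss_pct: float, latency_ms: float,
                loss_limit: float = 20.0, latency_limit: float = 500.0) -> bool:
    """Return True if a group member should be taken out of service.

    trigger is one of: "member_down", "packet_loss", "high_latency",
    "packet_loss_or_high_latency" (names invented for this sketch).
    """
    down = loss_pct >= 100.0              # 'Member Down': nothing gets through
    high_loss = loss_pct > loss_limit     # above the higher 'To' packet loss value
    high_latency = latency_ms > latency_limit  # above the higher 'To' latency value

    rules = {
        "member_down": down,
        "packet_loss": down or high_loss,
        "high_latency": down or high_latency,
        "packet_loss_or_high_latency": down or high_loss or high_latency,
    }
    return rules[trigger]


# A link with 30% loss stays in service under 'Member Down' but not under 'Packet Loss'.
print(is_excluded("member_down", loss_pct=30, latency_ms=50))
print(is_excluded("packet_loss", loss_pct=30, latency_ms=50))
```

So with 'Member Down' a lossy but still working MAIN stays in use, while the stricter options would push traffic to BKUP sooner.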

If you have multiple primaries or backups and want to load balance these in a failover scenario you would put the primaries on the same Tier (e.g. MAIN1 and MAIN2 on Tier 1 and BKUP1 and BKUP2 on Tier 2). You have up to 5 tiers to play with so you could have a backup for your backup if you wanted (e.g. put BKUP2 on Tier 3).

Also, if you happen to want asymmetric load balancing on a tier, you can achieve that by setting the Weight value on the Single Gateway settings (the higher the weight value, the more traffic goes via that gateway).
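
The tier and weight logic described above can be sketched in a few lines of Python. This is a simplified model rather than how OPNsense implements it: the lowest-numbered tier that still has an up member carries the traffic, and I am assuming traffic is shared within that tier in proportion to the weights. The function name and the example weights are made up.

```python
from collections import defaultdict


def active_members(group):
    """group: iterable of (name, tier, weight, up) tuples.

    Returns {name: traffic_share} for the lowest tier that has at least
    one member up, or {} if every member is down.
    """
    by_tier = defaultdict(list)
    for name, tier, weight, up in group:
        if up:
            by_tier[tier].append((name, weight))
    if not by_tier:
        return {}
    best_tier = min(by_tier)  # Tier 1 beats Tier 2, and so on
    total = sum(weight for _, weight in by_tier[best_tier])
    return {name: weight / total for name, weight in by_tier[best_tier]}


group = [
    ("MAIN1", 1, 3, True),
    ("MAIN2", 1, 1, True),
    ("BKUP1", 2, 1, True),
    ("BKUP2", 3, 1, True),
]
print(active_members(group))
```

With both primaries up, MAIN1 takes three quarters of the traffic and MAIN2 one quarter. If both primaries go down, everything moves to BKUP1, and BKUP2 on Tier 3 is only used once BKUP1 is down as well.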

There's a bit more to do after that as you need to set the Gateway in the Firewall->Rules->LAN "Default allow LAN to any rule" to the failover group gateway you created under System->Gateways->Group. Take a look at this link (I found it invaluable) for more information: https://www.thomas-krenn.com/de/wiki/OPNsense_Multi_WAN#Failover
Title: Re: Default gateway not being assigned properly on failover
Post by: drivera on March 14, 2020, 07:19:44 pm
I used to use Gateway groups, but moved away from them as there is no group-level functionality for backrouting (i.e. respond to packets on the interface they're received on).

This is why I'm doing things the way I am. That bit was already addressed by earlier support tickets.

Thanks!
Title: Re: Default gateway not being assigned properly on failover
Post by: russella on March 21, 2020, 01:39:04 pm
Have you tried using Sticky connections (Firewall->Settings->Advanced)?
Title: Re: Default gateway not being assigned properly on failover
Post by: drivera on March 21, 2020, 05:10:29 pm
Trust me: I tried everything, and removing gateway groups was the only solution the devs offered because BSD doesn't support routeback rules for gateway groups.

Sticky connections only work for outgoing traffic. I need routeback to work for incoming traffic.