OPNsense Forum

Archive => 19.7 Legacy Series => Topic started by: drivera on January 19, 2020, 12:08:54 am

Title: Failover glitches prevent it from working properly
Post by: drivera on January 19, 2020, 12:08:54 am
Hi!

I've noticed that a few minutes after the initial failover, the default gateway configuration gets cleared even though the failover had occurred successfully. As a result, routing to the internet no longer works despite there being an active, healthy secondary gateway available. I'm using multiple upstream gateways with differing priorities, and except for this glitch the configuration seems to work as intended.

The only way to recover is to log onto the UI, edit one of the gateways (the healthy one, for instance), save it without making any changes, and click "Apply Changes". This triggers the code that recalculates the correct gateway and fixes the configuration.

Very often this has to be done two or three times before it takes and normal network functionality is restored.  If this isn't done, the gateway configuration remains incorrect until the primary circuit returns. Obviously this defeats the purpose of any failover configuration.

However, once the primary circuit comes back to life everything returns to normal on its own.

Maybe the issue is related to the fact that the primary circuit is still online (it still has an IP and the link is still UP), but it's effectively dead because some segment downstream is dead? Thus, the circuit's configured upstream gateway is down (and correctly detected as such) even though the interface isn't dead per se. Perhaps that's what's confusing the gateway calculation algorithm?

I've written a script I use to monitor the gateway configuration which I could easily enough turn into a monitoring daemon (of sorts) that could trigger the gateway calculation/reconfiguration code when it detects that the default gateway has been left empty.  However: I don't know how to do that from the O/S CLI. Any ideas?

Is there documentation anywhere regarding the scripts/commands that are available at the CLI level to invoke OPNsense functionality?

Perhaps that daemon would only trigger the "repair" when it detects that one of the (higher-priority) upstream gateways is both enabled and "down" (i.e. we're in a failover state) ... this way it would minimize interference with normal operation when everything is OK....?
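
A minimal sketch of the trigger condition described above, not OPNsense code. The `Gateway` record, its field names, and the convention that a lower `priority` number means "more preferred" are assumptions made for illustration. How the repair itself would be invoked is deliberately left out, since nothing in this thread establishes the command for it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    """Hypothetical view of one configured gateway (field names are assumptions)."""
    name: str
    priority: int   # assumption: lower number = more preferred
    enabled: bool
    upstream: bool
    online: bool    # result of whatever health check is in use


def should_repair(default_gateway: Optional[str], gateways: List[Gateway]) -> bool:
    """Return True only when the default route is empty while in a failover state."""
    if default_gateway:
        return False  # a default route is present; do not interfere
    candidates = [g for g in gateways if g.enabled and g.upstream]
    healthy = [g for g in candidates if g.online]
    if not healthy:
        return False  # no healthy upstream gateway to fall back on
    best_healthy = min(g.priority for g in healthy)
    # Failover state: an enabled upstream gateway that outranks the best
    # healthy one is currently down.
    return any(not g.online and g.priority < best_healthy for g in candidates)
```

With this check, the daemon would stay idle whenever a default route exists or every preferred gateway is healthy, which keeps it out of the way during normal operation.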

Thoughts?

Thanks!
Title: Re: Failover glitches prevent it from working properly
Post by: drivera on January 19, 2020, 12:40:38 am
Ok, new behavior: the default gateway configuration code is now considering (and configuring) non-upstream gateways for default gateway.

I have an OpenVPN connection to ExpressVPN (for content streaming) configured in the firewall, and all the necessary rules to use it only for my streaming devices. Clearly it's not configured as an upstream gateway, since it depends on either of the two actual physical circuits, which are marked as upstream.

I've just had a soft outage of the primary service (everything is up, but something went wrong in the ISP's network that routing is borked), and guess what?  The ExpressVPN gateway was chosen as the default gateway for the system despite the fact that the secondary gateway was still up and in good health!!!

Eventually the system righted itself without intervention, but still: this highlights the fact that the default gateway selection and configuration algorithm is broken as it clearly makes sense to only consider healthy upstream gateways as candidates for default gateway (right?).

Cheers!
Title: Re: Failover glitches prevent it from working properly
Post by: mimugmail on January 19, 2020, 07:11:28 am
Did you also check the routing table? When you do PBR for OpenVPN where only some clients use it you usually click "Don't add routes", so this can't be chosen as a system default gateway. Maybe it's just a display error because no other gateway is online.

How did you set up gateway monitoring? Screenshots would be nice.
Title: Re: Failover glitches prevent it from working properly
Post by: drivera on January 19, 2020, 03:46:52 pm
Hi!

Gateway monitoring is set up using a script that basically polls netstat -rn every 0.01 seconds and records the result, reporting whenever the result varies from poll to poll.
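
A polling monitor of this kind might look roughly like the sketch below. It is not the actual script from this post: it assumes FreeBSD-style `netstat -rn` output (an `Internet:` section with a `default` row), and the function names and the `max_polls` parameter are made up for illustration.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, Optional


def parse_default_gateway(netstat_output: str) -> Optional[str]:
    """Return the IPv4 default gateway from `netstat -rn` text, or None if absent."""
    in_ipv4 = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # Section headers such as "Internet:" and "Internet6:"
            in_ipv4 = stripped == "Internet:"
            continue
        fields = stripped.split()
        if in_ipv4 and len(fields) >= 2 and fields[0] == "default":
            return fields[1]
    return None


def read_netstat() -> str:
    """Run `netstat -rn` and return its standard output."""
    result = subprocess.run(["netstat", "-rn"], capture_output=True, text=True, check=True)
    return result.stdout


def monitor(read: Callable[[], str] = read_netstat, interval: float = 0.01,
            max_polls: Optional[int] = None) -> None:
    """Poll the routing table and print a line whenever the default gateway changes."""
    print(f"{datetime.now():%Y/%m/%d %H:%M:%S}: Gateway monitoring Started")
    previous = object()  # sentinel so the first reading is always reported
    polls = 0
    while max_polls is None or polls < max_polls:
        current = parse_default_gateway(read())
        if current != previous:
            print(f"{datetime.now():%Y/%m/%d %H:%M:%S}: GATEWAY=[{current or 'empty'}]")
            previous = current
        polls += 1
        time.sleep(interval)
```

Calling `monitor()` with no arguments polls until interrupted. Spawning `netstat` every 0.01 seconds is fairly heavy, so a longer `interval` may be a reasonable trade-off if sub-second resolution is not needed.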

The "Don't add/remove routes" box is unchecked, but "Don't pull routes" is checked, and I added <pull-filter ignore "redirect-gateway"> to the advanced configuration so it wouldn't set itself as the default gateway when coming up.

I've now checked "Don't add/remove routes" and tested again, with no change in behavior.  This is the output log for the gateway monitor script:

Code:
$ ./monitor-gateway
2020/01/19 08:12:51: Gateway monitoring Started
2020/01/19 08:12:51: GATEWAY=[186.159.241.1]
<MANUALLY FORCED AN OUTAGE HERE>
2020/01/19 08:18:08: GATEWAY=[empty]
2020/01/19 08:18:08: GATEWAY=[192.168.200.1]
2020/01/19 08:18:42: GATEWAY=[empty]
2020/01/19 08:18:42: GATEWAY=[192.168.200.1]
2020/01/19 08:19:59: GATEWAY=[empty]
<MANUAL INTERVENTION VIA SAVE/APPLY-CHANGES>
2020/01/19 08:22:51: GATEWAY=[192.168.200.1]
<MANUAL TAKEDOWN OF NON-FUNCTIONAL VPN LINK>
2020/01/19 08:26:39: GATEWAY=[empty]
<MANUAL RESTART OF VPN LINK>
2020/01/19 08:27:14: GATEWAY=[192.168.200.1]
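
To put numbers on the gaps in a log like this one, a small parser can measure how long the default gateway stayed empty each time. This is only a sketch: it assumes the exact line format shown in the log above, and the function name is hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import List, Tuple

# Matches lines such as "2020/01/19 08:18:08: GATEWAY=[empty]"
LINE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}): GATEWAY=\[(.*)\]$")


def empty_intervals(log_text: str) -> List[Tuple[datetime, timedelta]]:
    """Return (start, duration) for each stretch where the log shows GATEWAY=[empty].

    A stretch that is still open at the end of the log is not reported.
    """
    intervals = []
    empty_since = None
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skips annotations and any other non-gateway lines
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        gateway = match.group(2)
        if gateway == "empty":
            if empty_since is None:
                empty_since = stamp
        elif empty_since is not None:
            intervals.append((empty_since, stamp - empty_since))
            empty_since = None
    return intervals
```

Run against the log above, this reports four empty stretches lasting 0, 0, 172 and 35 seconds (at the log's one-second resolution). The two long ones are the stretches that ended only after the manual save/apply and after the VPN link was restarted.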

Also, when I finally fixed the induced outage, I had to manually re-enable the main gateway to get it to work again, so no auto-failback this time :(

Clearly, something is amiss here...

Cheers.
Title: Re: Failover glitches prevent it from working properly
Post by: drivera on January 19, 2020, 04:46:47 pm
You asked for screenshots, but I'll do you one better: a sanitized (I hope :P) config.xml!

The configuration reflects the system's current state. Maybe you can spot what's wrong better than I.  The wiring is simple: two upstream links (one per ISP), one LAN link. I sanitized the users' passwords as well - you may have to copy that section from another working configuration.

Let me know what else I can provide to help debug this.

Thanks!
Title: Re: Failover glitches prevent it from working properly
Post by: mimugmail on January 19, 2020, 05:51:15 pm
Just screenshots, please; config.xml is overkill for debugging on mobile :)
Title: Re: Failover glitches prevent it from working properly
Post by: drivera on January 19, 2020, 09:50:26 pm
You can find a ZIP file with all the relevant screenshots (General Settings, Gateway page, individual gateways, Rules, NAT, Firewall Settings, and ExpressVPN client) at the link below.  It's ~2.1MB so I couldn't attach it here directly.

https://drive.google.com/file/d/1DW2tnGd7UNcqZQVm7Ig6d-_gTVt0AELi/view?usp=sharing

Cheers!
Title: Re: Failover glitches prevent it from working properly
Post by: mitchellp on January 21, 2020, 04:54:38 am
In screenshot 4, I'm not a fan of the rules for MAIN_GW and BKUP_GW. They won't be getting hit for anything useful; instead, the default route at the bottom is getting used, which does not have a fail-over gateway group.
I would either change those two rules to destination: ! LAN net, and move them under VPN stuff but above default route stuff, or change the default route to use a fail-over gateway group containing both gateways.
Title: Re: Failover glitches prevent it from working properly
Post by: drivera on January 21, 2020, 02:48:45 pm
I used to have a failover gateway group, but that caused a routing issue because gateway groups can't have "routeback" rules added to netfilter. That's why I changed it to how it is now. With a gateway group it would be impossible to reach the firewall via anything but the currently-active circuit, which was not the intent.

This is why I'm not using gateway groups.
Title: Re: Failover glitches prevent it from working properly
Post by: drivera on January 21, 2020, 03:34:17 pm
I had another outage today and I noticed that the MAIN interface's DHCP-provided (by the ISP) address wasn't getting reloaded/reapplied automatically upon recovery. I had to manually go in and click "reload". I realize this may be an ISP issue, though.  Still, maybe it's all related?

Cheers...