multi-wan failover problem

Started by hescominsoon, August 11, 2022, 09:29:11 PM

I think there are different issues getting mixed up.

1.)
There seems to be an issue since 22.7 - at least for me - with the primary WAN gateway staying offline when the primary WAN interface is up again. The reason appears to be that the host route for the monitoring IP doesn't get added as soon as the interface is up again. I resolved this by adding the host routes for the monitoring IPs manually. This fixed the issue for me and the default route switches back to the gateway of the primary WAN link (with "Allow default gateway switching" ticked). Does this issue exist for you as well?
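For reference, a minimal sketch of what such a manual host route could look like from the shell. All addresses are placeholders from the documentation ranges (assumptions, not values from this setup). A route added this way does not survive a reboot, so a static route configured in the GUI is the more durable form of the same workaround. The script only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch: pin each gateway monitor IP to its WAN gateway with a host route.
# All addresses are placeholders (documentation ranges); substitute your own.

# Print the FreeBSD route(8) command for one monitor IP / gateway pair.
host_route_cmd() {
    monitor_ip="$1"
    gateway_ip="$2"
    echo "/sbin/route add -host ${monitor_ip} ${gateway_ip}"
}

# Print the command by default; only execute it when APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        $1
    else
        echo "would run: $1"
    fi
}

run "$(host_route_cmd 198.51.100.1 192.0.2.1)"     # WAN1 monitor via WAN1 gateway
run "$(host_route_cmd 198.51.100.2 203.0.113.1)"   # WAN2 monitor via WAN2 gateway
```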

2.)
Quote from: ProximusAl on August 13, 2022, 05:06:29 PM
I just assumed OPNSense never did it, but did think it strange.
I guess what you are mentioning here is that the states are kept. With earlier releases of OPNsense it was possible to untick "Disable State Killing on Gateway Failure". However, this setting no longer exists. See here: https://forum.opnsense.org/index.php?topic=28179. I went the same route as you and wrote a script to handle this (and some other things) as soon as the default gateway switches. Aside from running it in a cron job, you can place it in /usr/local/etc/rc.syshook.d/monitor/ to have it run on monitor events. See here: https://docs.opnsense.org/development/backend/autorun.html. I don't think you need to down the interface. I use pfctl -k <wan_ip> to kill the states (where wan_ip is the IP of WAN2 gateway after switching back to WAN1) and it works for me. Flushing all states doesn't seem necessary.
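A minimal sketch of such a hook, assuming it is saved as an executable file under /usr/local/etc/rc.syshook.d/monitor/ (the file name is up to you). WAN1_GW and STALE_IP are placeholders, not values from this thread, and the script ignores any arguments the monitor hook may pass. It only kills states once the default route is back on WAN1, and it only prints its decision unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch of a monitor hook: once the default route is back on WAN1,
# kill the states that still belong to the WAN2 address.
# WAN1_GW and STALE_IP are placeholders, not values from this thread.

WAN1_GW="${WAN1_GW:-192.0.2.1}"
STALE_IP="${STALE_IP:-203.0.113.10}"

# Extract the gateway from `route -n get default` output read on stdin.
parse_gateway() {
    awk '$1 == "gateway:" { print $2 }'
}

# Build the pfctl command that kills states originating from one host.
kill_states_cmd() {
    echo "/sbin/pfctl -k $1"
}

# Print the command to run for a given current default gateway
# (prints nothing if the default route is not on WAN1).
on_monitor_event() {
    current_gw="$1"
    if [ "$current_gw" = "$WAN1_GW" ]; then
        kill_states_cmd "$STALE_IP"
    fi
}

# Only touch the system when APPLY=1; otherwise just print the decision.
if [ "${APPLY:-0}" = "1" ]; then
    cmd="$(on_monitor_event "$(/sbin/route -n get default | parse_gateway)")"
    if [ -n "$cmd" ]; then
        $cmd
    fi
else
    echo "dry run: $(on_monitor_event "$WAN1_GW")"
fi
```

Checking the default gateway first avoids killing WAN2 states while WAN2 is the active link, which a hook that fires on every monitor event would otherwise do.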

August 13, 2022, 06:36:57 PM #16 Last Edit: August 13, 2022, 06:45:30 PM by ProximusAl
Thanks for replying...

So on 1)

Not been an issue for me. When the interface is back up, new connections start to go back to the primary WAN. I did not have to create any static routes manually; it was all taken care of, with 8.8.8.8 (WAN1) and 8.8.4.4 (WAN2) as the monitor IPs.

2)

I can try this, but I definitely had issues with states "coming back" on the wrong interface (after just killing the states on that interface), which is why I went for the nuclear option. What I found is that OPNsense itself continued down WAN2 for DNS lookups etc. even though WAN1 was up and running, and the only way I found to fail it back was downing the interface. EDIT: I think it was WireGuard that kept its state on WAN2 using kmod

Thanks for answering!

1) Well, that's interesting. Have you ticked "Allow default gateway switching"?

2) Ok, it seems to work for me. How did you try to kill them? I guess you can't use the interface with pfctl (haven't tried yet) as the states are floating by default (can be changed by "Bind states to interface").

Quote from: ProximusAl on August 13, 2022, 06:36:57 PM
EDIT: I think it was WireGuard that kept its state on WAN2 using kmod

You mean traffic coming from wg clients kept being routed to the internet via WAN2?

1) Yes, I had to have "Allow default gateway switching" enabled.

2) To be honest I can't remember, it possibly was using your method, but I distinctly remember WireGuard and SIP (VoIP) always coming back on the second WAN which for me is 5G mobile data, so that's a no go for me.

Funnily enough, we had a power cut this morning, and the irony of it is, although my WAN1 is on a UPS, the DOCSIS cabinet on the other end clearly isn't, as it goes bye bye. Everything worked and failed over, and the kid could still play Roblox, but when the power returned, my script kicked him off Roblox as it pushed him back to WAN1 :D

My method, although it "feels dirty", does work. My previous EdgeRouter did handle it a bit better, but to be fair, I'm glad to be shot of the EdgeRouter now. OPNsense just works a treat for me, and I've upgraded my entire internal network to 2.5G now (the EdgeRouter was 1Gb only)

Quote from: tcpip on August 13, 2022, 06:53:56 PM
Quote from: ProximusAl on August 13, 2022, 06:36:57 PM
EDIT: I think it was WireGuard that kept its state on WAN2 using kmod

You mean traffic coming from wg clients kept being routed to the internet via WAN2?

No, I think WireGuard kept listening on WAN2, rather than WAN1.

My SIP phone definitely kept going outbound via WAN2 even though I kept killing its state. Kept coming back.

What would happen if I didn't kill all the states in my script but instead just downed the WAN2 interface?

Could you see any issues with that?

I never thought of trying that at all.

Quote from: ProximusAl on August 13, 2022, 06:56:24 PM
My method, although it "feels dirty", does work

I guess it's fine as long as it works for you :D

Quote from: ProximusAl on August 13, 2022, 07:08:46 PM
What would happen if I didn't kill all the states in my script but instead just downed the WAN2 interface?

Could you see any issues with that?

I don't see any issues with that, it just seems to be the sledgehammer approach for just killing states. Downing an interface does a bit more if you look into interface_bring_down function in /usr/local/etc/inc/interfaces.inc (I suppose this is where the magic happens). To clear the states it runs /sbin/pfctl -i <interface> -Fs (so it seems to work with the interface parameter). But whatever works for you.
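As a rough sketch, the interface-scoped flush quoted above could be wrapped like this. "igb1" is a placeholder interface name, not taken from this thread, and whether the interface filter matches floating states is a question this thread leaves open. The script prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch: build the interface-scoped state flush quoted above.
# "igb1" is a placeholder interface name, not taken from this thread.

flush_iface_states_cmd() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: flush_iface_states_cmd <interface>" >&2
        return 1
    fi
    echo "/sbin/pfctl -i ${iface} -Fs"
}

# Print the command by default; run it only when APPLY=1 is set.
cmd="$(flush_iface_states_cmd igb1)"
if [ "${APPLY:-0}" = "1" ]; then
    $cmd
else
    echo "would run: $cmd"
fi
```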

hmmm may be same issue
https://forum.opnsense.org/index.php?topic=29757.0

once the WAN link goes down, or stays down for a long time, it seems to be tagged as down indefinitely

So it seems 22.7 needs some work. It's either a BSD issue, a middleware issue, or a combination of the two. This unfortunately means we will be leaving a brand new OPNsense firewall at 22.1 forever... when and IF this issue gets fixed we might try going forward. It's also strange that it generates a nearly 5 second outage going either way in 22.7 when it's nearly instant on 22.1.

Quote from: tong2x on August 14, 2022, 11:43:06 AM
hmmm may be same issue
https://forum.opnsense.org/index.php?topic=29757.0

once the WAN link goes down, or stays down for a long time, it seems to be tagged as down indefinitely

It could be the same issue. Have you checked the routes?
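One way to check, sketched under the assumption that your monitor IP is 8.8.8.8 and that "igb0" (a placeholder name) is the interface you expect it to leave through. The helper reads `route -n get` output on stdin and reports whether the host route points where you expect.

```shell
#!/bin/sh
# Sketch: report which interface a monitor IP is currently routed through.
# Reads `route -n get <ip>` output on stdin, so it can be tested offline.

route_interface() {
    awk '$1 == "interface:" { print $2 }'
}

# Compare the routed interface with the one you expect for that monitor IP.
check_monitor_route() {
    expected_if="$1"
    actual_if="$(route_interface)"
    if [ "$actual_if" = "$expected_if" ]; then
        echo "ok: routed via $actual_if"
    else
        echo "mismatch: expected $expected_if, got ${actual_if:-none}"
    fi
}
```

Usage on the firewall would be along the lines of `route -n get 8.8.8.8 | check_monitor_route igb0`. A mismatch after the primary WAN comes back would point to the missing host route described earlier in the thread.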

Quote from: hescominsoon on August 14, 2022, 07:42:32 PM
So it seems 22.7 needs some work. It's either a BSD issue, a middleware issue, or a combination of the two. This unfortunately means we will be leaving a brand new OPNsense firewall at 22.1 forever... when and IF this issue gets fixed we might try going forward. It's also strange that it generates a nearly 5 second outage going either way in 22.7 when it's nearly instant on 22.1.

If you're facing the gateway issue I described before, configuring static routes should serve as a workaround. If this isn't the issue you're facing, I didn't understand your problem. However, I guess the gateway issue will be resolved soon: https://github.com/opnsense/core/issues/5956. OPNsense is great and I have a lot of respect for the devs.

I think it is. What I'm doing now is clicking edit in gateways and changing nothing, just to get the monitor IP to go online.
I will try the static route approach, as it is bothersome to keep doing this.

Hope we don't have to wait long for the patch/fix.
Thanks

Quote from: tong2x on August 15, 2022, 12:32:43 AM
I think it is. What I'm doing now is clicking edit in gateways and changing nothing, just to get the monitor IP to go online.
I will try the static route approach, as it is bothersome to keep doing this.

Hope we don't have to wait long for the patch/fix.
Thanks

Ironically, I just had a real-world test of this. Power went out. I had battery backups on my main internet connection and OPNsense but not my failover connection.

When everything came back up, the failover status never changed back to "online". I did exactly what you did: edited the "System:Gateways:Single" listing, made no changes and just saved. Voila, back online.

Surely this can't be by design?

August 19, 2022, 05:56:24 AM #28 Last Edit: August 19, 2022, 05:59:38 AM by tong2x
No, it is a bug in 22.x. Something must have changed in the code. There is already a patch, but it has not yet been included in 22.7.2.

# opnsense-patch e8d42b6
Patch created by @franco.
It needs to be executed in the console. I have already applied it and it seems to have fixed the issue in my test.
franco said it will be included in 22.7.3, pending test reports.


https://github.com/opnsense/core/issues/5956

Cycling the interface should not be necessary.