OPNsense Forum

Archive => 22.7 Legacy Series => Topic started by: hescominsoon on August 11, 2022, 09:29:11 PM

Title: multi-wan failover problem
Post by: hescominsoon on August 11, 2022, 09:29:11 PM
In the previous release I could unplug one WAN port and the system would fail over to WAN 2 without issue. When the primary was restored it would fail back to the primary. This is NOT happening in the newest release. This has apparently been a bug before, but now, short of rebooting the firewall, it will not restore states back to the primary after a failover to the secondary. Any ideas?
Title: Re: multi-wan failover problem
Post by: axsdenied on August 11, 2022, 10:08:28 PM
I actually plan setting this configuration up over the weekend.  Will post here if I run into the same issue.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 11, 2022, 10:42:07 PM
So, after further testing:

This is on official hardware, the DEC3850.
If we either pull the primary WAN connection physically or disable it in the web GUI, failover to the secondary takes 5 seconds. It used to be instant. Another wrinkle: when the primary is restored, it refuses to switch back. Hitting save on an interface has no effect. Disabling the secondary forces the switch back, at the cost of a 5 second loss of connectivity; otherwise a reboot is required.
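To put a number on the outage described above, one option is to probe a host behind the firewall once per second and measure the longest run of failures. This is a rough sketch only: the helper names `probe_loop` and `longest_outage` and the log format are assumptions made for illustration, and the resolution is no better than the probe interval plus the ping timeout.

```shell
#!/bin/sh
# Sketch only: log reachability once per second, then compute the longest outage.

# Print "<epoch-seconds> <up|down>" for the given host until interrupted.
probe_loop() {
    while :; do
        if ping -c 1 "$1" >/dev/null 2>&1; then s=up; else s=down; fi
        echo "$(date +%s) $s"
        sleep 1
    done
}

# Read "<epoch-seconds> <up|down>" lines on stdin and print the longest outage
# in seconds, measured from the first "down" sample to the next "up" sample.
longest_outage() {
    awk '
        $2 == "down" && start == "" { start = $1 }
        $2 == "up" && start != "" {
            d = $1 - start
            if (d > max) max = d
            start = ""
        }
        END { print max + 0 }'
}

# Example (run from a LAN client, stop with Ctrl-C after the test):
#   probe_loop 1.1.1.1 > probe.log
#   longest_outage < probe.log
```

Running the same measurement on 22.1 and 22.7 would give comparable figures for the "instant" versus "5 seconds" observation.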
Title: Re: multi-wan failover problem
Post by: axsdenied on August 12, 2022, 12:13:19 AM
I assume the monitor IPs are set up correctly so that it knows to switch back? And you set the thresholds?
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 12, 2022, 05:45:48 PM
Everything worked correctly in the previous version. Only upon upgrading to 22.7 did it break.
Title: Re: multi-wan failover problem
Post by: tcpip on August 12, 2022, 06:34:19 PM
I also ran into this issue. Try setting static routes for the monitored IPs via the corresponding gateway. This solved it for me.
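For anyone unfamiliar with the suggestion above, here is a minimal sketch of what a per-gateway host route looks like at the FreeBSD shell level. The gateway addresses (192.0.2.1, 198.51.100.1) and the helper name `monitor_route_cmd` are placeholders, not values from this thread; the monitor IPs are the ones mentioned further down. On OPNsense the persistent equivalent would normally be a static route entry in the web GUI, so the helper only prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: pin each monitor IP to its own WAN gateway with a host route.
# The gateway addresses below are documentation placeholders (assumptions).

# Print (or, with APPLY=1, run) the FreeBSD command that adds a host route.
monitor_route_cmd() {
    monitor_ip="$1"
    gateway_ip="$2"
    if [ "${APPLY:-0}" = "1" ]; then
        route add -host "$monitor_ip" "$gateway_ip"
    else
        echo "route add -host $monitor_ip $gateway_ip"
    fi
}

monitor_route_cmd 1.1.1.1 192.0.2.1      # WAN1 monitor via the WAN1 gateway
monitor_route_cmd 8.8.8.8 198.51.100.1   # WAN2 monitor via the WAN2 gateway
```

The point of the host routes is that each monitor probe keeps leaving through its own WAN, so the monitor for a recovered link can see it come back.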
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 12, 2022, 11:33:47 PM
I'm not familiar with that. What parameters would I use in the static route?
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 12, 2022, 11:53:33 PM
Quote from: hescominsoon on August 12, 2022, 11:33:47 PM
I'm not familiar with that. What parameters would I use in the static route?
So, if I am reading this correctly: set a static route on each gateway to its monitoring IP address.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 13, 2022, 12:15:47 AM
trying to figure out why this is not working..:)
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 13, 2022, 12:33:39 AM
I think I found the issue.
WAN1 has a route set to the monitoring IP of 1.1.1.1 AND the IP of 8.8.8.8, even though 8.8.8.8 is clearly set up for monitoring on WAN2. Looks like a bug either in FreeBSD or in the OPNsense code.

Title: Re: multi-wan failover problem
Post by: hescominsoon on August 13, 2022, 12:37:39 AM
Duh, forgot the CIDR notation. Got it.
Title: Re: multi-wan failover problem
Post by: tcpip on August 13, 2022, 01:10:20 AM
I think the issue is that the route for the monitoring IP of the WAN link gets removed as soon as the link is down. Therefore the monitoring checks don't work anymore. At least this is the case when I disconnect my primary WAN link. Setting the routes manually seems to be a decent workaround. However, I agree that it looks like a bug. I haven't found time yet to dig deeper into the issue and file an issue on GitHub.

How is your multi WAN setup configured? Do you just use gateway switching or employ the gateway groups? Keep in mind that switching back from WAN2 to WAN1 does not force all existing connections to switch back. The pf states are kept.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 13, 2022, 03:20:50 AM
Gateway groups as failover. What's weird is that in 22.1 it would fail back to the primary IP after about a minute. I didn't have to do anything. In 22.7 I now have to either forcibly disable the secondary WAN or reboot the firewall for it to fail back. If this not going back to the primary is expected behavior, this is not the solution for me and my clients, and I will have to go back to another product.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 13, 2022, 04:55:06 AM
Quote from: hescominsoon on August 13, 2022, 03:20:50 AM
Gateway groups as failover. What's weird is that in 22.1 it would fail back to the primary IP after about a minute. I didn't have to do anything. In 22.7 I now have to either forcibly disable the secondary WAN or reboot the firewall for it to fail back. If this not going back to the primary is expected behavior, this is not the solution for me and my clients, and I will have to go back to another product.

We installed 22.1 back on the appliance and restored the config. It now reverts back to primary within seconds, as verified by ipchicken on a desktop behind the OPNsense. No need for a static route either. I think 22.7 needs a ton of work at this point.
Title: Re: multi-wan failover problem
Post by: ProximusAl on August 13, 2022, 05:06:29 PM
Interesting. I'm new to OPNSense and started with 22.7 beta.

You can see in my post here how I deal with the failover back to primary.

https://forum.opnsense.org/index.php?topic=29749.0

I just assumed OPNSense never did it, but did think it strange.

Maybe I should have started with 22.1, but I guess if I did that, I'd have the same issue as you (Technically still do)

EDIT: I should clarify "NEW" connections do use the primary IP when it's back, but OPNsense itself is reluctant to fail back, hence why I down the interface.
Title: Re: multi-wan failover problem
Post by: tcpip on August 13, 2022, 06:28:23 PM
I think there are different issues getting mixed up.

1.)
There seems to be an issue since 22.7 - at least for me - with the primary WAN gateway staying offline when the primary WAN interface is up again. The reason appears to be that the host route for the monitoring IP doesn't get added as soon as the interface is up again. I resolved this by adding the host routes for the monitoring IPs manually. This fixed the issue for me and the default route switches back to the gateway of the primary WAN link (with "Allow default gateway switching" ticked). Does this issue exist for you as well?

2.)
Quote from: ProximusAl on August 13, 2022, 05:06:29 PM
I just assumed OPNSense never did it, but did think it strange.
I guess what you are mentioning here is that the states are kept. With earlier releases of OPNsense it was possible to untick "Disable State Killing on Gateway Failure". However, this setting does not exist anymore. See https://forum.opnsense.org/index.php?topic=28179 for details. I went the same route as you and wrote a script to handle this (and some other things) as soon as the default gateway switches. Aside from running this in a cron job, you can place it in /usr/local/etc/rc.syshook.d/monitor/ to be run on monitor events. See https://docs.opnsense.org/development/backend/autorun.html for details. I think you don't need to down the interface. I use pfctl -k <wan_ip> to kill the states (where wan_ip is the IP of WAN2 gateway after switching back to WAN1) and it works for me. Flushing all states seems not necessary.
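A rough sketch of the kind of hook script described above, under several assumptions: PRIMARY_GW and WAN2_IP are placeholder addresses, the helper names are made up for illustration, and the script works out the current default gateway itself from `route -n get default` rather than relying on whatever arguments the monitor hook may pass. By default it only prints the pfctl command; it touches the live system only when APPLY=1 is set.

```shell
#!/bin/sh
# Sketch of a monitor hook: once the default route is back on WAN1, kill the
# states tied to the WAN2 address so traffic re-establishes via WAN1.
# PRIMARY_GW and WAN2_IP are placeholders (assumptions), not thread values.
PRIMARY_GW="${PRIMARY_GW:-192.0.2.1}"
WAN2_IP="${WAN2_IP:-198.51.100.2}"

# Extract the gateway address from `route -n get default` output on stdin.
parse_default_gateway() {
    awk '$1 == "gateway:" { print $2; exit }'
}

# Decide what to do for a given current gateway. Prints the pfctl command
# unless APPLY=1 is set, in which case it runs it.
handle_gateway() {
    current_gw="$1"
    if [ "$current_gw" != "$PRIMARY_GW" ]; then
        echo "default route not on primary; nothing to do"
        return 0
    fi
    if [ "${APPLY:-0}" = "1" ]; then
        /sbin/pfctl -k "$WAN2_IP"
    else
        echo "pfctl -k $WAN2_IP"
    fi
}

# Only read the live routing table and kill states when explicitly asked to.
if [ "${APPLY:-0}" = "1" ]; then
    handle_gateway "$(route -n get default | parse_default_gateway)"
fi
```

Placed as an executable file in /usr/local/etc/rc.syshook.d/monitor/, something like this would run on monitor events; keeping the kill limited to one address is the difference from flushing all states.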
Title: Re: multi-wan failover problem
Post by: ProximusAl on August 13, 2022, 06:36:57 PM
Thanks for replying...

So on 1)

Not been an issue for me. When the interface is back up, new connections start to go back to primary wan. I did not have to create any static routes manually, all taken care of, 8.8.8.8 WAN1 and 8.8.4.4 WAN2 on the monitors.

2)

I can try this, but I definitely had issues with states "coming back" (after just killing the states on that interface) on the wrong interface which is why I went for the nuclear option. What I found is that OPNSense itself continued down WAN2 even though WAN1 was up and running for DNS lookups etc, and the only way I found to fail it back was downing the interface.  EDIT: I think it was WireGuard that kept its state on WAN2 using kmod
Title: Re: multi-wan failover problem
Post by: tcpip on August 13, 2022, 06:49:07 PM
Thanks for answering!

1) Well, that's interesting. Have you ticked "Allow default gateway switching"?

2) Ok, it seems to work for me. How did you try to kill them? I guess you can't use the interface with pfctl (haven't tried yet) as the states are floating by default (can be changed by "Bind states to interface").
Title: Re: multi-wan failover problem
Post by: tcpip on August 13, 2022, 06:53:56 PM
Quote from: ProximusAl on August 13, 2022, 06:36:57 PM
EDIT: I think it was WireGuard that kept its state on WAN2 using kmod

You mean traffic coming from wg clients kept being routed to the internet via WAN2?
Title: Re: multi-wan failover problem
Post by: ProximusAl on August 13, 2022, 06:56:24 PM
Yes, I had to have allow dgw switching.

2) To be honest I can't remember, it possibly was using your method, but I distinctly remember WireGuard and SIP (VoIP) always coming back on the second WAN which for me is 5G mobile data, so that's a no go for me.

Funnily enough, we had a power cut this morning, and the irony of it is, although my WAN1 is on a UPS, the DOCSIS cabinet on the other end clearly doesn't, as it goes bye bye. Everything worked, failed over, kid could still play roblox, but when the power returned, my script kicked him off roblox as it pushed him back to WAN1 :D

My method, although it "feels dirty", does work. My previous EdgeRouter did handle it a bit better, but to be fair, I'm glad to be shot of the EdgeRouter now. OPNsense just works a treat for me, and I've upgraded my entire internal network to 2.5G now (EdgeRouter 1Gb only)
Title: Re: multi-wan failover problem
Post by: ProximusAl on August 13, 2022, 06:58:43 PM
Quote from: tcpip on August 13, 2022, 06:53:56 PM
Quote from: ProximusAl on August 13, 2022, 06:36:57 PM
EDIT: I think it was WireGuard that kept its state on WAN2 using kmod

You mean traffic coming from wg clients kept being routed to the internet via WAN2?

No, I think WireGuard kept listening on WAN2, rather than WAN1.

My SIP phone definitely kept going outbound via WAN2 even though I kept killing its state. Kept coming back.
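One way to check whether killed states really come back on WAN2 is to count the states that reference the WAN2 address before and after the kill. A small sketch, assuming `pfctl -ss` prints one state per line with addresses written as ip:port (NAT addresses in parentheses) and IPv4 only; the helper name `count_states_for_ip` and the sample address are illustrative, not from this thread.

```shell
#!/bin/sh
# Sketch only: count pf states that mention a given IPv4 address.

# Read `pfctl -ss` style lines on stdin and print how many of them contain
# the address given as $1 (exact match, with or without a :port suffix).
count_states_for_ip() {
    awk -v ip="$1" '
        {
            for (i = 1; i <= NF; i++) {
                f = $i
                gsub(/[()]/, "", f)       # NAT addresses appear in parentheses
                sub(/:[0-9]+$/, "", f)    # drop the trailing :port
                if (f == ip) { n++; break }
            }
        }
        END { print n + 0 }'
}

# Example on the firewall (198.51.100.2 stands in for the WAN2 address):
#   pfctl -ss | count_states_for_ip 198.51.100.2
```

If the count drops to zero after a kill and then climbs again while WAN1 is up, something is still choosing WAN2 for new connections, which would be a routing question rather than a stale-state one.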
Title: Re: multi-wan failover problem
Post by: ProximusAl on August 13, 2022, 07:08:46 PM
What would happen if I didn't kill all the states in my script but instead just downed the WAN2 interface?

Could you see any issues with that?

I never thought of trying that at all.
Title: Re: multi-wan failover problem
Post by: tcpip on August 13, 2022, 11:22:08 PM
Quote from: ProximusAl on August 13, 2022, 06:56:24 PM
My method, although "feels dirty" does work

I guess it's fine as long as it works for you :D

Quote from: ProximusAl on August 13, 2022, 07:08:46 PM
What would happen if I didn't kill all the states in my script but instead just downed the WAN2 interface?

Could you see any issues with that?

I don't see any issues with that, it just seems to be the sledgehammer approach for just killing states. Downing an interface does a bit more if you look into interface_bring_down function in /usr/local/etc/inc/interfaces.inc (I suppose this is where the magic happens). To clear the states it runs /sbin/pfctl -i <interface> -Fs (so it seems to work with the interface parameter). But whatever works for you.
Title: Re: multi-wan failover problem
Post by: tong2x on August 14, 2022, 11:43:06 AM
hmmm may be same issue
https://forum.opnsense.org/index.php?topic=29757.0

Once the WAN link is down, or down for a long time, it seems to be tagged as down indefinitely.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 14, 2022, 07:42:32 PM
So it seems 22.7 needs some work. It's either a BSD issue, a middleware issue or a combination of the two. This unfortunately means we will be leaving a brand new OPNsense firewall at 22.1 forever. When and IF this issue gets fixed we might try going forward. It's also strange that it generates a nearly 5 second outage going either way in 22.7 when it's nearly instant on 22.1.
Title: Re: multi-wan failover problem
Post by: tcpip on August 14, 2022, 08:47:59 PM
Quote from: tong2x on August 14, 2022, 11:43:06 AM
hmmm may be same issue
https://forum.opnsense.org/index.php?topic=29757.0

Once the WAN link is down, or down for a long time, it seems to be tagged as down indefinitely.

It could be the same issue. Have you checked the routes?

Quote from: hescominsoon on August 14, 2022, 07:42:32 PM
So it seems 22.7 needs some work. It's either a BSD issue, a middleware issue or a combination of the two. This unfortunately means we will be leaving a brand new OPNsense firewall at 22.1 forever. When and IF this issue gets fixed we might try going forward. It's also strange that it generates a nearly 5 second outage going either way in 22.7 when it's nearly instant on 22.1.

If you're facing the gateway issue I described before, configuring static routes should serve as a workaround. If this isn't the issue you're facing, I didn't understand your problem. However, I guess the gateway issue will be resolved soon: https://github.com/opnsense/core/issues/5956 OPNsense is great and I have a lot of respect for the devs.
Title: Re: multi-wan failover problem
Post by: tong2x on August 15, 2022, 12:32:43 AM
I think it is. What I'm doing now is clicking edit in gateways and changing nothing, for the monitor IP to go online.
Will try that static route approach, as it is bothersome to keep doing it.

Hope we don't have to wait long for the patch/fix.
Thanks
Title: Re: multi-wan failover problem
Post by: axsdenied on August 19, 2022, 04:19:00 AM
Quote from: tong2x on August 15, 2022, 12:32:43 AM
I think it is. What I'm doing now is clicking edit in gateways and changing nothing, for the monitor IP to go online.
Will try that static route approach, as it is bothersome to keep doing it.

Hope we don't have to wait long for the patch/fix.
Thanks

Ironically, I just had a real-world test of this.  Power went out.  I had battery backups on my main internet connection and OPNsense but not my failover connection.

When everything came back up, the failover status never changed back to "online". I did exactly what you did; edited the "System:Gateways:Single" listing; made no changes and just saved.  Voila, back online.

Surely this can't be by design?
Title: Re: multi-wan failover problem
Post by: tong2x on August 19, 2022, 05:56:24 AM
No, it is a bug in 22.x. Something must have changed in the code. There is already a patch, but it has not yet been included in 22.7.2.

Quote
# opnsense-patch e8d42b6
Patch created by @franco.
It needs to be executed in the console. I have already applied it and it seems to have fixed the issue in my test.
franco said it will be included in 22.7.3, pending test reports.


https://github.com/opnsense/core/issues/5956
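Since the fix is said to land in 22.7.3, a script that applies the patch could first check the running version. A sketch, assuming the version string looks like 22.7.2 or 22.7.10_2 and that `opnsense-version -v` prints it (treat that flag as an assumption to verify on your box); `version_lt` and `patch_hint` are hypothetical helpers, and the script only prints a suggestion rather than applying anything.

```shell
#!/bin/sh
# Sketch only: suggest the patch when the running version predates 22.7.3.

# Succeed (exit 0) if dotted version $1 is lower than $2.
# A package revision suffix such as _2 is ignored.
version_lt() {
    awk -v a="${1%%_*}" -v b="${2%%_*}" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0
            yi = y[i] + 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1
    }'
}

# Print the patch command from the post above, or a note that the release
# should already contain the fix.
patch_hint() {
    if version_lt "$1" 22.7.3; then
        echo "opnsense-patch e8d42b6"
    else
        echo "fix expected to be included already"
    fi
}

# Only query the version where the tool exists (i.e. on an OPNsense box).
if command -v opnsense-version >/dev/null 2>&1; then
    patch_hint "$(opnsense-version -v)"
fi
```

The comparison is numeric per component, so 22.7.10 is correctly treated as newer than 22.7.3.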
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 19, 2022, 11:09:34 AM
That should not be necessary in terms of cycling the interface. 
Title: Re: multi-wan failover problem
Post by: franco on August 19, 2022, 11:23:15 AM
Quote from: hescominsoon on August 19, 2022, 11:09:34 AM
That should not be necessary in terms of cycling the interface.

Your lack of context is staggering, but thanks anyway for this comment.

More work on the ticket was done. Thanks for all the feedback and testing. :)


Cheers,
Franco
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 19, 2022, 11:59:25 AM
You are welcome. My lack of feedback is because I am not running 22.7 in a failover environment now, as I cannot afford to be testing the software in said environment; the client will not put up with that. 22.1 works as the use case calls for, and as the thread shows, 22.7 doesn't (edited: not in its current release form). I DO have 22.7 running here with a single WAN and that, of course, works perfectly. My statement that you should not have to cycle the interface goes all the way back to my first post about this. If I had the ability to test this further, I would. I don't, so I won't. OPNsense is a good product, but this one issue burned me badly when all other times it worked perfectly. Normally I upgrade here, beat on it, and deploy. I tested it here in a similar but not exact environment, then upgraded to 22.7 on the 3850 on the client firewall, only to have OPNsense fall flat and continue to do so. In order to fix it, a reinstall to 22.1 was required. Due to the lack of a VGA port that machine will stay on 22.1, probably forever, until I can guarantee failover works as it should. At this point what is probably going to happen is I will have to replace that firewall, at my expense, with a pfSense box that I KNOW does failover correctly. I wanted to move to OPNsense for my critical business clients due to pfSense's well stated intentions of going closed source and paid only. It looks like for non-critical applications I will continue to use OPNsense, but for other applications it's either pfSense or something else. This regression has caused me to look like an idiot to my partner AND also to the client we spent many hours with trying to get rid of a SonicWall firewall to replace it with this 3850. I am now contemplating having to buy the hardware from my partner and eat several hours of time to get the SonicWall working, or replace the 3850 with a pfSense machine with TAC.

I am not saying opnsense is a bad product but this failover issue left me looking like a complete idiot.  I have never had a firewall upgrade blow up this badly..in full view of both my partner and a major client..at the same time. 

I appreciate the entire OPNsense team's time and efforts, and I know this will be resolved eventually. This is going to cost me a good deal of money, either in having to replace the hardware with something else OR in eating many hours of time trying to convince the partner and client this is a viable solution for their needs.

I actually have two custom OPNsense firewalls I am configuring for a different client who does not have a failover requirement, and will happily deploy those firewalls (on custom hardware). I will continue to run OPNsense here at my office as well. OPNsense is a solid product but this incident has made me change my use cases for the product.

i hope that provides the context you are looking for franco. 

(Edit: noticed the patch...great job..just cannot test it at the client as reinstalling from the serial console is really a pain..if it had a vga port...Once .3 is released i'll test here at my office in my multi-wan setup and then if it works right..we MIGHT decide to upgrade the 3850....)
Title: Re: multi-wan failover problem
Post by: franco on August 19, 2022, 12:24:13 PM
Fair enough. And I think 22.7 is a little premature if you want to indeed not "look like an idiot" in a business setting. This is NOT intended as a fire and forget replacement at this stage. Sometimes early releases can be, but this one may not be. It's been 3 weeks since release. It lasts for over 5 more months. I'm sure you know how this works.


Cheers,
Franco
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 19, 2022, 04:25:56 PM
I know how it works now....not too long ago release meant release...not it's kinda done but the users are the final beta test which has infested the rest of the software community.  Firewalls and other critical infrastructure..imo...should hold themselves to a higher standard and open source ones used to hold themselves to even higher ones.  You know what they say about assumptions...Assumptions always bite you in the ass.
Title: Re: multi-wan failover problem
Post by: Vesalius on August 19, 2022, 04:39:57 PM
Quote from: hescominsoon on August 19, 2022, 04:25:56 PM
I know how it works now....not too long ago release meant release...not it's kinda done but the users are the final beta test which has infested the rest of the software community.  Firewalls and other critical infrastructure..imo...should hold themselves to a higher standard and open source ones used to hold themselves to even higher ones.  You know what they say about assumptions...especially code quality across the entire spectrum now...Assumptions always bite you in the ass.
Trying to understand why you would not use the free 22.4 1-year business license that comes with your hardware purchase. It's there to keep you, your business, your customers and OPNsense safe. Let free home users beat on 22.7 for many months until OPNsense is convinced all the unforeseen show stoppers are cleaned up. Then you will be justified in coming at them if you experience this in the business release.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 19, 2022, 04:50:13 PM
read my previous post.  if you cannot understand why from that...i cannot assist any further..:)
Title: Re: multi-wan failover problem
Post by: tong2x on August 19, 2022, 06:06:50 PM
Well, at least we learned something, right?
Never put new releases in production without testing...
If you do... better have a backup plan...
There are reasons why there are users still on older 21.x or 20.x releases.

Everyone knows that all releases will always have bugs. The thing is, OPNsense/franco is at it, solving/fixing the bug.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 19, 2022, 07:26:24 PM
for today's software...yes...unfortunately release is no longer the actual release.
Title: Re: multi-wan failover problem
Post by: Vesalius on August 19, 2022, 11:51:59 PM
Quote from: hescominsoon on August 19, 2022, 07:26:24 PM
for today's software...yes...unfortunately release is no longer the actual release.
Been that way from the start when a company has a free early consumer release and a delayed business release. One of those 2 is more battle-tested at the expense of the other.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 20, 2022, 01:48:28 PM
The summary is, I am going with a different product in the future. 
Title: Re: multi-wan failover problem
Post by: tcpip on August 20, 2022, 02:32:22 PM
What prevents you from using the business release in a business context?

Also, the community releases are usually perfectly stable for any home deployment. If you have an issue, just report it on GitHub.
Title: Re: multi-wan failover problem
Post by: hescominsoon on August 21, 2022, 07:32:58 PM
allow me to restate...the lack of support from Deciso during US business hours...hence the move to another vendor for business clients.

(original text: lack of support hours available from Deciso in the US...hence the move to another vendor for business clients.)
Title: Re: multi-wan failover problem
Post by: tong2x on August 23, 2022, 03:04:30 AM
the patch works, and I did not pay anything for it...
that should be the main point now
Yes, it did take a day or 2 and a report by @tcpip on GitHub.

Anyway, this is about solving the multi-WAN failover / gateway fail problem, and it is solved.
We also learned NOT to put a first release/community version on production servers without testing.
Title: Re: multi-wan failover problem
Post by: franco on August 23, 2022, 01:56:23 PM
Quote from: hescominsoon on August 21, 2022, 07:32:58 PM
lack of support hours available from Deciso in the US...hence the move to another vendor for business clients.

That's a flat out lie. We do have happy support customers in the US and all you need is to acquire a contract.

If you intend to keep spreading misinformation I have no alternative to taking action as a moderator.


Cheers,
Franco
Title: Re: multi-wan failover problem
Post by: hescominsoon on September 11, 2022, 11:31:45 PM
Quote from: franco on August 23, 2022, 01:56:23 PM
Quote from: hescominsoon on August 21, 2022, 07:32:58 PM
lack of support hours available from Deciso in the US...hence the move to another vendor for business clients.

That's a flat out lie. We do have happy support customers in the US and all you need is to acquire a contract.

If you intend to keep spreading misinformation I have no alternative to taking action as a moderator.


Cheers,
Franco
My intent is not misinformation. Sorry you see it that way. According to your site, support is 9-5 Central European Time, which does not line up with business hours in the US. So I will reword: there is no support from Deciso during US business hours, which is something I require from a vendor. My apologies for my error in wording there. The original post has been corrected, with the original text placed in parentheses for the record.
Title: Re: multi-wan failover problem
Post by: tong2x on December 29, 2022, 08:29:56 AM
There seems to be a recurrence of the issue, but for now I'm not quite sure how to replicate it.
connection is a fiber internet, it is always on.
but not as frequent as the issue before (also not as easily replicable)

What I notice is that if the connection goes down for a "long" time, the gateway will somehow be tagged down indefinitely. I tried restarting the system:gateway service, but it is still tagged as down.
so again, I click edit, do nothing and just click save. the connection will be good again.

i'm on OPNsense 22.7.10_2-amd64
Title: Re: multi-wan failover problem
Post by: axsdenied on December 29, 2022, 09:26:02 PM
I have this same issue as well.  Used to fail back with no problem but now it doesn't. Trying to deal with this issue manually isn't working either as described in my post below.

https://forum.opnsense.org/index.php?topic=31402.msg151400#msg151400
Title: Re: multi-wan failover problem
Post by: tong2x on December 30, 2022, 01:08:44 AM
In my case, there is an IP; it is just tagged as a down gateway.
Title: Re: multi-wan failover problem
Post by: axsdenied on December 30, 2022, 05:01:27 PM
Sometimes I do get an IP but no other data, i.e. gateway, DNS, etc.