OPNsense Forum

Archive => 23.1 Legacy Series => Topic started by: axsdenied on August 10, 2023, 04:34:38 AM

Title: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 04:34:38 AM
Observed Behavior: The Tier 1 GW has enough packet loss to be marked as down. The Tier 2 GW kicks in and everything transitions nicely. When the Tier 1 connection goes back to online/green with 0% packet loss, SOMETIMES connections fall back to Tier 1 and sometimes they don't. I have not been able to pin down when it does versus when it doesn't.
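For readers less familiar with gateway groups, the expected failover/failback decision can be modelled roughly as follows: a new connection goes to the lowest-numbered tier that still has a gateway marked up. This is a simplified sketch, not OPNsense's implementation. The `Gateway` class, the loss-only up/down test and the 20% `LOSS_DOWN_THRESHOLD` are illustrative assumptions; the real trigger levels come from each gateway's monitoring settings.

```python
from dataclasses import dataclass
from typing import List

# Illustrative assumption: treat a gateway as down at or above 20% packet loss.
LOSS_DOWN_THRESHOLD = 20.0


@dataclass
class Gateway:
    name: str
    tier: int        # 1 = preferred; higher tiers are only used as fallback
    loss_pct: float  # latest packet loss reported by gateway monitoring


def is_up(gw: Gateway) -> bool:
    return gw.loss_pct < LOSS_DOWN_THRESHOLD


def active_gateways(group: List[Gateway]) -> List[Gateway]:
    """Return the up gateways in the lowest tier that has any up gateway."""
    up = [gw for gw in group if is_up(gw)]
    if not up:
        return []
    best_tier = min(gw.tier for gw in up)
    return [gw for gw in up if gw.tier == best_tier]


group = [Gateway("WAN", tier=1, loss_pct=35.0), Gateway("WAN2", tier=2, loss_pct=0.0)]
print([gw.name for gw in active_gateways(group)])  # ['WAN2'] while Tier 1 is lossy

group[0].loss_pct = 0.0  # Tier 1 recovers
print([gw.name for gw in active_gateways(group)])  # ['WAN'] for new connections
```

Note that this only decides where new connections are sent; it says nothing about connections that already exist.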

I've seen various posts on this but haven't seen relevant solutions.  Anyone have any thoughts?

Bonus notes:

Configuration Notes:

Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 05:31:02 PM
If I have the wrong expectation and there isn't a forced function to kick connections back to Tier 1, I would love to know that as well :)
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: iMx on August 10, 2023, 08:15:43 PM
Do you have a firewall rule with the gateway group specified, i.e. to send traffic to the correct gateway group? 

Or are you just relying on the default gateway switching?

EDIT: Oh, missed the below initially... so you do, to the first point :)
"Relevant Firewall Rules: IPv4 Lan Network Pass rule to Gateway group"

The below would only really have been relevant if you were just relying on gateway switching:

- What does the routing table (netstat -rn) look like before and after failover?
- System -> Gateways -> Single: what priority are both gateways set to? Are they both tagged as 'upstream'?
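If it helps when capturing the routing table before and after a failover, a small helper can pull the IPv4 default route out of saved `netstat -rn` output so the two captures can be compared. This is a hypothetical sketch: it assumes the FreeBSD column layout (Destination, Gateway, Flags, Netif), and the addresses and interface names below are made up. It only reflects default gateway switching; traffic steered by a gateway-group firewall rule does not show up as a change in the default route.

```python
from typing import Optional, Tuple


def default_route(netstat_output: str) -> Optional[Tuple[str, str]]:
    """Return (gateway, interface) of the IPv4 default route, or None.

    Assumes the FreeBSD `netstat -rn` layout, where the IPv4 table follows an
    "Internet:" line and its columns are Destination, Gateway, Flags, Netif.
    """
    in_ipv4_table = False
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Internet:":
            in_ipv4_table = True
        elif fields[0] == "Internet6:":
            in_ipv4_table = False
        elif in_ipv4_table and fields[0] == "default" and len(fields) >= 4:
            return fields[1], fields[3]
    return None


# Hypothetical captures (documentation addresses, made-up interface names).
before = """Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            203.0.113.1        UGS        igb0
"""
after = before.replace("203.0.113.1", "198.51.100.1").replace("igb0", "igb1")

print(default_route(before), "->", default_route(after))
# ('203.0.113.1', 'igb0') -> ('198.51.100.1', 'igb1')
```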
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: iMx on August 10, 2023, 08:26:41 PM
If you go to:

System -> Gateways -> Single

Mark the Tier 2 gateway as down (Disable) while it's active and apply; I assume it would then fail back to Tier 1?
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 10, 2023, 08:37:06 PM
Is this on 23.1 or 23.7? I ask because the alert handler changed in 23.7 due to problems in 23.1, enabling combinations that previously did not work. As it happens, it was also hitting another bug uncovered in the monitoring status code; see https://github.com/opnsense/core/issues/6728#issuecomment-1673060746


Cheers,
Franco
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 09:01:38 PM
Quote from: iMx on August 10, 2023, 08:26:41 PM
If you go to:

Systems -> Gateway -> Single

Mark the Tier 2 as down (Disable) when it's active, apply, I assume it would then fail back to Tier 1?

Per my notes, yes. If I force GW2 down, whether physically or by marking it down, it falls back.
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 09:04:49 PM
Regarding "- Systems -> Gateway -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?"

Neither GW is checked as upstream. Since it wasn't in the multi-WAN guidance, I wasn't sure whether it applied to this situation.

I don't have the netstat data, but can simulate the scenario and capture it if necessary.
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: iMx on August 10, 2023, 09:08:27 PM
My understanding is that for default gateway switching you need to:

- Set a priority (a lower numerical value means a higher priority)
- Tag both as 'Upstream'

"This will select the above gateway as a default gateway candidate."

The two failover mechanisms are different:

- Firewall rule -> gateway group: uses gateway groups.
- Default gateway switching: uses the priority/upstream tags in System -> Gateways -> Single.

Default gateway switching is going to impact services running on the firewall itself and rules where there is no gateway/gateway group specified.
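The default gateway switching half of that distinction can be sketched as follows: among gateways that are up and tagged as upstream, the one with the lowest priority value becomes the default route. This is a simplified model with assumed names (`SingleGateway`, `pick_default_gateway`) and made-up priority values; it leaves out anything else OPNsense may consider, and it does not model gateway-group rules, which pick a gateway per rule rather than changing the default route.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SingleGateway:
    name: str
    priority: int   # lower numerical value = higher priority
    upstream: bool  # tagged as 'upstream', i.e. a default gateway candidate
    up: bool        # current status from gateway monitoring


def pick_default_gateway(gateways: List[SingleGateway]) -> Optional[SingleGateway]:
    """Choose the default gateway among up, upstream-tagged candidates."""
    candidates = [gw for gw in gateways if gw.upstream and gw.up]
    if not candidates:
        return None
    return min(candidates, key=lambda gw: gw.priority)


# Hypothetical priorities; the real values are whatever is set per gateway.
wan = SingleGateway("WAN", priority=10, upstream=True, up=True)
wan2 = SingleGateway("WAN2", priority=20, upstream=True, up=True)

print(pick_default_gateway([wan, wan2]).name)  # WAN
wan.up = False
print(pick_default_gateway([wan, wan2]).name)  # WAN2
```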
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 10, 2023, 09:12:31 PM
Quote from: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11

OK, then this might be exactly why it was rewritten for 23.7. If you want to test on 23.7.1, I'd recommend applying the patch mentioned as well:

# opnsense-patch d1d255a24

And reboot for full effect...


Cheers,
Franco
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 09:18:06 PM
Quote from: iMx on August 10, 2023, 09:08:27 PM
My understanding, for default gateway switching you need:

- Specify Priority, lower numerical value is higher priority
- Tag both as 'Upstream'

"This will select the above gateway as a default gateway candidate."

The 2 fail-over mechanisms are different:

- Firewall rule -> gateway group, uses gateway groups.
- Default gateway switching, the priority/upstream tags in System -> Gateway -> Single

Default gateway switching is going to impact services running on the firewall itself and rules where there is no gateway/gateway group specified.

Here are the priorities. I can certainly try with "upstream" selected to see if that's necessary, but I'm wary of that given it's not in the documentation. EDIT: I realize the WAN gateway looks like it could be a private IP; it is not :)
(https://i.imgur.com/5EoVE19.png)
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 09:19:19 PM
Quote from: franco on August 10, 2023, 09:12:31 PM
Quote from: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11

Ok then it might be the exact reason why it was rewritten for 23.7. If you want to test on 23.7.1 I'd recommend using the patch mentioned as well:

# opnsense-patch d1d255a24

And reboot for full effect...


Cheers,
Franco

For clarity you mean to upgrade to 23.7 and then run that patch?
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 10, 2023, 09:25:32 PM
Yeah, the 23.1.11_1 upgrade will take you directly to 23.7.1_3, and the patch goes on top. But don't rush the upgrade if you don't have to; it's just that there's little point in discussing 23.1 when this has already changed in 23.7.

Here is the original issue report:

https://github.com/opnsense/core/issues/6231


Cheers,
Franco
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 10, 2023, 09:29:18 PM
Got it; I'm running on ZFS so I can try it and just fall back if more things break.  No biggie!
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 11, 2023, 04:27:49 PM
OK, I updated to 23.7.1_3, switched to the development type, applied patch d1d255a24 and rebooted. I'll report back after a real event or once I've had time to simulate the scenario.

I also created boot environments (BEs) for 23.1.11, 23.1.11_1 and 23.7.1_3 (pre-patch), just in case!
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 11, 2023, 04:38:13 PM
Ok well that didn't take long. Had a real event occur minutes after I posted my previous reply.

Still not seeing a full failback to Tier 1; see the image below, taken a few minutes after Tier 1 came back online. Light green is WAN (Tier 1), dark green is WAN2 (Tier 2).

I even tried forcing WAN2 down, and it still has traffic routed through it; see the second image.

Img 1.
(https://i.imgur.com/55skhwD.png)

Img 2.
(https://i.imgur.com/gY8goG1.png)
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 11, 2023, 05:05:08 PM
Not sure where I got it in my head that I needed to be on the development branch to apply patches, but I caught my error. Everything above applies to the dev branch.

I've since reverted back to the community branch and have applied the patch to it and will continue to test.
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 14, 2023, 01:26:37 PM
> I've since reverted back to the community branch and have applied the patch to it and will continue to test.

So how's that test going?


Cheers,
Franco
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 15, 2023, 06:04:34 AM
So far so good, but I haven't had a chance to simulate it.  Will do this week!

Side question: did you also do any memory optimization? With my config, overall usage now hovers around 2.5GB, whereas in the 23.1 series it would slowly ramp up to 5 to 6GB.
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 15, 2023, 08:40:46 AM
Not that I'm aware of.


Cheers,
Franco
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 20, 2023, 07:55:39 AM
OK, I tried to simulate a failover by marking the gateway as down, but nothing shifted. I can also physically unplug the primary WAN to test, but thought I'd share this.

(https://i.imgur.com/jaUiyRV.png)
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 21, 2023, 01:28:37 PM
How "force_down" was handled previously is a bit difficult to say, given its niche value. Monitoring-induced downtimes already work, and cable disconnects will work in 23.7.2.

I've added a commit to include force_down for testing as it would make sense to consolidate. If it works we can discuss adding it to 23.7.3.

https://github.com/opnsense/core/commit/7f1d8c66d3


Cheers,
Franco
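The idea of consolidating force_down with the other triggers can be pictured as three independent signals (monitoring, link state, force_down) feeding one effective up/down state, with any change in that state handled the same way regardless of its source. This is a hypothetical sketch for illustration only; it is not the code in the commit linked above, and all names are invented.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GatewaySignals:
    monitor_down: bool  # monitoring (loss/latency) has flagged the gateway
    link_down: bool     # cable unplugged / no carrier on the interface
    force_down: bool    # administrator marked the gateway as down

    @property
    def down(self) -> bool:
        # Any single signal is enough to treat the gateway as down.
        return self.monitor_down or self.link_down or self.force_down


def transitions(before: Dict[str, GatewaySignals],
                after: Dict[str, GatewaySignals]) -> Dict[str, str]:
    """Return gateways whose effective state flipped, mapped to 'up' or 'down'."""
    changes = {}
    for name, new in after.items():
        old = before.get(name)
        if old is not None and old.down != new.down:
            changes[name] = "down" if new.down else "up"
    return changes


healthy = GatewaySignals(monitor_down=False, link_down=False, force_down=False)
forced = GatewaySignals(monitor_down=False, link_down=False, force_down=True)
print(transitions({"WAN2": healthy}, {"WAN2": forced}))  # {'WAN2': 'down'}
print(transitions({"WAN2": forced}, {"WAN2": healthy}))  # {'WAN2': 'up'}
```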
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 25, 2023, 05:12:43 PM
Upgraded to 23.7.2 and tried simulating a fallback:

Everything failed over smoothly after WAN went down, but after it came back up, existing sessions stayed on WAN2 and never went back to WAN.

Should I re-apply the patch and try again?
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 25, 2023, 07:06:06 PM
On 23.7.2 there is nothing to reapply.

Do you have sticky connections enabled?


Cheers,
Franco
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 25, 2023, 11:43:08 PM
Sticky connections are not enabled. Over time (about an hour or two) the connections did move over, just not immediately.

Is it designed to wait for sessions to end or expire before moving?
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: franco on August 26, 2023, 08:43:57 PM
Yep. Stateful tracking. You can experiment with rules that do not keep state (advanced rule settings). Connections might then move over immediately, but it depends on whether the client likes that or not.


Cheers,
Franco
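A toy model of what stateful tracking means here: the gateway is pinned when a connection's state is created, so failback only affects new connections or connections whose state has expired. That matches the observation above that existing sessions moved to Tier 1 only after an hour or two. This is a simplification; the `StateTable` class and flow names are hypothetical, and it ignores pf's actual timeouts and state-matching rules.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StateTable:
    # Flow key -> gateway chosen when the flow's state was created.
    states: Dict[str, str] = field(default_factory=dict)

    def route(self, flow: str, rule_gateway: str, keep_state: bool = True) -> str:
        """Return the gateway a packet of `flow` leaves through."""
        if not keep_state:
            # No state: every packet follows the rule's current decision.
            return rule_gateway
        # With state: the first packet pins the gateway for the flow's lifetime.
        return self.states.setdefault(flow, rule_gateway)

    def expire(self, flow: str) -> None:
        """Drop a state, e.g. after the connection closes or times out."""
        self.states.pop(flow, None)


table = StateTable()
print(table.route("existing-session", "WAN2"))  # WAN2: created during the failover
print(table.route("existing-session", "WAN"))   # WAN2: Tier 1 is back, but the state wins
print(table.route("new-session", "WAN"))        # WAN: new flows use Tier 1 immediately
table.expire("existing-session")
print(table.route("existing-session", "WAN"))   # WAN: only after the old state is gone
print(table.route("stateless", "WAN", keep_state=False))  # WAN: follows the rule directly
```

The `keep_state=False` path corresponds to the "do not keep state" experiment: each packet follows the current rule decision, which is why the move can be immediate but may upset clients that expect a stable source address.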
Title: Re: Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable
Post by: axsdenied on August 26, 2023, 09:31:08 PM
If that's by design, which makes logical sense for greatest session stability, then I had the wrong expectations.

Is there an option to force them back, much like connections are forced over when a trigger takes WAN down? Most of the clients and apps I use respond well to being forced over, with the exception of Discord and Hulu (when you have the TV package, they do an IP "home" check. It also seems to never release its session, or at least that's the behavior it exhibits).