Multi-Wan Setup Failback from Tier 2 to Tier 1 unreliable

Started by axsdenied, August 10, 2023, 04:34:38 AM

Observed Behavior: The Tier 1 GW has enough packet loss to be marked as down. The Tier 2 GW kicks in and everything transitions nicely. When the Tier 1 connection goes back to online/green with 0% packet loss, connections SOMETIMES fall back to Tier 1 and sometimes don't. I have not been able to pin down when they do versus when they don't.

I've seen various posts on this but haven't seen relevant solutions.  Anyone have any thoughts?

Bonus notes:

  • Clearing the entire state table does NOT cause connections to fall back
  • Physically removing the connection to Tier 2 GW or rebooting that device DOES cause all connections to fall back to Tier 1 connection smoothly
  • My NVIDIA Shield Pro, which is streaming most of the day, NEVER falls back to the Tier 1 GW unless I force it via the method above

Configuration Notes:

  • OPNsense version 23.1.11
  • IPv4 only, IPv6 disabled
  • Relevant Firewall Rules: IPv4 Lan Network Pass rule to Gateway group
  • GW 1 is set to Tier 1
  • GW 2 is set to Tier 2
  • GW Group Trigger Level was "Packet Loss".  I'm now testing "member down"
  • Monitor IP of GW 1 is 8.8.8.8
  • Monitor IP of GW 2 is 8.8.4.4
  • Allow default gateway switching is enabled
  • System DNS Servers - 9.9.9.9 assigned to GW 1, 149.112.112.112 assigned to GW2
  • NAT: Outbound Mode is set to "Hybrid outbound NAT rule generation"
  • I do NOT use Wireguard
  • I do NOT use Suricata
  • I do NOT use any plugins related to routing or DNS
  • I DO host OpenVPN server for "road warrior" purposes - no active connections
  • Sticky Connections is NOT enabled
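
For reference, the failback I expect from a tiered gateway group can be written down as a small Python model. This is only a sketch of the expected behavior, not OPNsense's actual code: the names Member, is_usable and active_members, and the trigger labels, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    tier: int
    down: bool = False   # monitor reports the gateway as down
    loss: bool = False   # packet loss above the alert threshold
    delay: bool = False  # latency above the alert threshold


def is_usable(member, trigger):
    """Return True if the member stays in rotation for this trigger level."""
    if member.down:
        return False
    if trigger == "packet_loss":
        return not member.loss
    if trigger == "high_latency":
        return not member.delay
    if trigger == "loss_or_latency":
        return not (member.loss or member.delay)
    # "member_down": only a hard down removes the member
    return True


def active_members(members, trigger):
    """Return the names of the usable members in the lowest-numbered tier."""
    usable = [m for m in members if is_usable(m, trigger)]
    if not usable:
        return []
    best_tier = min(m.tier for m in usable)
    return [m.name for m in usable if m.tier == best_tier]
```

With GW1 (Tier 1) showing packet loss, the "packet_loss" trigger should hand traffic to GW2, and once the loss clears the same call should return GW1 again. That second step is the part that is unreliable here.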

OPNsense 24.7.7 running on:
Dell Optiplex 3050
Intel I5-7600 @ 3.5Ghz (4 Cores)
Intel I350-T4 Nic
8G DDR4
256G SSD

If I have the wrong expectation, and there isn't a forced function to kick them back to Tier 1, I would love to know that as well :)

Do you have a firewall rule with the gateway group set, i.e., to send traffic to the correct gateway group?

Or are you just relying on the default gateway switching?

EDIT: Oh, missed the below initially... so you do, to the first point :)
"Relevant Firewall Rules: IPv4 Lan Network Pass rule to Gateway group"

The below would only really have been relevant if you were just relying on gateway switching:

- What is the routing table (netstat -rn) pre/post fail over?
- System -> Gateways -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?

If you go to:

System -> Gateways -> Single

and mark the Tier 2 gateway as down (Disable) while it's active, then apply, I assume it would fail back to Tier 1?

Is this on 23.1 or 23.7? I ask because the alert handler was changed in 23.7 due to problems in 23.1, enabling combinations that previously did not work. As luck would have it, the new code also hit another bug uncovered in the monitoring status code, see https://github.com/opnsense/core/issues/6728#issuecomment-1673060746


Cheers,
Franco

Quote from: franco on August 10, 2023, 08:37:06 PM
Is this on 23.1 or 23.7? I ask because the alert handler was changed in 23.7 due to problems in 23.1, enabling combinations that previously did not work. As luck would have it, the new code also hit another bug uncovered in the monitoring status code, see https://github.com/opnsense/core/issues/6728#issuecomment-1673060746


Cheers,
Franco

On version 23.1.11

Quote from: iMx on August 10, 2023, 08:26:41 PM
If you go to:

Systems -> Gateway -> Single

Mark the Tier 2 as down (Disable) when it's active, apply, I assume it would then fail back to Tier 1?

Per my notes, yes. If I force GW2 down, whether physically or by marking it down, connections fall back.

Quote from: iMx on August 10, 2023, 08:15:43 PM
Do you have a firewall rule with the specified gateway group setting, i.e to send traffic to the correct gateway group? 

Or are you just relying on the default gateway switching?

EDIT: Oh, missed the below initially... so you do, to the first point :)
"Relevant Firewall Rules: IPv4 Lan Network Pass rule to Gateway group"

The below would only really have been relevant if you were just relying on gateway switching:

- What is the routing table (netstat -rn) pre/post fail over?
- Systems -> Gateway -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?

Regarding "- Systems -> Gateway -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?"

Neither GW is checked for upstream.  Given it wasn't in the multi-wan guidance I wasn't sure if this applied to this situation.

I don't have the netstat data, but can simulate the scenario and capture it if necessary.
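
A minimal Python sketch for capturing that data: it snapshots `netstat -rn` and reports only the route lines that changed between two snapshots. The helper names routing_table and diff_tables are hypothetical, and this assumes Python 3 is available on the box.

```python
import difflib
import subprocess


def routing_table():
    """Return the current output of `netstat -rn` as a list of lines."""
    result = subprocess.run(
        ["netstat", "-rn"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def diff_tables(before, after):
    """Return only the removed ('- ') and added ('+ ') route lines."""
    return [
        line
        for line in difflib.ndiff(before, after)
        if line[:2] in ("- ", "+ ")
    ]
```

The idea would be to call routing_table() once before the failover, once after Tier 2 takes over, and once more after Tier 1 is green again, then pass each pair to diff_tables() to see whether the default route actually moved.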

My understanding, for default gateway switching you need:

- Specify Priority, lower numerical value is higher priority
- Tag both as 'Upstream'

"This will select the above gateway as a default gateway candidate."

The 2 fail-over mechanisms are different:

- Firewall rule -> gateway group, uses gateway groups.
- Default gateway switching, uses the priority/upstream tags in System -> Gateways -> Single

Default gateway switching is going to impact services running on the firewall itself and rules where there is no gateway/gateway group specified.
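
To make the second mechanism concrete, here is a rough Python model of default gateway selection as I described it above. It is a sketch of my understanding, not the OPNsense implementation: Gateway and pick_default are made-up names, and what really happens when no gateway is tagged 'upstream' may differ from this model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gateway:
    name: str
    priority: int           # lower value = higher priority
    upstream: bool = False  # tagged as a default gateway candidate
    down: bool = False      # monitor currently reports it as down


def pick_default(gateways) -> Optional[str]:
    """Return the name of the preferred default gateway, or None."""
    candidates = [g for g in gateways if g.upstream and not g.down]
    if not candidates:
        # Assumption: with no live upstream candidate the model picks nothing.
        return None
    return min(candidates, key=lambda g: g.priority).name
```

In this model, the gateway with the lowest priority value wins while it is up, the next one takes over when it goes down, and the first one is picked again as soon as it recovers. None of this touches traffic matched by a firewall rule that points at a gateway group.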

Quote from: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11

Ok then it might be the exact reason why it was rewritten for 23.7. If you want to test on 23.7.1 I'd recommend using the patch mentioned as well:

# opnsense-patch d1d255a24

And reboot for full effect...


Cheers,
Franco

August 10, 2023, 09:18:06 PM #10 Last Edit: August 10, 2023, 09:28:09 PM by axsdenied
Quote from: iMx on August 10, 2023, 09:08:27 PM
My understanding, for default gateway switching you need:

- Specify Priority, lower numerical value is higher priority
- Tag both as 'Upstream'

"This will select the above gateway as a default gateway candidate."

The 2 fail-over mechanisms are different:

- Firewall rule -> gateway group, uses gateway groups.
- Default gateway switching, the priority/upstream tags in System -> Gateway -> Single

Default gateway switching is going to impact services running on the firewall itself and rules where there is no gateway/gateway group specified.

Here are the priorities. I can certainly try with "upstream" selected to see if that's necessary, but I'm wary about that given it's not in the documentation. EDIT: I realize WAN looks like it could be a private IP for the gateway; it is not :)

Quote from: franco on August 10, 2023, 09:12:31 PM
Quote from: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11

Ok then it might be the exact reason why it was rewritten for 23.7. If you want to test on 23.7.1 I'd recommend using the patch mentioned as well:

# opnsense-patch d1d255a24

And reboot for full effect...


Cheers,
Franco

For clarity, you mean to upgrade to 23.7 and then run that patch?

Yeah 23.1.11_1 upgrade will take you to 23.7.1_3 directly and the patch goes on top. But don't rush the upgrade if you don't have to. Just that it's futile talking about 23.1 when this already changed in 23.7.

Here is the original issue report:

https://github.com/opnsense/core/issues/6231


Cheers,
Franco

Quote from: franco on August 10, 2023, 09:25:32 PM
Yeah 23.1.11_1 upgrade will take you to 23.7.1_3 directly and the patch goes on top. But don't rush the upgrade if you don't have to. Just that it's futile talking about 23.1 when this already changed in 23.7.

Here is the original issue report:

https://github.com/opnsense/core/issues/6231

Cheers,
Franco

Got it; I'm running on ZFS so I can try it and just fall back if more things break.  No biggie!

August 11, 2023, 04:27:49 PM #14 Last Edit: August 11, 2023, 04:29:29 PM by axsdenied
Ok updated to 23.7.1_3, swapped to development type, applied patch d1d255a24 and rebooted.  Will report back after I've had a real event or time to simulate the scenario.

I also created boot environments (BEs) for 23.1.11, 23.1.11_1 and 23.7.1_3 pre-patch, just in case!