Observed Behavior: Tier 1 GW has enough packet loss to be marked as down. Tier 2 GW kicks in and everything transitions nicely. Tier 1 connection goes back to online/green with 0% packet loss, and SOMETIMES connections fall back to Tier 1 and sometimes they don't. I have not been able to pin down when they do and when they don't.
I've seen various posts on this but haven't seen relevant solutions. Anyone have any thoughts?
Bonus notes:
- Clearing the entire state table does NOT cause connections to fall back
- Physically removing the connection to Tier 2 GW or rebooting that device DOES cause all connections to fall back to Tier 1 connection smoothly
- My NVIDIA Shield Pro, which is constantly streaming most of the day, NEVER falls back to Tier 1 GW unless I force it via the method above
Configuration Notes:
- OPNsense version 23.1.11
- IPv4 only, IPv6 disabled
- Relevant Firewall Rules: IPv4 LAN Network Pass rule to Gateway group
- GW 1 is set to Tier 1
- GW 2 is set to Tier 2
- GW Group Trigger Level was "Packet Loss". I'm now testing "member down"
- Monitor IP of GW 1 is 8.8.8.8
- Monitor IP of GW 2 is 8.8.4.4
- Allow default gateway switching is enabled
- System DNS Servers - 9.9.9.9 assigned to GW 1, 149.112.112.112 assigned to GW2
- NAT: Outbound Mode is set to "Hybrid outbound NAT rule generation"
- I do NOT use Wireguard
- I do NOT use Suricata
- I do NOT use any plugins related to routing or DNS
- I DO host OpenVPN server for "road warrior" purposes - no active connections
- Sticky Connections is NOT enabled
If I have the wrong expectation, and there isn't a forced function to kick them back to Tier 1, I would love to know that as well :)
Do you have a firewall rule with the specified gateway group setting, i.e. to send traffic to the correct gateway group?
Or are you just relying on the default gateway switching?
EDIT: Oh, missed the below initially... so you do, to the first point :)
"Relevant Firewall Rules: IPv4 Lan Network Pass rule to Gateway group"
The below would only really have been relevant if you were just relying on gateway switching:
- What is the routing table (netstat -rn) pre/post fail over?
- System -> Gateway -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?
If you go to:
System -> Gateway -> Single
and mark the Tier 2 gateway as down (Disable) while it's active, then apply, I assume it would then fail back to Tier 1?
Is this on 23.1 or 23.7? The alert handler changed in 23.7 due to problems in 23.1, which enabled combinations that previously did not work. As luck would have it, the change also hit another bug uncovered in the monitoring status code, see https://github.com/opnsense/core/issues/6728#issuecomment-1673060746
Cheers,
Franco
Quote from: franco on August 10, 2023, 08:37:06 PM
Is this on 23.1 or 23.7? The alert handler changed in 23.7 due to problems in 23.1, which enabled combinations that previously did not work. As luck would have it, the change also hit another bug uncovered in the monitoring status code, see https://github.com/opnsense/core/issues/6728#issuecomment-1673060746
Cheers,
Franco
On version 23.1.11
Quote from: iMx on August 10, 2023, 08:26:41 PM
If you go to:
System -> Gateway -> Single
and mark the Tier 2 gateway as down (Disable) while it's active, then apply, I assume it would then fail back to Tier 1?
Per my notes, yes. If I force GW2 down, whether physically or by marking it down, connections fall back.
Quote from: iMx on August 10, 2023, 08:15:43 PM
Do you have a firewall rule with the specified gateway group setting, i.e. to send traffic to the correct gateway group?
Or are you just relying on the default gateway switching?
EDIT: Oh, missed the below initially... so you do, to the first point :)
"Relevant Firewall Rules: IPv4 Lan Network Pass rule to Gateway group"
The below would only really have been relevant if you were just relying on gateway switching:
- What is the routing table (netstat -rn) pre/post fail over?
- System -> Gateway -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?
Regarding "- System -> Gateway -> Single, what priority are both gateways set to? Are they both tagged as 'upstream'?"
Neither GW is checked for upstream. Given it wasn't in the multi-wan guidance I wasn't sure if this applied to this situation.
I don't have the netstat data, but can simulate the scenario and capture it if necessary.
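For anyone wanting to capture that data, here is a minimal sketch of how the pre/post fail-over routing tables could be saved and compared from a shell on the firewall. The function names and file paths are made up for illustration, and the column layout assumed by the awk line (Destination, Gateway, Flags, Netif) is the usual FreeBSD `netstat -rn` output; adjust it if yours differs.

```shell
#!/bin/sh
# Save the current IPv4 routing table to the file given as $1.
# `netstat -rn -f inet` is the FreeBSD form of the command.
snapshot_routes() {
    netstat -rn -f inet > "$1"
}

# Print "<gateway> <interface>" for the default route in a saved snapshot.
# Assumes the column order: Destination Gateway Flags Netif [Expire].
default_route() {
    awk '$1 == "default" { print $2, $4 }' "$1"
}

# Succeeds (exit 0) when the default route differs between two snapshots.
default_route_changed() {
    [ "$(default_route "$1")" != "$(default_route "$2")" ]
}
```

Usage would be something like `snapshot_routes /tmp/routes.before`, then trigger the fail-over, then `snapshot_routes /tmp/routes.after` and `diff -u /tmp/routes.before /tmp/routes.after`. Note this only reflects default gateway switching; traffic matched by a firewall rule pointing at a gateway group is policy-routed and will not show up as a routing table change.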
My understanding, for default gateway switching you need:
- Specify Priority, lower numerical value is higher priority
- Tag both as 'Upstream'
"This will select the above gateway as a default gateway candidate."
The 2 fail-over mechanisms are different:
- Firewall rule -> gateway group, uses gateway groups.
- Default gateway switching, the priority/upstream tags in System -> Gateway -> Single
Default gateway switching is going to impact services running on the firewall itself and rules where there is no gateway/gateway group specified.
Quote from: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11
Ok then it might be the exact reason why it was rewritten for 23.7. If you want to test on 23.7.1 I'd recommend using the patch mentioned as well:
# opnsense-patch d1d255a24
And reboot for full effect...
Cheers,
Franco
Quote from: iMx on August 10, 2023, 09:08:27 PM
My understanding, for default gateway switching you need:
- Specify Priority, lower numerical value is higher priority
- Tag both as 'Upstream'
"This will select the above gateway as a default gateway candidate."
The 2 fail-over mechanisms are different:
- Firewall rule -> gateway group, uses gateway groups.
- Default gateway switching, the priority/upstream tags in System -> Gateway -> Single
Default gateway switching is going to impact services running on the firewall itself and rules where there is no gateway/gateway group specified.
Here are the priorities. I can certainly try with "upstream" selected to see if that's necessary, but I'm wary about that given it's not in the documentation. EDIT: I realize WAN looks like it could be a private IP for the gateway, it is not :)
(https://i.imgur.com/5EoVE19.png)
Quote from: franco on August 10, 2023, 09:12:31 PM
Quote from: axsdenied on August 10, 2023, 09:00:29 PM
On version 23.1.11
Ok then it might be the exact reason why it was rewritten for 23.7. If you want to test on 23.7.1 I'd recommend using the patch mentioned as well:
# opnsense-patch d1d255a24
And reboot for full effect...
Cheers,
Franco
For clarity you mean to upgrade to 23.7 and then run that patch?
Yeah 23.1.11_1 upgrade will take you to 23.7.1_3 directly and the patch goes on top. But don't rush the upgrade if you don't have to. Just that it's futile talking about 23.1 when this already changed in 23.7.
Here is the original issue report:
https://github.com/opnsense/core/issues/6231
Cheers,
Franco
Quote from: franco on August 10, 2023, 09:25:32 PM
Yeah 23.1.11_1 upgrade will take you to 23.7.1_3 directly and the patch goes on top. But don't rush the upgrade if you don't have to. Just that it's futile talking about 23.1 when this already changed in 23.7.
Here is the original issue report:
https://github.com/opnsense/core/issues/6231
Cheers,
Franco
Got it; I'm running on ZFS so I can try it and just fall back if more things break. No biggie!
Ok updated to 23.7.1_3, swapped to development type, applied patch d1d255a24 and rebooted. Will report back after I've had a real event or time to simulate the scenario.
I also created boot environments (BEs) for 23.1.11, 23.1.11_1 and 23.7.1_3 pre-patch just in case!
Ok well that didn't take long. Had a real event occur minutes after I posted my previous reply.
Still not seeing a full fallback to Tier 1. See image below. This was taken a few minutes after Tier 1 came back online. Light green is WAN (Tier 1), Dark green is WAN2 (Tier 2).
I even tried forcing the WAN2 down and it still has traffic routed through it. See 2nd image.
Img 1.
(https://i.imgur.com/55skhwD.png)
Img 2.
(https://i.imgur.com/gY8goG1.png)
Not sure where I got it into my head that I needed to be on the development branch to apply patches, but I caught my error. Everything above was done on, and applies to, the dev branch.
I've since reverted back to the community branch and have applied the patch to it and will continue to test.
> I've since reverted back to the community branch and have applied the patch to it and will continue to test.
So how's that test going?
Cheers,
Franco
So far so good, but I haven't had a chance to simulate it. Will do this week!
Side question: Did you guys do any memory optimization as well? I've noticed overall usage, with my config, hovering around 2.5GB. In 23.1 series it would slowly ramp up to 5 to 6GB.
Not that I'm aware of.
Cheers,
Franco
Ok I went to simulate a test by marking the gateway as down but nothing shifted. I can physically unplug the primary WAN to test as well but thought I'd share this.
(https://i.imgur.com/jaUiyRV.png)
"force_down" handling previously is a bit difficult to say given its niche value. Monitoring-induced downtimes already work and cable disconnects will work on 23.7.2.
I've added a commit to include force_down for testing as it would make sense to consolidate. If it works we can discuss adding it to 23.7.3.
https://github.com/opnsense/core/commit/7f1d8c66d3
Cheers,
Franco
Upgraded to 23.7.2 and tried simulating a fallback:
Everything failed over smoothly after WAN went down, but after it came back up, existing sessions stayed with WAN2 and never went back to WAN.
Should I re-apply the patch and try again?
On 23.7.2 there is nothing to reapply.
Do you have sticky connections enabled?
Cheers,
Franco
Sticky connections is not enabled. Over time, after about an hour or two, the connections did move over. Just not immediately.
Is it designed to wait for sessions to end or expire before moving?
Yep. Stateful tracking. You can try to experiment with rules that do not keep state (advanced rule settings). It might move over immediately, but it depends on the client liking that or not.
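To watch this happening, a rough sketch like the one below could be used to see which sessions of a given LAN client are still held in the state table after Tier 1 recovers. `pfctl -ss` is the standard pf command for listing states; the helper names and the address 192.168.1.50 are hypothetical. It only inspects states and does not move anything. Keep in mind the earlier report in this thread that clearing the whole state table did not trigger a fallback on 23.1.11.

```shell
#!/bin/sh
# Filter pf state lines (as printed by `pfctl -ss`) read from stdin,
# keeping only those that mention the given host address.
# -F: treat the address as a fixed string, so dots are literal.
# -w: whole-word match, so 192.168.1.5 does not match 192.168.1.50.
states_for_host() {
    grep -F -w -- "$1"
}

# Count the matching state lines instead of printing them.
count_states_for_host() {
    grep -c -F -w -- "$1"
}
```

On the firewall, as root, that would be run as `pfctl -ss | states_for_host 192.168.1.50`. Repeating it every few minutes after Tier 1 comes back should show long-lived sessions lingering until they end or expire, which would match the hour or two observed above.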
Cheers,
Franco
If that's by design, which makes logical sense for greatest session stability, then I had the wrong expectations.
Is there an option to force them back, much like connections are forced over by the trigger when WAN goes down? Most of the clients and apps I use respond well to being forced over, with the exception of Discord and Hulu (when you have the TV package, they do an IP "home" check; it also seems to never release its session, or at least that's the behavior it exhibits).