[SOLVED] Multi-gateway rules not working after 24.7 upgrade

Started by pjw, August 01, 2024, 06:59:00 PM

I have a setup with two WAN uplinks, and I've had routing and firewall rules set up with two gateway groups to split traffic and support failover.  It looks like this has broken again since the upgrade.

On OPT1, which is where all my LAN traffic comes in, I have two firewall rules, in this order in the GUI:

- Incoming to OPT1, destination to an IP group (set of destination IPs) => gateway group 2
- Incoming to OPT1, destination to anywhere => gateway group 1

I checked that the gateway groups are prioritized correctly.
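For anyone following along, here is a rough sketch of what the two rules above are expected to do. It assumes rules are evaluated top to bottom with the first match winning, as the GUI order suggests. The rule table, destination networks, and group names are made up for illustration and are not taken from any real config.

```python
import ipaddress

# Hypothetical rule table mirroring the two OPT1 rules, in GUI order.
# The destination networks and group names are illustrative only.
RULES = [
    {"dst": ["203.0.113.0/24", "198.51.100.0/24"], "gateway_group": "GW_GROUP_2"},
    {"dst": ["0.0.0.0/0"], "gateway_group": "GW_GROUP_1"},
]


def pick_gateway_group(dst_ip, rules=RULES):
    """Return the gateway group of the first rule whose destination matches."""
    addr = ipaddress.ip_address(dst_ip)
    for rule in rules:
        if any(addr in ipaddress.ip_network(net) for net in rule["dst"]):
            return rule["gateway_group"]
    return None  # no rule matched


print(pick_gateway_group("203.0.113.7"))  # address inside the IP group
print(pick_gateway_group("8.8.8.8"))      # everything else
```

With working policy routing, an address in the IP group should land on gateway group 2 and everything else on gateway group 1. The symptom in this thread is that traffic behaves as if the first rule did not exist.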

What I'm seeing is that the first rule doesn't seem to be hit, which it was pre-upgrade.  What's perplexing is that this did work right after the initial upgrade, but since the hotfix to 24.7_9 it appears broken.  I've tried disabling and re-enabling the rules and moving them around, all with "Reapply" in between.  I've also tried rebooting.  Nothing works.

Curious if anyone else is seeing similar issues and has any ideas on how to resolve this.

Further information:

I went ahead and disabled my one "main" gateway under System => Gateways => Configuration and applied the change.  My secondary gateway became active, and the main gateway disappeared from the active gateways view.  Even though it was disabled, traffic is still being routed to it no matter what.  This is really confusing: it's as if the gateway state shown in the UI is completely ignored and traffic just goes to the gateway with the higher priority metric, even if it's disabled.

I ended up trying another thing by disabling the interface for the "main" gateway (disabled the port).  After doing that and re-enabling the interface, it seems my multi-wan is working again for now.

This is still broken.  I had one of the WAN links fail overnight (this is not uncommon) and the multi-WAN setup properly failed things over to my primary.  But it refuses to fail back, and is routing 100% of traffic now out of the primary, and ignoring the firewall rules.

I'm seeing this as well after a recent upgrade to 24.7. It wasn't an issue on 24.1.

I just performed the most recent upgrade to get up to 24.7.1.  This issue still remains where the multi-wan setup just doesn't work, and the higher-metric gateway is always chosen no matter what.

I'm happy to try a patch or anything to help get this fixed.

August 10, 2024, 03:47:09 PM #5 Last Edit: August 10, 2024, 03:51:50 PM by tracerrx
I can confirm that this is happening... I'm not sure it's always sticking to the lowest-numbered gateway though... Mine failed from primary (252) to secondary (253)... After primary was back up, there was nothing I could do to push traffic back to it... Even when disabling the secondary (253) gateway, traffic still flowed through secondary (253) and not primary (252).

The only way to restore traffic back to the primary (252) gateway was to reboot... This was definitely introduced in 24.7. 

And yes, all changes were "Applied"... I even restarted the interface multiple times.

In addition, for whatever reason, when multiple gateways are enabled, sometimes after reboot they show down on the dashboard, and the only way to get them in an "UP" status is to edit the gateway, change nothing, and apply.

One final note: with multi-WAN on 24.1 and Starlink, you needed "Disable Host Route" checked to be able to use gateway monitoring.  On 24.7, "Disable Host Route" must be UNCHECKED.  It doesn't seem to matter for the Xfinity/Comcast (primary) gateway.

I've been able to replicate all the above amongst multiple sites with the same multi-wan setups.

I believe I have a similar, related issue. I run a VPN network similar to the setup at https://github.com/FingerlessGlov3s/OPNsensePIAWireguard - some containers on Proxmox are assigned to the VPN-only network but are still accessible by other hosts on the main internal network. 24.7 broke all routes to the VPN subnet, and I haven't been able to restore access.

Same here - multi-WAN load balancing has not worked since 24.1.10_9, and upgrading to 24.7 did not solve the issue. Even with a "load balancing" setup, it behaves like a "failover" setup, meaning traffic only routes through the active WAN.
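To make the expected difference concrete, here is a minimal sketch of load balancing, assuming that gateways placed in the same tier are meant to share new connections. It uses plain round-robin for illustration only. The real firewall may apply weights or sticky connections, and the function and gateway names here are hypothetical.

```python
from itertools import cycle


def assign_connections(connections, gateways_up):
    """Spread new connections round-robin over same-tier gateways that are up."""
    if not gateways_up:
        raise ValueError("no gateway available")
    rr = cycle(gateways_up)
    return {conn: next(rr) for conn in connections}


# Hypothetical example: four new connections, two same-tier gateways up.
result = assign_connections(["c1", "c2", "c3", "c4"], ["WAN1", "WAN2"])
print(result)
```

In a working load-balancing setup both gateways would show up in the result. The behavior reported here looks like every connection being assigned to a single gateway instead.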


Quote from: apunkt on August 16, 2024, 10:45:28 AM
I copy that

I just found this out this morning.

I was noticing zero traffic on a tunnel and wondered why...

And yep, it was all going out WAN.

I had a tier 1 and tier 2, and those were entirely ignored.

August 18, 2024, 07:29:56 AM #10 Last Edit: August 18, 2024, 07:34:27 AM by patrick3000
Is there any update on this problem? I upgraded to 24.7 about a week ago, and I learned today that my multi-wan setup with a gateway group no longer works.

I have two gateway interfaces--one is called WAN and the other WAN2. I have them in a gateway group, with WAN being the primary gateway and WAN2 the secondary gateway that is only supposed to be used if WAN fails.
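As a reference for what this failover group is supposed to do, here is a small sketch. It assumes the usual tier semantics: the lowest-numbered tier that still has a gateway up wins, and lower-priority tiers are used only when every gateway above them is down. The function name and data layout are hypothetical, not OPNsense internals.

```python
def select_gateways(group, status):
    """Return the up gateways in the best (lowest-numbered) available tier.

    group:  dict mapping gateway name -> tier number (1 = highest priority)
    status: dict mapping gateway name -> True if the gateway is up
    """
    up = {name: tier for name, tier in group.items() if status.get(name, False)}
    if not up:
        return []
    best_tier = min(up.values())
    return sorted(name for name, tier in up.items() if tier == best_tier)


group = {"WAN": 1, "WAN2": 2}
print(select_gateways(group, {"WAN": True, "WAN2": True}))   # both up
print(select_gateways(group, {"WAN": True, "WAN2": False}))  # secondary down
print(select_gateways(group, {"WAN": False, "WAN2": True}))  # primary down
```

Under these assumptions, losing the tier 2 gateway should change nothing while the tier 1 gateway is still up.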

Today, the secondary gateway, WAN2, went down due to an outage at the ISP, and I lost internet in my house even though the primary gateway, WAN, was still active and able to send and receive packets.

After WAN2 came back online, I duplicated the problem manually by yanking the cable on WAN2, and again, I lost internet in the house.

Interestingly, when the primary gateway, WAN, fails, failover to WAN2 happens as it should, but when WAN2 fails, internet is lost entirely. All of this worked in 24.1.

I then searched and found this thread. Unfortunately, I do not see any solution here. Unless there is one, I'm going to downgrade to 24.1, which I can do easily because I run OPNsense in a VM on TrueNAS SCALE and can roll back to a snapshot. Of course, downgrading is not my first choice since I like the new dashboard in 24.7, but it's more important to have multi-WAN working.

I have no workaround so far from anyone, and I've not heard or seen any mention that a fix is being worked on.

I'm unfortunately not brave enough to try a downgrade, even though my config is backed up in a few places. I can't afford extended downtime since I work remotely. But I rely on the ability to split my traffic between both WANs, since I push my work traffic over one uplink and the rest of the house over the other. I can't do that now.

This was broken at one point in 24.1 as well, and then an update fixed it shortly before the 24.7 upgrade was released. I'm hoping a dev sees this and knows what needs to happen, and an update pops out really soon.

Thanks. I have downgraded to version 24.1.10_8, and multi-WAN works properly again. After the downgrade, I yanked the cable into the modem for each gateway respectively, and OPNsense properly failed over to the other gateway with uninterrupted internet.

Hopefully, this problem will get fixed at some point, and I will then upgrade again.

Trying a bump on this since even after the recent 24.7.2 updates, this is still not working.  I don't know if this has something to do with the second WAN uplink having a higher metric or not, but this setup worked before the big upgrade and still does not work now.  I'm really hesitant to downgrade, since recreating my config if a restore doesn't work seems a bit terrifying to me.

I'm really open to trying any patches, dev builds, command-line hacks, anything, to try and get this working again.  Any help is greatly appreciated.

For some reason, I have found nothing about this issue except this thread. It definitely worked prior to the upgrade to 24.7, and absolutely does not in 24.7, at least when I tried it last week.

As noted, I downgraded to 24.1.10, and it's back to working, but I was able to do so by rolling back to a snapshot.

One tip: If you downgrade manually to 24.1.10, make sure you have a config file ready that was created in 24.1.10 or earlier. At least in most similar setups (and I assume OPNsense is the same way), restoring from config only works if the config file was created from the same or earlier version to which it's restored.

Of course, downgrading is only a temporary solution. It's not feasible to remain with 24.1.10 permanently, so hopefully there is some interest in a workaround or patch in 24.7 for this, because it's beyond my technical skills to fix it on my own.