Multi-WAN failover not bringing link back up properly

Started by pjw, October 25, 2024, 04:55:35 PM

I have a multi-WAN setup with two uplinks (one to broadband, one to Starlink).  I have rules in place to split traffic between them, home traffic to broadband, work traffic to Starlink.  Works great still.

A note on all of this: this setup, including my gateway groups and all my firewall rules, ran fine throughout the previous major release and has been running fine on this one.

What I'm seeing, though, is that the link health check will sometimes kick in because Starlink has a hiccup, fail the link, and initiate failover in the gateway group.  What doesn't happen is recovery: whatever monitors the Starlink side for the link coming back never brings that interface up again.  I have to log into the UI, toggle off the interface my Starlink is plugged into, Apply, then toggle it back on and Apply again.  Then poof, the link is back and we're happy.  I've tried power cycling my Starlink router (it's in bridge mode) and that has not helped.

Worth noting that I tested bringing my broadband link down, and the same thing happens.  I have to manually toggle the port for the gateway to be brought back online.
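
To make the expected behaviour concrete, here is a minimal Python sketch of a failover group that also fails back. It is a toy model only, not OPNsense's actual monitoring code: the `Gateway` class, the `active_gateway` helper, and the three-probe thresholds are all assumptions made up for illustration. The point is simply that a gateway marked down should return to service on its own after enough good probes, with no manual interface toggle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    """Hypothetical model of one monitored gateway (not an OPNsense type)."""
    name: str
    tier: int                # lower tier = preferred member of the group
    fail_after: int = 3      # assumed: consecutive bad probes before "down"
    recover_after: int = 3   # assumed: consecutive good probes before "up"
    up: bool = True
    _streak: int = 0         # consecutive probes that contradict current state

    def record_probe(self, ok: bool) -> None:
        """Feed one health-check result and flip state once the streak is long enough."""
        if ok == self.up:
            # Probe agrees with the current state, so any pending streak resets.
            self._streak = 0
            return
        self._streak += 1
        limit = self.fail_after if self.up else self.recover_after
        if self._streak >= limit:
            self.up = ok
            self._streak = 0


def active_gateway(group: List[Gateway]) -> Optional[str]:
    """Return the name of the lowest-tier gateway that is up, or None if all are down."""
    candidates = [g for g in group if g.up]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.tier).name


if __name__ == "__main__":
    starlink = Gateway("starlink", tier=1)
    broadband = Gateway("broadband", tier=2)
    group = [starlink, broadband]

    # Simulate a Starlink hiccup followed by recovery.
    for ok in [False, False, False, True, True, True]:
        starlink.record_probe(ok)
        print(ok, starlink.up, active_gateway(group))
```

In this model the third failed probe moves traffic to the tier 2 gateway, and the third good probe moves it back to tier 1. The behaviour described in this thread corresponds to that second transition never happening until the interface is toggled by hand.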

I seem to recall that right after the 24.7 rollout some folks were having issues getting links back up in a failover scenario.  I had different problems (since resolved) so I never paid attention to it.  But it does seem like there is still an issue here.

Happy to try anything or share any details of my config if anyone is willing and able to help debug.

Same here. I use multi-WAN to manage multiple VPN connections, but once one goes offline it never comes back online. I have to manually switch the gateway off and on (or change something else that seems to restart the ping process).

I'm bumping this since I'm continuing to have this issue, plus what appears to be a regression.  I'm updated to the latest 24.7.8 release.

I had my second WAN link go down this morning, and I still had to disable the interface in OPNsense, apply, re-enable the interface, and apply again to make it see that the WAN link was actually up.  This time I tried rebooting the Starlink just to see if that link toggle might do the trick, but still, no dice.  OPNsense seems to just refuse to bring the link back without manual intervention.

What seems to be a regression is that after manually toggling the interface and bringing the gateway back up, my connections that are supposed to go over that gateway group do not fail back.  This was the case after my initial 24.7 upgrade, and somewhere between then and now it was fixed.  Now it is broken again.  The only way I can fix this is to manually fail the main WAN link, or reboot my OPNsense mid-day.  Neither is a great solution.

I'm hoping a dev sees this and can either confirm these are known issues or say what additional information they need to troubleshoot.  I'm more than happy to provide anything I can if it helps get these issues under control.

Perhaps, you might consider looking into
https://forum.opnsense.org/index.php?topic=44049.msg219587
   -->   https://github.com/opnsense/core/issues/8064:
"[ BE 23.10 ... BE 24.10_7-amd64 ] Automatic fail-over to a Fallback Gateway still fails"

Just in case anyone else is following this and wasn't aware, this does appear to be fixed in one of the last 2 updates.

December 29, 2024, 11:20:47 AM #5 Last Edit: December 29, 2024, 11:22:31 AM by FredFresh
I changed my configuration, switching from the gateway group style (suggested by the OPNsense documentation for managing WireGuard connections) to a WAN gateway style (the WireGuard gateways are now eligible for default/active WAN routing).

The issue is greatly mitigated, but it still happens sometimes. In my case, switching the offline gateway off and on is not enough; I have to run one or more traceroutes to the gateway address (e.g. 10.2.x.y).

The other thing I observed is that when I switch the modem off and on, half of the time the public IP of the main WAN gateway is not updated and I have to force an update using the reload button under interfaces/overview/commands.

@pjw, are you using a gateway group?

PS: I forgot to mention that with the new configuration, the switch between one connection/WireGuard VPN and the other is immediate, whereas before it took up to 5 minutes.