WAN gateway not going back up after Internet outage

Started by kevindd992002, August 20, 2024, 10:25:45 AM

Quote
But then I had a thought and tested it by unplugging the coax cable on the primary WAN instead of the NIC. The failover to backup worked fine, but when the coax was reconnected there was presumably no reconnect event, and the primary was not detected as up again.

I am not exactly hopeful here. apinger, dpinger, nonpinger... all suck. This stuck condition is something I've observed for ages on one particular site using "the other project's sense". That site is so inconveniently located that it was "solved" by using an IP watchdog socket. It will eventually power-cycle the ISP gear, which triggers link down / up, and that is enough to bring the dpinger zombie back to life.
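For anyone unfamiliar with the watchdog approach, here is a rough sketch of the decision logic such a device typically applies: count consecutive failed probes and power-cycle once a limit is reached. The class name `LinkWatchdog`, the `max_failures` parameter, and the reset-after-cycle behaviour are all assumptions for illustration, not the firmware of any particular product.

```python
class LinkWatchdog:
    """Hypothetical watchdog: signals a power cycle after N consecutive failed probes."""

    def __init__(self, max_failures: int = 5) -> None:
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures
        self.failures = 0  # consecutive failed probes seen so far

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True when a power cycle is due."""
        if probe_ok:
            # Any successful probe clears the failure streak.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            # Assumption: start counting afresh after triggering the power cycle.
            self.failures = 0
            return True
        return False


if __name__ == "__main__":
    wd = LinkWatchdog(max_failures=3)
    for ok in [False, False, True, False, False, False]:
        if wd.observe(ok):
            print("power-cycle the ISP gear")
```

The point of requiring a streak rather than a single failure is to avoid power-cycling the modem on ordinary packet loss.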

What if we offer a lightweight way to trigger the "fix stuck down monitors" code introduced here via cron job? Just the monitor part:

https://github.com/opnsense/core/commit/0c9d8c94049

Ok, so that's sort of going back to "unconditionally restarted all the time"? :D

Fair point, although it's not randomly restarting due to monitor events. It's scheduled restarting via user wishes. And we don't have to guess what schedule the user prefers.

Over the years users seem to have grown fond of cron-based workarounds.


Cheers,
Franco

Well, I don't have anything against that (it will just cause some additional log noise). Better than having it in a perpetually stuck state.

I made an error with the previous patch. Here is the revised version with cron job and all:

https://github.com/opnsense/core/issues/7027#issuecomment-2314857927

Only use the cron job if the issue persists with this patch applied over the next days.


Cheers,
Franco

I had this happen again today. The Internet connection went down overnight because of maintenance, and I woke up to no Internet. I had to go to Gateways, edit the gateway, and save.

Is there any update to this? Is this happening to pfsense too?

The patch in question was added to 24.7.x already. You can add the "Manual gateway switch" cron job to adjust the situation.

I don't know about the other *sense. You trade a bug for another either way I think, but whatever works works.


Cheers,
Franco

Quote from: franco on October 29, 2024, 08:35:38 AM
The patch in question was added to 24.7.x already. You can add the "Manual gateway switch" cron job to adjust the situation.

I don't know about the other *sense. You trade a bug for another either way I think, but whatever works works.


Cheers,
Franco

I was running 24.7.4_1 when this happened yesterday. Was the "Manual gateway switch" cron job created mainly as a workaround for this issue?


Hi,

Am I understanding correctly that the roadmap in https://github.com/opnsense/core/issues/7027#issuecomment-2462108325 should lead to a solution in OPNsense 25.7? Thank you.

Eagerly awaiting 25.7!

Is there a workaround in the meantime, aside from patching or manually restarting the gateway monitor? Is it better for now to disable gateway monitoring?

I have three WAN connections, two fibre and one Starlink. The latter is a last-resort backup to prevent complete outages. A couple of times the monitor has marked some of these WAN connections as down, and they stayed marked down even after the links had recovered.

I enabled the 'manual gateway switch' cronjob and set it for every five minutes. It has successfully restored a WAN connection that was incorrectly marked as offline.
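To illustrate what "incorrectly marked as offline" means in practice, here is a minimal sketch of a check that flags a gateway as stuck when the monitor reports it down but an independent probe of the same monitor address keeps succeeding. Everything in it is hypothetical: `GatewayState`, `find_stuck_gateways`, and the caller-supplied `probe` function are illustrative names, not OPNsense APIs, and the probe is assumed to send its traffic out through that gateway's own interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class GatewayState:
    name: str          # hypothetical gateway label
    monitor_ip: str    # address the gateway monitor is configured to probe
    marked_down: bool  # what the gateway monitor currently reports


def find_stuck_gateways(
    gateways: Iterable[GatewayState],
    probe: Callable[[str], bool],
    attempts: int = 3,
) -> List[str]:
    """Return names of gateways marked down although every probe attempt succeeds."""
    stuck = []
    for gw in gateways:
        if not gw.marked_down:
            continue  # monitor already agrees the link is up
        # Require all attempts to succeed so one lucky reply does not count.
        if all(probe(gw.monitor_ip) for _ in range(attempts)):
            stuck.append(gw.name)
    return stuck


if __name__ == "__main__":
    # Documentation-range addresses and a fake probe, purely for demonstration.
    reachable = {"192.0.2.1"}
    gateways = [
        GatewayState("WAN_A", "192.0.2.1", marked_down=True),
        GatewayState("WAN_B", "192.0.2.2", marked_down=True),
        GatewayState("WAN_C", "192.0.2.3", marked_down=False),
    ]
    print(find_stuck_gateways(gateways, lambda ip: ip in reachable))
```

A scheduled job like the five-minute one above sidesteps this detection entirely by re-evaluating on a timer, at the cost of some extra log noise.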


Try the new failover/failback options as described in the documentation. https://github.com/opnsense/docs/commit/1b5e6684c8

Both are available now as of recent 25.1.x.


Cheers,
Franco