WAN gateway not going back up after Internet outage

Started by kevindd992002, August 20, 2024, 10:25:45 AM

After an Internet outage, my WAN gateway does not go back up until I disable and re-enable it in the OPNsense GUI. This has happened twice already, and I have seen a lot of these in the logs since the Internet went down:

2024-07-23T08:32:34   Warning   dpinger   WAN_GW 8.8.4.4: sendto error: 65   

I don't have any custom time settings in the gateway. The only custom setting I have there is the monitor IP (8.8.4.4); everything else is at its default.

Thoughts?
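
For anyone decoding that log line: dpinger prints the raw errno returned by sendto(), and OPNsense runs on FreeBSD, where errno 65 is EHOSTUNREACH ("No route to host"). The lookup below is only a small sketch. The table is hand-copied from FreeBSD's errno.h rather than taken from Python's errno module, because the numbers differ by platform (on Linux, EHOSTUNREACH is 113). The function name is made up for illustration.

```python
# Hand-copied subset of FreeBSD errno values (sys/errno.h).
# Python's own errno module reflects the host OS, so it is not used here.
FREEBSD_ERRNO = {
    50: ("ENETDOWN", "Network is down"),
    51: ("ENETUNREACH", "Network is unreachable"),
    64: ("EHOSTDOWN", "Host is down"),
    65: ("EHOSTUNREACH", "No route to host"),
}


def describe_dpinger_errno(code: int) -> str:
    """Translate the number after 'sendto error:' into a name and message."""
    name, text = FREEBSD_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"{name} ({text})"


print(describe_dpinger_errno(65))
```

So the warning means the firewall itself had no usable route to the monitor IP at the moment it tried to send the probe.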


I am having this same issue. I have a multi-WAN setup. When my primary WAN goes down, I have to reset the WAN1 gateway monitor for OPNsense to see the gateway as back up. I am not using PPPoE. The interface is set up to pull an IPv4 address from my ISP, Cox, using a Netgear CM1000 cable modem. Gateway switching works; it just doesn't switch back until I reset the gateway monitor service.

Dual-WAN is set up using the OPNsense guide https://docs.opnsense.org/manual/how-tos/multiwan.html

Perhaps worth trying this patch https://github.com/opnsense/core/issues/7027#issuecomment-2298661225

In reality it's hard to guess whether the running dpinger is actually able to come back or is really stuck in a permanent down state. In the second case it would probably be better if it exited so it could be restarted gracefully.


Cheers,
Franco

I'm not sure if you have a multi-wan setup, but gateway groups with multi-wan appear to be broken in 24.7. The symptoms are similar to yours, so it may be related. After an internet outage, WAN does not come back online. Here is a thread: https://forum.opnsense.org/index.php?topic=41915.0.

Because of this, I have downgraded to 24.1 for the time being.

Happy to hear this - I've got the same problem also with PPPoE multi WAN.
I only moved to multi-WAN around the same time I went to 24.7 so I was assuming I'd stuffed something up.

I checked 'Disable Host Route' about a week ago on just the primary WAN interface and haven't had the problem happen since. 
I've just applied the patch Franco mentioned above, and I also applied the PPPoE "Call for Testing" one last week although I left 'disable host route' switched on. 
In my case, getting things working again after one of these episodes only required going to System-Gateways-Configuration and hitting the Apply button.

Disable Host Route is unchecked again as of just now and I will advise what happens.
cheers all, love your work.


August 21, 2024, 09:16:26 AM #6 Last Edit: August 21, 2024, 09:19:46 AM by PencilHCV
Hi!
I also have multi-WAN failover on OPNsense 24.7.1. I tested this morning by unplugging the WAN1 network cable, and after a few seconds my backup Internet was up and running without a problem. There were no problems either when I reconnected the cable for WAN1.
To provide some more information about my environment: my OPNsense server is a bare-metal server, not virtual.
To configure WAN failover I followed this video:

https://www.youtube.com/watch?v=CcXYiFj9mBA

Best regards,
HCV

It is not PPPoE. It's IPoE (DHCP). And this is only single WAN. I'm guessing the problem with the multi-WAN being described here is the same as the one I'm having, as they both use dpinger anyway. I'm reading other threads saying that this was fixed a long time ago, but apparently it was not.

I'm new to OPNsense (coming from using pfSense for a very long time). So if I apply that patch now, do I need to "remove" it or something when the next version of OPNsense (the one that has this fix incorporated) comes out?

I'll post the revised version of my patch suggestion:

https://github.com/opnsense/core/commit/0c9d8c94

# opnsense-patch 0c9d8c94

Patches are removed on updates. Running patch a second time removes it too (and the third time adds it again and so on and so forth).
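
The toggle behaviour described above can be modelled in a few lines. This is only an illustration of the rule as stated in this post, not of how opnsense-patch is implemented; the function name and the event strings "patch" and "update" are invented for the example.

```python
def patch_state(events, applied=False):
    """Track whether a patch is applied after a sequence of events.

    "patch"  : running opnsense-patch with the same hash flips the state
    "update" : a firmware update always leaves the patch removed
    """
    for event in events:
        if event == "patch":
            applied = not applied
        elif event == "update":
            applied = False
        else:
            raise ValueError(f"unknown event: {event!r}")
    return applied


print(patch_state(["patch"]))            # applied once
print(patch_state(["patch", "patch"]))   # second run reverts it
print(patch_state(["patch", "update"]))  # update drops it
```

In other words, nothing has to be removed by hand before the next update; the one thing to avoid is running the same patch command twice by accident.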

This patch is not magical. I think it mostly helps multi-WAN setups to recover stuck gateways. With a single WAN the dpinger process may be stuck indefinitely, and it won't tell us whether it fails to send (which is where it can get stuck) or whether the other end fails to respond (that's ok).
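
To make the two failure modes concrete, here is a minimal sketch of a probe that reports them separately. It is not dpinger's code: dpinger is written in C and uses ICMP sockets, while this example works with any socket-like object that has sendto() and recvfrom(). The names ProbeResult and probe are hypothetical.

```python
import socket
from enum import Enum


class ProbeResult(Enum):
    REPLIED = "replied"
    NO_REPLY = "no reply"        # request left the box, peer stayed silent
    SEND_FAILED = "send failed"  # request never left the box (e.g. errno 65)


def probe(sock, payload: bytes, dest) -> ProbeResult:
    """Send one probe and say which side of the exchange failed, if any."""
    try:
        sock.sendto(payload, dest)
    except OSError:
        # Local problem: no route, interface down, and so on.
        return ProbeResult.SEND_FAILED
    try:
        sock.recvfrom(1024)
    except socket.timeout:
        # The probe went out but nothing came back in time.
        return ProbeResult.NO_REPLY
    return ProbeResult.REPLIED
```

A caller would create a real socket with a timeout set via settimeout() and pass it in. A monitor that sees only SEND_FAILED for a long stretch knows the problem is local routing state rather than the remote end.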


Cheers,
Franco

So then how do we solve dpinger being stuck indefinitely for single WAN? Are you saying that isn't an issue?



Isn't that the question? I don't have all the answers.


Cheers,
Franco

Testing the other day with the first patch seemed to go ok with failing over and then back when unplugging and plugging the WAN cable.  I had a real outage today though and everything failed over to the backup WAN perfectly but then did not come back when the primary was healthy again.
I've just applied Franco's patch 0c9d8c94 now, and tested by unplugging and plugging again and that also worked fine.   

But then I had a thought and tested it by unplugging the coax cable on the primary WAN instead of the NIC. The failover to backup worked fine, but when the coax was reconnected there was presumably no reconnect event and the primary was not detected as up again.
After that, unplugging and replugging the NIC also didn't fail back to the primary until I clicked Apply on the System-Gateways-Configuration page, and the primary came back to life.

I'm still pondering what to do about dpinger getting stuck on sendto error: 65, logging it but trying indefinitely and in vain to recover. It would be beneficial if it exited at some point, but that also means more work on our end. I don't like more work. ;)
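
One way the "exit at some point" idea could look is a counter of consecutive send failures that tells the monitor to give up once a threshold is reached, so that whatever supervises it can start a fresh process. This is a sketch of the idea only; neither dpinger nor OPNsense is known to work this way, and the class name and default threshold are assumptions.

```python
class SendErrorWatchdog:
    """Signal when too many sends in a row have failed (hypothetical helper)."""

    def __init__(self, max_consecutive_errors: int = 30):
        if max_consecutive_errors < 1:
            raise ValueError("threshold must be at least 1")
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive = 0

    def record(self, send_ok: bool) -> bool:
        """Record one send attempt; return True when the monitor should exit."""
        if send_ok:
            # Any successful send proves the path works again.
            self.consecutive = 0
            return False
        self.consecutive += 1
        return self.consecutive >= self.max_consecutive_errors


watchdog = SendErrorWatchdog(max_consecutive_errors=3)
for ok in [False, False, True, False, False, False]:
    if watchdog.record(ok):
        print("giving up so the monitor can be restarted")
```

Resetting on every success keeps short outages (the kind that recover on their own) from triggering an exit.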


Cheers,
Franco

Quote from: franco on August 24, 2024, 10:18:49 AM
Isn't that the question? I don't have all the answers.


Cheers,
Franco

It is. I just don't know if this issue can be considered a bug that needs to be in the OPNsense team pipeline or not. Does this mean that this also happens in pfSense? I don't remember having this issue when I was still using pfSense. No offense meant.

It's in the eternal pipeline of trying to get it fixed. What I remember from refactoring this code is that in the past these monitors were unconditionally restarted all the time, which likely masked the problem for most, and nowadays we see the problem spots persistently. In theory that is good. But in practice it looks like dpinger gets stuck forever:

https://github.com/dennypage/dpinger/blob/664f5c7aa617fa71834a90b132baf7188ca84a2b/dpinger.c#L366-L369

The error is treated like a ping loss, which it is not (at least not for an indefinite amount of time). The only clue it gives is a log line, and when you search for it online you find a lot of forum threads about how it is unclear why it's stuck and how to supposedly fix it, even for pfSense. :)
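
The consequence of counting a failed send as a lost ping can be shown with a toy calculation. This is not dpinger's actual bookkeeping, only a simplified model of the behaviour described above; the outcome labels and function names are invented for the example.

```python
def loss_percent(outcomes):
    """Loss figure when anything other than a reply counts as lost."""
    if not outcomes:
        return 0.0
    lost = sum(1 for outcome in outcomes if outcome != "replied")
    return 100.0 * lost / len(outcomes)


def loss_and_send_errors(outcomes):
    """Same loss figure, plus a separate count of local send failures."""
    return loss_percent(outcomes), outcomes.count("send_failed")


remote_silent = ["no_reply"] * 4    # probes went out, nothing came back
local_stuck = ["send_failed"] * 4   # probes never left the firewall

# Both situations produce the same loss figure...
print(loss_percent(remote_silent), loss_percent(local_stuck))
# ...and only the extra counter tells them apart.
print(loss_and_send_errors(remote_silent), loss_and_send_errors(local_stuck))
```

With only the loss percentage to go on, a gateway that is really down and a monitor that cannot send look identical from the outside.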

I can actually reproduce the error condition here by restarting my WAN connectivity and dpinger logs the error, but it quickly recovers. That is what we want. Now we only have to know why it cannot recover in some cases...


Cheers,
Franco