Randomly losing Internet connectivity, must reboot or Reload all services from CLI

Started by FLguy, January 29, 2025, 03:57:30 PM

Hello all,

I have been running OPNsense for a few years without issues.  The firewall can access two internet providers, 1G Fiber and 1G Cable.  I have had both providers for years as well.  Back in November, we started having problems losing access to the internet.  Yet both providers were OK.  Then I noticed it was not a complete internet loss, as some clients could still ping 1.1.1.1.  But DNS wasn't resolving internet hostnames.  Internal hostnames were fine.

Before this problem started, I wasn't running a complete Multi-WAN configuration:

System : Settings : General : Gateway switching was not checked

Firewall : Settings : Advanced : Disable force gateway was not checked


I do have both gateways monitoring their default gateway.  The Fiber gateway is the preferred gateway and has a lower priority.

When the problem first started, I could disable the fiber gateway to restore internet access. At first, I thought the issue was DNS (Unbound). When I re-enabled the fiber gateway, everything was fine.

So, I enabled the Multi-WAN settings above, but the problem still occurs. Every few days, internet access is lost: DNS can't resolve external FQDNs, some clients can ping 1.1.1.1, and others can't. I updated the software to 24.7.11_2 back in Nov.; the firewall had been on ~24.7.1. I have since removed the Multi-WAN settings. Nothing is fixing this problem.

If I disable the fiber gateway in System: Gateways: Configuration
or reboot the firewall
or Reload all services in cli

That will fix the problem. 

I have tried other things like disabling the Cable WAN interface and running only on the fiber connection. 

I need some guidance on what to look at.

Thank you for any time and support,
Nick

The internet gremlins are out to get me since writing this post. The problem has happened 4 times now. I now just keep an active SSH session open to the firewall so I can use option 11) Reload all services. Once I do, everything goes back to normal. I'm not sure what to do at this point. This firewall has been perfect for over 2 years. Now, something is wrong with it. Rebooting doesn't solve the problem long term. Back in Nov, it was once every few weeks, then Dec - Jan, every 2 to 4 days. Now, it happens every 8 to 12 hours.  :(

Connections are lost at random, and DNS does not resolve external FQDNs. My work computer is actively on Cisco VPN and is unaffected by this problem. That IPsec VPN from my laptop stays active, and all its internet traffic is backhauled over that session. When this problem starts, the rest of my house falls apart.

I would love to know what I should look at and a possible path to resolving or troubleshooting the issue. 

Thank you in advance.

Hello all,

I still have this problem, but it's gotten worse. The problem is either with my Fiber ISP, ONT ("modem") device, or my OPNsense router. If I reboot my OPNsense router or reload services, the problem disappears for about 5 to ~90 minutes. Then I lose internet access over that link. Right now, I disable the gateway for that provider, and I have no issues running on my Cable provider.

Any advice on what I can look at to see what is causing this issue? My hardware is an i5 Protectli. I just purchased another router from AliExpress to test if it is hardware-related. Once it gets here, I am going to install a fresh copy of OPNsense and apply my current configuration to it. I doubt it is hardware-related, but if the problem persists on the new router, it will have to be the ISP, the ONT, or my OPNsense configuration.

I would like to know if anyone has suggestions on what logs to look for or any troubleshooting advice. I have a support ticket with my fiber provider; of course, they don't see any problems. But I'm asking for a new ONT device. I have done everything I can think of. Part of me feels it's OPNsense because I can reload services from the CLI, and the internet is functional for another 5 to 90 minutes.

Also, another thing to mention is that when I reload services from the CLI menu, about 40 to 50% of the time, it will "hang" or stay on "Configuring WANF interface..." (which is the interface connected to my Fiber ONT device). With my cable provider's interface, which is named just WAN, it never hangs on configuring that interface.

Thanks for your time,
Nick


I'm still having this issue, only with my Fiber provider. I have been running off my cable provider, but it had issues on Monday, so I had to move to my fiber provider, and I can't go a full 24 hours without losing the internet completely. I really believe it's OPNsense now, because if I reload the services via the CLI, it will always hang on the interface connected to the fiber provider. After 20 to 30 seconds, the internet starts working.

I'm still trying to work out which logs I should look at. Any suggestions on what to troubleshoot?

OPNsense was great for years, and I'd hate to move away from it. But I might try pfSense.

Any advice would be great.

I won't be able to help you because I don't run with multi WAN despite having two fiber lines. But you are not providing any of your settings so it would take a lot of questioning to get to anything of use to assist.
First of all, I'd suggest going over the manual for multi WAN, although I imagine you've done that already. It won't hurt: https://docs.opnsense.org/manual/how-tos/multiwan.html
Second, provide all technical details of your setup sans sensitive info. People can't guess your setup!
What settings? All those from the manual: gateways, interfaces, firewall rules. What bits you have in your infra (pi-hole maybe, AdGuard), what services running on OPN? You see there's a lot of moving parts, and there's something not right. OR your ISP service is flaky. But from the description, the setup needs checking.

@cookiemonster, I appreciate the reply. In my original post, I did try to list the Multi-WAN settings (all the settings I thought would matter), with some highlighted in bold:
Quote from: FLguy on January 29, 2025, 03:57:30 PM
Before this problem started, I wasn't running a complete Multi-WAN configuration:
System : Settings : General : Gateway switching was not checked
Firewall : Settings : Advanced : Disable force gateway was not checked

I do have both gateways monitoring their default gateway.  The Fiber gateway is the preferred gateway and has a lower priority.
When the problem first started, I could disable the fiber gateway to restore internet access. 
If I disable the fiber gateway in System: Gateways: Configuration
or reboot the firewall
or Reload all services in cli
That will fix the problem.
I have tried other things like disabling the Cable WAN interface

Sorry, I didn't list my interfaces nor rules, as they are default for the most part.  No Pi-hole or AdGuard.  Funny thing is, really not many moving parts to my setup at all. 

That said, this problem isn't a MultiWAN issue (as I haven't discussed Multi-WAN since the first post). When I'm trying to troubleshoot this problem, my cable interface is entirely disabled.



What can make OPNsense stop routing traffic over the internet, where reloading all services fixes the problem?  Are there any log messages I should look out for?

Thanks for your time and support.

If your work VPN stays up and running, you don't have internet connectivity loss...
The fact that some clients can ping 1.1.1.1 while others can't is a clue. Any pattern across both groups?
Ping and trace route and DNS lookups from OPN itself?
WAN interface configuration on the fiber side? DHCP?
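
A minimal Python sketch of the split suggested above (reachability by IP versus DNS lookups), to run from a client or from OPN itself if Python is available there. The function names are hypothetical, and the TCP connect to port 53 is only a stand-in for ping, chosen because ICMP needs raw sockets; it assumes the target (for example 1.1.1.1) accepts TCP on that port.

```python
import socket


def can_connect(ip, port=53, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds.

    A numeric IP is used so that no DNS lookup is involved.
    This is a stand-in for ping, not an ICMP echo.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(hostname):
    """Return True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def classify(ip_ok, dns_ok):
    """Turn the two check results into a rough symptom label."""
    if ip_ok and dns_ok:
        return "ok"
    if ip_ok:
        return "dns-only failure"
    if dns_ok:
        return "resolver answers (possibly cached) but IP path is down"
    return "no connectivity"
```

Calling `classify(can_connect("1.1.1.1"), can_resolve("example.com"))` from a client in each group (the ones that can ping and the ones that can't) at the moment of an outage would show whether the two groups differ in routing or only in DNS.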

Personally, at some point, I would simplify for a while.
If possible, when the issue occurs, turn off the VPN (in case traffic is routed over there).
I can't run multi-WAN and I have no clue what happens when OPN is not configured for it BUT you have 2 WAN interfaces present...
One simplification here is to undo all multi-WAN config and reassign WAN to the physical interface of your choice until you have figured out what's going on.

When my DNS was failing, just like yours, my work laptop with its VPN was all good, only knew from the family complaining. I know the feeling.
I fixed that once I traced the problem. It took me a while to come up with the solution, but it won't help you because it has to do with a stub resolver I run, so it does not apply.
So similar suggestion, simplify. Stay with one WAN. Back to your original posts yes, sorry. I missed that you had experienced the problem also on single WAN.

So you're going to have to connect to your OPN when it happens and look in logs whilst your house screams at you, or dig in the logs knowing from what date/time you restarted them but there's no avoiding diagnostics.

It does smell like either interfaces flapping states or DNS. Or DNS due to the former.
You might want to consider resetting it from scratch for a single WAN, to avoid the risk of having a stray setting left over.
Either way, stick to one WAN and provider until it is solid before adding the second WAN to the setup, both physical and logical (for the settings as mentioned above).
With only one, then, back to basics:
- Verify your WAN settings. What to look for? Is it set up as per your ISP, that sort of thing? Is the gateway you monitor stable? etc.
- Your interface. Is it Realtek? Look for clues in the message buffer or /var/log/dmesg.today
- DHCP from your ISP, anything of note there?
Just off the top of my head.
Point is you need to narrow down the problem.

@cookiemonster, NO APOLOGY needed. I 1000% appreciate you trying to support me. I have tried to simplify the configuration as much as possible, to the point where I am disabling either the cable or fiber interface via Interfaces > WAN-C (or WAN-F), unchecking Enable Interface. When I run on the cable provider, I never have the problem unless the cable provider has an issue. I can see that because the router monitor is red for that link in Gateways.

Then, when I have the time and desire to troubleshoot this issue, I disable WAN-C and re-enable WAN-F. Most times, it will last 24 hours or more. Then, randomly, everything breaks. Of course, in this simplified configuration, my work laptop does lose VPN access. I'm a network engineer by trade.  ;)  Not a good look.  haha

Yes both WAN interfaces are using DHCP.  I will start looking at the dmesg.today log file.

Thanks very much for the support here. I was running this configuration for years, but something happened last November, and it hasn't been right since. Of course, the fiber provider wasn't helpful.

LOL, in the middle of submitting my last post above, the problem happened.

It is a Layer 2 or Layer 3 issue with my fiber provider.  My routing table still had the default route, but I could not ping the gateway from two different computers on two different VLANs.  Wired and Wireless.  Yet in Gateways, OPNsense shows that my gateway monitor is alive, which is my default route peer IP.  So internal clients can't reach the fiber provider peer IP, yet OPNsense thinks everything is fine. 

I'm digging into the details now.

So the System log shows nothing, other than my actions (config changes) to resolve the problem. Now dmesg.today is confusing me, as I see a reboot:

Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 0 0 done
All buffers synced.
Uptime: 16d1h20m50s
---<<BOOT>>---
Copyright (c) 1992-2023 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.

I don't remember rebooting the router, but it's possible I did. Here is the interesting part: right before the boot messages above, I have this message:

arpresolve: can't allocate llinfo for 1.2.3.4 on igb2
It appears over 100 times. Of course, 1.2.3.4 is a redaction of my Fiber provider gateway. igb2 is my fiber interface.  :)  SOMETHING.  What does this message mean?!!?  haha

I'm still looking into the details, but it doesn't appear to be a DHCP issue, as I have messages at 10:30 a.m. today that the WAN interface renews DHCP without any issues. My problem today started at 22:26. 

But this arpresolve message is my problem. I see many people with pfSense have the same issue. So far, static ARP and a static IP resolve the problem, but there is still no root cause for this issue...
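
To confirm whether the message shows up again at the next outage, a small Python sketch along these lines could tally it per gateway and interface. It assumes only the message format quoted above; the function name is hypothetical, and the sample lines stand in for the contents of /var/log/dmesg.today.

```python
import re
from collections import Counter

# Matches the kernel message quoted above and captures the
# gateway address and the interface name.
ARP_RE = re.compile(r"arpresolve: can't allocate llinfo for (\S+) on (\S+)")


def count_arpresolve(lines):
    """Count arpresolve failures per (gateway, interface) pair."""
    counts = Counter()
    for line in lines:
        match = ARP_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample input; in practice, pass an open log file instead of this list.
sample = [
    "arpresolve: can't allocate llinfo for 1.2.3.4 on igb2",
    "arpresolve: can't allocate llinfo for 1.2.3.4 on igb2",
    "Uptime: 16d1h20m50s",
]
print(count_arpresolve(sample))
```

Because an open file is iterable line by line, the same function can be fed `open("/var/log/dmesg.today", errors="replace")` directly.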

You now mention VLANs. That's new, and something to check. See if you have mixed tagged and untagged traffic. If yes, correcting that might be a gremlin to quash.
The ARP table being incorrect puts you back (I think) on the multi WAN setup. A fallback should take care of that, and I believe it means it is trying to reach a gateway outside of your WAN net. Don't you have to set up a separate gateway for each WAN for multi-WAN, with "Far gateway" being an important setting (I don't know which settings are correct)?
Someone with better knowledge of multi-WAN should be able to advise.
I wonder if your far gateways are both set for the one ISP and hence can't reach it when it fails over to the other. Just a thought. Could well be completely off the mark.

+1 on the error message being multi-WAN related.
Per franco on another thread: The error comes from trying to reach a gateway outside of your WAN subnet.
If you still have the GW group (even though you disable one of the interfaces) and the active one goes down...
Or possibly you get a bogus WAN IP (seems unlikely).
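
If that explanation holds, a quick sanity check is to compare the address and prefix the fiber WAN got from DHCP against the gateway address in the error. A minimal Python sketch, with a hypothetical function name and RFC 5737 documentation addresses standing in for the real (redacted) ones:

```python
import ipaddress


def gateway_on_link(wan_cidr, gateway):
    """Return True if gateway lies inside the subnet of the WAN address.

    wan_cidr is the interface address with its prefix length,
    e.g. "203.0.113.10/24" (a placeholder, not a real lease).
    """
    network = ipaddress.ip_interface(wan_cidr).network
    return ipaddress.ip_address(gateway) in network


# Placeholder values for illustration only.
print(gateway_on_link("203.0.113.10/24", "203.0.113.1"))
print(gateway_on_link("203.0.113.10/24", "198.51.100.1"))
```

A False result for the real values at the time of the outage would be consistent with the "gateway outside of your WAN subnet" explanation; a True result would point elsewhere.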

That's one of the reasons I suggested to reassign WAN instead of enabling/disabling WAN-C/WAN-F. No GW group.
Of course, I don't know how different your interfaces configs are, so it may be too cumbersome.

I wouldn't entirely rely on the GW monitor. Again, you can go to Interfaces > Diagnostics and check things out.
There's also ssh and command line...

Quote from: cookiemonster on March 31, 2025, 02:47:10 PM
You now mention VLANs.
My bad, VLAN and "LAN segments" have no issues reaching each other and the cable internet when it is active. This is 100% a WAN issue, only with my fiber provider. Every post/thread I read on the arpresolve message in both pfSense and OPNsense reads like my experience. Most if not all of them are random in nature and started out of the blue. None of these threads has found a root cause for this issue. I have seen workarounds. Next time it happens, I want to confirm that I get these messages before I repair the issue.

No, for far gateway, I can easily see that as a cause of this problem. I even tried to disable gateway monitoring and Host Route. I have read many posts about disabling the gateway monitor (aka dpinger). 

Quote from: EricPerl on March 31, 2025, 07:39:51 PM
Per franco on another thread: The error comes from trying to reach a gateway outside of your WAN subnet.
If you still have the GW group (even though you disable one of the interfaces) and the active one goes down...
Or possibly you get a bogus WAN IP (seems unlikely).

That's one of the reasons I suggested to reassign WAN instead of enabling/disabling WAN-C/WAN-F. No GW group.
Of course, I don't know how different your interfaces configs are, so it may be too cumbersome.

I wouldn't entirely rely on the GW monitor. Again, you can go to Interfaces > Diagnostics and check things out.
There's also ssh and command line...

I would like to see the message from Franco. 

So, there are no gateway groups. I did configure one to try to resolve this problem, but it didn't work, so I removed it. I believe gateway groups are more for PBR than anything else. This issue is 100% a Layer 2 or Layer 3 issue between my fiber provider's ONT device and my router. That arpresolve message popping up on other setups, and the problem it causes, is exactly my issue.

Moving WAN-F to another interface is a good suggestion. I don't have any inbound rules to move. 

Without being told to look at the dmesg.today, I would still be in the dark.  Many posts with my exact experience.  VERY frustrating issue for everyone. 

Sorry, I don't have knowledge of multi WAN. Last parting thought. My suspicion is that there is a misconfiguration that can only be found by addition rather than elimination at your stage. Or rather, that is the approach I would take. You know when you have made so many diagnostic changes that things are rather convoluted and you can't remember what was what. Happens.
So I would set up OPN from scratch with ONE provider, one WAN only. The problematic one. That's your "safe place". Get it working. That will tell you the setup is right, before moving to any multi WAN setup.