OpnSense stops routing all traffic when WAN drops

Started by anomaly0617, December 17, 2019, 04:50:40 PM

December 17, 2019, 04:50:40 PM Last Edit: December 17, 2019, 05:36:57 PM by anomaly0617
Hi all,

I've got a weird one, but I've now seen it at different locations and it's concerning.

Please note that this one is very similar to this post, also by me, from a while back.

First, some specs because well, everyone loves specs:

System Information
Name frasvrfw.{redacted}
Versions OPNsense 19.7.6-amd64
FreeBSD 11.2-RELEASE-p14-HBSD
OpenSSL 1.0.2t 10 Sep 2019
CPU Type Intel(R) Xeon(R) CPU X3450 @ 2.67GHz (8 cores)
Load average 0.53, 0.38, 0.28
Uptime 1 days 03:06:07
Current date/time Tue Dec 17 10:33:25 EST 2019
Last config change Mon Dec 16 3:31:41 EST 2019
State table size 1 % ( 8526/814000 )
MBUF Usage 1 % ( 7100/506546 )
Memory usage 13 % ( 1099/8144 MB )
SWAP usage 0 % ( 0/8192 MB )
Disk usage 2% / [ufs] (1.4G/101G)


Interfaces
   GUEST 1000baseT <full-duplex> 192.168.25.254
   LAN 1000baseT <full-duplex> 192.168.3.1
   PLC 1000baseT <full-duplex> 192.168.20.1
   Phones 1000baseT <full-duplex> 192.168.9.1
   Printers 1000baseT <full-duplex> 192.168.6.1
   SAN 1000baseT <full-duplex> 192.168.10.1
   SANBACKUPS 1000baseT <full-duplex> 192.168.16.254
   SECUREWIFI 1000baseT <full-duplex> 192.168.5.1
   SECURITY 1000baseT <full-duplex> 192.168.50.1
   WAN 1000baseT <full-duplex> {redacted}


Ok, with that out of the way, I've got a Dell PowerEdge R210 running as an OpnSense firewall/gateway between multiple networks (separated into VLANs and their own subnets) with the OpnSense firewall at the center.

We've now had this happen twice: on 16-Dec-2019 between 3:20 AM and 7:20 AM, and about a month earlier during a 35-minute power outage. So I no longer think this is a "fluke."

The trigger:
The WAN connection drops at the provider end (in other words, between the internet provider's fiber router and their closest node)

The symptom:
OpnSense stops routing all traffic -- even internal traffic between internal subnets, such as from the LAN to the Printers network. This continues even after the WAN connection is restored and the internet is available again.

The workaround resolution:
Reboot the firewall and all the problems go away... until the next time the internet connection drops.

In both instances, the power in the area has gone out, and our on-site battery backups and generator have kept the building up and running throughout. So my OpnSense logs do not show that the firewall has rebooted. But if I go to Reporting -> Health and look at any metric, e.g.:

Packets -> LAN
Packets -> WAN
Packets -> IPSec
Quality -> Gateway

They are all flat-lined between those times.

So the question is, what would cause OpnSense to stop routing traffic between internal networks when the WAN connection drops, and is there a way to fix it, short of setting up a check_internet script that reboots the firewall if it can't get to something really common like google?

Thanks in advance, all!

December 17, 2019, 05:23:24 PM #1 Last Edit: December 17, 2019, 05:25:57 PM by chemlud
...haven't seen anything related here when WAN goes down. Are you sure that all your switches and clients are covered by the UPS?

Can you reproduce the issue by simply pulling the plug on your fibre modem?
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

Felix Eichhorn's premium cat food with the extra portion of energy

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

We haven't tried that yet. Running a multi-site business, management gets a little... testy... when we arbitrarily decide to take the network down for no perceivable reason. This may be something we test on a weekend in the early morning hours.

Did you ever figure this out? We get the same thing with our not so reliable internet. When the WAN goes down we lose internal traffic shortly after. Can't print, access internal Nextcloud, etc. Supposedly going to get fiber out here soon, which should be far more reliable than our WISP. But until then, we lose internet at least a few times a week.

Not so far. I'm using my check_internet.sh script to get around the problem at the moment, but one thing I've noticed is that the cron job does not persist from upgrade to upgrade, so I have to manually put it back in.

It's unfortunate that you are seeing the same problem, but also reassuring that I'm not the only one. :-/

We don't have any power outages or reboots happening. Our WISP has issues some 25 or so miles away on a mountain top, and the internet goes down for anywhere between a few minutes and a few hours. Once the internet is down, internal routing soon follows. I'm using Unbound and wondered if switching to DNSMasq would resolve it, but haven't felt like trying it. I wish I were knowledgeable enough to find info in the logs that might help.

I had a similar problem, though a little different.

For example: at a remote site, whenever I had to reboot an attached switch, my VPN tunnel went down and stayed down until the uplink port to the internal switch was up again or OPNsense was rebooted. The Internet connection of OPNsense itself was not affected.

I found that once I ticked the checkbox "Allow default gateway switching" under System - Settings - General, the problem did not come back. I don't understand why, but it has been working since.

Maybe that could be a solution for you too.

anomaly0617, did you figure out your internal routing issue? I got my internal routing working stable regardless of the WAN by leaving all interfaces selected on Unbound's outgoing interface setting. Re-tested it on a fresh install and limiting the outbound interfaces killed internal routing without WAN. I selected all again and unplugged the WAN and never lost internal routing.

Just checked and no, that doesn't resolve it for me. From what I can tell, at least on my installs, the fix seems to be in rebooting the firewall if it cannot ping a server on each of the networks and a publicly hosted website like google.com. The problem is, every time you upgrade the firewall, the crontab rule that kicks this script off gets removed, so I have to re-add it manually.

Thanks for the idea, though! It made me go through my Unbound configuration with a fine-tooth comb!
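
For anyone wanting to try the same workaround, here is a minimal Python sketch of the check described above. It is not the actual check_internet.sh from this thread. The probe addresses are hypothetical placeholders, and the policy (reboot if any target fails to answer) is just one reading of "cannot ping a server on each of the networks and a public site".

```python
#!/usr/bin/env python3
"""Connectivity watchdog sketch: reboot when a probe target stops answering."""
import subprocess
import sys

# Hypothetical probe targets: one always-on host per internal network,
# plus one public name. Replace these with hosts that exist on your site.
INTERNAL_HOSTS = ["192.168.3.10", "192.168.6.10", "192.168.9.10"]
PUBLIC_HOST = "google.com"


def ping(host, count=3):
    """Return True if `host` answers at least one of `count` echo requests."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_targets(targets, probe=ping):
    """Return the subset of `targets` for which `probe` reports no reply."""
    return [host for host in targets if not probe(host)]


def should_reboot(internal_hosts, public_host, probe=ping):
    """Assumed policy: reboot if any internal host or the public host is down."""
    return bool(unreachable_targets(internal_hosts + [public_host], probe))


def main():
    if should_reboot(INTERNAL_HOSTS, PUBLIC_HOST):
        # Stock FreeBSD reboot command; needs root.
        subprocess.run(["shutdown", "-r", "now"], check=False)


# Require an explicit flag so importing or test-running never reboots a box.
if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

Note that with this policy the box reboots on every scheduled run for as long as the WAN stays down. Passing only the internal hosts to unreachable_targets would limit reboots to the case where internal networks stop answering.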

Why should a cron job get deleted by updating OPNsense? I have custom cron jobs defined as templates via console and set up via GUI, they survive any update. Only if you do a fresh install you have to do the console part again, as it is not stored in config.xml.
kind regards
chemlud

Quote from: agrumpyhermit on January 08, 2020, 01:44:27 AM
We don't have any power outages or reboots happening. Our WISP has issues some 25 or so miles away on a mountain top, and the internet goes down for anywhere between a few minutes and a few hours. Once the internet is down, internal routing soon follows. I'm using Unbound and wondered if switching to DNSMasq would resolve it, but haven't felt like trying it. I wish I were knowledgeable enough to find info in the logs that might help.


Are you adding that cron job the official way, using actions.d?
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

Quote from: marjohn56 on February 16, 2020, 09:59:44 PM
Are you adding that cron job the official way, using actions.d?

My guess is, probably not. I'm adding it via the console using crontab. There appears to be no GUI-based way to add a cron job that isn't from the pre-populated list. With that said, if there's a better way (actions.d) let me know how to do it and I'll change up my instructions to match once I try it. It would be really nice if I could get these to persist across upgrades!

Thanks!
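
For reference, a rough sketch of what the actions.d approach could look like. The file name, section name, and script path below are hypothetical placeholders. The file would go in /usr/local/opnsense/service/conf/actions.d/, for example as actions_checkinternet.conf:

```
[run]
command:/usr/local/bin/check_internet.sh
parameters:
type:script
message:running connectivity check
description:Check connectivity and reboot if down
```

After running "service configd restart", the action should be callable as "configctl checkinternet run". The description line is what should make it appear in the command list under System - Settings - Cron. A job added there is stored in config.xml, while the actions.d file itself is not, so it would need to be recreated after a fresh install.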