OPNsense Forum

Archive => 19.1 Legacy Series => Topic started by: alh on March 27, 2019, 11:34:54 pm

Title: [SOLVED] OPNsense locks up completely when WAN allegedly goes down
Post by: alh on March 27, 2019, 11:34:54 pm
We have installed OPNsense on a SuperMicro system with roughly the following setup:

- 1 WAN uplink via a cable modem (monitored via Google DNS)
- both local users and LDAP users (LDAP is cloud-based)
- 2 LAN ports with different subnets
- DHCP services
- OpenVPN RoadWarrior services
- S2S IPsec
- FreeRADIUS

We have the serious problem that if the WAN port gets disconnected or OPNsense thinks that the gateway is down, the system locks up completely, e.g.:

- It is impossible to log on to the firewall anymore (I understand that LDAP auth must fail, but local users should ALWAYS work)
- The WAN interface does not come up anymore, the gateway stays down

After a forced reboot there are even stranger issues:

- DHCP ranges get messed up (e. g. x.x.x.50-x.x.x.100 & x.x.x.150-x.x.x.200 becomes x.x.x.50-x.x.x.254 & x.x.x.150-x.x.x.200 causing the service to fail because of overlapping ranges)
- OpenVPN services show as down in the GUI but are actually running, they need to be killed in the console and then restarted in the GUI
- The gateway/WAN does not come up anymore. It shows extremely high latency etc. while the uplink is perfectly fine and any other machine (macOS, Windows notebooks) works. One has to click madly back and forth and hope that some action brings the gateway up again

What could be the cause of this? Is the external LDAP the problem? However, this should not have all the other mentioned side effects. Why does the gateway not come up anymore? Any help appreciated since this happens every couple of days and costs hours to get Internet working again.
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: bartjsmit on March 28, 2019, 08:40:42 am
Can you try an internal LDAP replica? Using cloud LDAP sounds like it could introduce a circular dependency.

Bart...
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: alh on March 28, 2019, 11:42:52 am
I can try, but isn't it a bit weird that a missing LDAP connection throws the whole system into turmoil?
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: rabievdm on March 28, 2019, 11:52:12 am
One thing I have noticed (not quite related to your issue) is that when the WAN interface goes down the WUI appears to not be available yet I can log on via SSH.

What I have worked out is that I have a couple of plugins for AV and site reputation checking, and even though the firewall URL has been added to the exclusion list, some form of lookup still appears to be attempted.

If I leave the page long enough it will eventually load, albeit slowly.

On using cloud-based auth: it's a bit of a double-edged sword, but agreed, if you do have a fallback configured then it 'should' fall back to that. Just speculating here: if you have other lookups occurring against LDAP, not just user auth for the firewall admin, then I could understand the firewall taking a nose dive as it tries to look up IDs and has to wait x seconds for each timeout.

Not sure if this is possible or practical, but having a local cached copy of your LDAP DB might also assist.

But agreed it does sound like unexpected behavior.
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: alh on March 28, 2019, 03:28:15 pm
What I also noticed is that the firewall says that our gateway is offline. However, I can connect perfectly well via VPN and SSH through this gateway. I really don't understand what is happening here...
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: chemlud on March 28, 2019, 03:33:44 pm
apinger or dpinger (?) going mad? do you need gateway monitoring? try to switch off...
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: alh on March 28, 2019, 09:12:40 pm
We switched it off for now. But I wonder what is going to happen if one has a Multi-WAN setup. If dpinger goes mad, then what?

We will test this weekend what is going to happen with monitoring off when we pull the WAN plug.
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: chemlud on March 28, 2019, 10:08:12 pm
apinger has a history of such issues

https://redmine.pfsense.org/issues/3227

since 19.1 dpinger should be default, but you never mentioned your version of opnsense ;-)

https://www.c0urier.net/2019/opnsense-19-1-released

anything related to pinger in the logs? when monitoring shows interface as down?

you could try a different monitoring IP (e.g. 9.9.9.9), although this might be voodoo...
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: alh on March 28, 2019, 10:19:12 pm
Sorry, I assumed that posting in the 19.1 thread would suggest that we are on 19.1, 19.1.4 to be exact.

I only see these entries in the log:

Code:
dpinger: WANGW 1.1.1.1: Alarm latency 13415us stddev 3394us loss 22%
dpinger: GATEWAY ALARM: WANGW (Addr: 1.1.1.1 Alarm: 1 RTT: 13415ms RTTd: 3394ms Loss: 22%)

Before Cloudflare we used our ISP's Gateway. Same issue.
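
For anyone wanting to pull the numbers out of such alarm lines (for example to graph loss over time), here is a minimal Python sketch. It assumes the exact "GATEWAY ALARM" format quoted above; the function name `parse_alarm` and the regular expression are illustrative and not part of OPNsense or dpinger. The unit suffix is deliberately ignored, because the two log lines disagree on it (13415us vs. 13415ms for the same value).

```python
import re

# Pattern written against the single sample line quoted above; other
# OPNsense/dpinger versions may log a different format (assumption).
ALARM_RE = re.compile(
    r"GATEWAY ALARM: (?P<name>\S+) \(Addr: (?P<addr>\S+) Alarm: (?P<alarm>\d+) "
    r"RTT: (?P<rtt>\d+)\S* RTTd: (?P<rttd>\d+)\S* Loss: (?P<loss>\d+)%\)"
)

def parse_alarm(line):
    """Return the fields of a GATEWAY ALARM log line, or None if it does not match."""
    m = ALARM_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "addr": m.group("addr"),
        "alarm": int(m.group("alarm")),
        "rtt": int(m.group("rtt")),      # unit suffix dropped on purpose
        "rttd": int(m.group("rttd")),    # unit suffix dropped on purpose
        "loss_pct": int(m.group("loss")),
    }

line = ("dpinger: GATEWAY ALARM: WANGW (Addr: 1.1.1.1 Alarm: 1 "
        "RTT: 13415ms RTTd: 3394ms Loss: 22%)")
print(parse_alarm(line))
```

If the first log line's microsecond unit is the correct one, the latency here is only about 13 ms, which would suggest the alarm was raised by the 22% packet loss rather than by latency.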
Title: Re: OPNsense locks up completely when WAN allegedly goes down
Post by: alh on April 14, 2019, 11:48:43 am
After some more analysis we think we have tracked down the cause of the issue: a mixture of ISP failure, cable modem error, misconfiguration and some unexpected OPNsense behaviour.

The setup was:

- Gateway monitoring enabled on the WAN-Gateway
- The gateway on the WAN interface was set to "auto-detect"
- We had a VTI-IPsec to Azure configured

What probably happened:

- Upon connection loss, the WAN gateway went down
- Since the WAN interface was set to "auto-detect" the gateway, it probably switched to the IPsec gateway (which was of course also down, but not monitored)
- Now all subsequent auth requests to the firewall took a long time, probably because the LDAP connection over the IPsec gateway took a long time to time out so it appeared that the firewall locked up completely
- To add to our problems, the cable modem of the provider (a Fritz!Box 6490 Cable from Unitymedia) now lost its 'exposed host' configuration, since the WAN interface of the firewall no longer had the WAN gateway configured but the IPsec gateway instead. Therefore there was no chance for the WAN gateway to come up again
- Now the DHCP service of the OPNsense went completely crazy, didn't work anymore and even lost its config
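
The apparent lock-up caused by slow LDAP timeouts can be illustrated with a small Python sketch: a remote authentication call is given a hard deadline, and local users are checked once it expires. This is only an illustration of the general idea, not how OPNsense implements authentication; `authenticate`, `slow_ldap` and `local_users` are hypothetical names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def authenticate(user, password, remote_auth, local_auth, timeout_s=3.0):
    """Try remote_auth with a hard deadline, then fall back to local_auth.

    Returns "remote", "local", or None if both reject the credentials.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remote_auth, user, password)
    try:
        if future.result(timeout=timeout_s):
            return "remote"
    except FutureTimeout:
        pass  # remote directory did not answer within the deadline
    except Exception:
        pass  # remote directory unreachable or returned an error
    finally:
        pool.shutdown(wait=False)  # do not wait for the stuck call to finish
    return "local" if local_auth(user, password) else None

def slow_ldap(user, password):
    # Hypothetical stand-in for an LDAP bind over a dead tunnel.
    time.sleep(0.5)
    return True

def local_users(user, password):
    # Hypothetical local user database with a single account.
    return (user, password) == ("root", "secret")

print(authenticate("root", "secret", slow_ldap, local_users, timeout_s=0.1))
```

Without such a deadline, every login attempt waits for the full network timeout of the remote lookup, which from the outside looks like a frozen firewall.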

What we learned from this:

- Unitymedia is probably not the best provider around, make sure you get a real cable modem from them and not some bridged consumer hardware (Fritz!Box)
- Despite what the official documentation says: ALWAYS set the gateway on your WAN ports
- Also bear in mind that your ISP might block your monitoring pings (Unitymedia does this), so inquire beforehand and increase the probe interval and time period if necessary
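
To get a feeling for how probe interval and time period interact, here is a simplified Python model of loss averaging over a sliding window. It is an assumption-laden sketch and not dpinger's actual algorithm; `loss_percent` and its parameters are made up for this example.

```python
def loss_percent(replies, probe_interval_s, time_period_s):
    """Percentage of lost probes within the most recent time period.

    replies is a list of booleans, oldest first (True = reply received).
    Simplified model: the window holds time_period / probe_interval probes.
    """
    window = max(1, int(time_period_s / probe_interval_s))
    recent = replies[-window:]
    if not recent:
        return 0.0
    lost = sum(1 for ok in recent if not ok)
    return 100.0 * lost / len(recent)

# 115 answered probes followed by a short burst of 5 dropped ones.
history = [True] * 115 + [False] * 5

print(loss_percent(history, probe_interval_s=1, time_period_s=10))  # short window
print(loss_percent(history, probe_interval_s=1, time_period_s=60))  # long window
```

In this model the same five dropped pings count as 50% loss in a 10-second window but only about 8% in a 60-second window, so a longer time period makes the monitor less likely to declare the gateway down because of a few filtered or rate-limited probes.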

What is still unclear to us:

- Why did the DHCP service on LAN stop working and even lose its configuration (with overlapping ranges as a result)?