OPNsense stops responding

Started by BigBob, September 07, 2020, 04:13:41 PM

This one has me stumped.

I am new to OPNsense, but I've run pfSense for over 5 years and I know my way around it pretty well. I recently acquired a Protectli FW6 device and decided to install OPNsense and give it a go.

Install went fine, Dual WAN failover worked great. Rules applied, Dynamic DNS and ACME set up without issue. All good. About 10 hours after installing it, I started to have DNS issues. Could not resolve anything. I checked unbound and it was not responding. I used the gui to restart unbound but that resolved nothing. I then tried to ping the device and got no reply, yet the webgui was still up and working. Traffic was still routing, I could ping externally by IP address and after I set up another DNS server, I could ping by name as well.

Basically, after 10 hours, the OPNsense device itself stopped responding to anything but the webgui. I used the logs to check for rule issues and I could see ICMP echo requests hitting the device and being passed, but no reply was received at the client. It failed from multiple machines as well. Restarting the device made no difference.

I then erased the disk in the Protectli device and reinstalled OPNsense 20.7. Ran the updates and set everything up again. Ping worked, nslookup worked, life was good. 12 hours later, same issue. Web browsing worked (seemed a little sluggish though) but ping fails, nslookup fails, ssh fails, webgui still functional. No errors that I could see. This time I powered off the device, waited 2 minutes and powered back on. Everything worked as normal again.

Heat? The thermal sensors from OPNsense showed 37C for the CPU, which shouldn't be a problem. I am at a loss. I may install pfSense on it to see if the problem re-occurs. My current pfSense device is back online and working fine (Supermicro C2758).

What can I do to figure this out?

TL;dr: Device itself stops responding to anything but the webgui. Traffic is still routed and rules applied, but unbound, ping, and ssh don't work. Power off and on, everything works again.
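Since the failure takes 10 or more hours to show up, a small logger on a LAN client could record when each service stops answering. The following is only a rough sketch, not a tested diagnostic tool: it assumes the firewall is at 192.168.1.1 (placeholder), that ssh listens on port 22 and the webgui on port 443, and that unbound answers on UDP port 53. All function names are made up for this example. It does not send ICMP itself, since that needs raw sockets; the system ping command can be run alongside it.

```python
import socket
import struct
import time


def build_dns_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (one question, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def dns_answers(server, name="example.com", timeout=2.0):
    """Return True if the server sends back any reply carrying our query ID."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable errors
            return False
    return reply[:2] == query[:2]


def tcp_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(firewall):
    """Check the three services from the symptom list (ports are assumptions)."""
    return {
        "dns": dns_answers(firewall),
        "ssh": tcp_open(firewall, 22),
        "webgui": tcp_open(firewall, 443),
    }


def format_line(stamp, results):
    """Render one log line, e.g. '2020-09-07 16:13:41 dns=ok ssh=FAIL webgui=ok'."""
    parts = [f"{name}={'ok' if ok else 'FAIL'}" for name, ok in results.items()]
    return f"{stamp} " + " ".join(parts)


def monitor(firewall="192.168.1.1", interval=60, rounds=None):
    """Print one log line per interval; rounds=None keeps going until interrupted."""
    done = 0
    while rounds is None or done < rounds:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(format_line(stamp, probe(firewall)), flush=True)
        done += 1
        time.sleep(interval)
```

Calling monitor() and redirecting the output to a file would give a timestamped record to compare against the OPNsense logs once the device stops responding.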

When the device is in this state, can you take some screenshots?

- gateway section
- dashboard
- pftop

Can you also try the following while it is in the non-working state:

- ping gateways
- trigger states reset
- ping again
- go to the gateways section, change something, save and apply, and verify again whether things are working, please.
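The ping / reset / ping sequence above could be scripted roughly like this so the before and after results are captured side by side. This is only a sketch under assumptions: the gateway addresses are placeholders, the function names are invented for the example, and the states reset itself is still done by hand in the webgui while the script waits.

```python
import platform
import subprocess


def ping_command(host, count=3):
    """Build a ping command line for the local OS (-n on Windows, -c elsewhere)."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def check_hosts(hosts, run=subprocess.run):
    """Ping each host and map it to True/False based on ping's exit status."""
    results = {}
    for host in hosts:
        proc = run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[host] = proc.returncode == 0
    return results


def changed_hosts(before, after):
    """Return the hosts whose reachability differs between the two runs."""
    return sorted(h for h in before if before[h] != after.get(h))


def gateway_check(hosts, wait_for_reset=input, run=subprocess.run):
    """Ping, pause for the manual states reset, ping again, and report."""
    before = check_hosts(hosts, run=run)
    wait_for_reset("Reset the firewall states now, then press Enter... ")
    after = check_hosts(hosts, run=run)
    return before, after, changed_hosts(before, after)
```

For example, gateway_check(["192.0.2.1", "198.51.100.1"]) with the real gateway addresses substituted would return both result sets plus the list of gateways whose status changed after the reset.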

What specific version do you run?
Do you run IDS/IPS?
Twitter: banym
Mastodon: banym@bsd.network
Blog: https://www.banym.de

Thanks for your quick response. I probably can't put the device back online until the weekend. The kids are doing school from home, and since it has taken at least 10 hours to fail, the failure would land during school hours and that ain't good.

In both cases, I installed 20.7 from flash drive and immediately ran updates to get to 20.7.2 (amd64/OpenSSL). The only packages installed are ACME, dyndns, and wireguard. LetsEncrypt and dyndns were configured and working. Wireguard was installed but not enabled. Oh, and I loaded the 4 themes in the plugin repository as well.

The fact that I have to power off the device sounds a bit like hardware to me, but then again, why does everything else seem to be working?

I will definitely get those screenshots asap. I will try the steps you listed as well.

Do you have the latest available coreboot installed?
Some known bugs on APU devices are fixed in newer coreboot versions. To be sure you are not running into such a problem on your hardware, check whether it is on the latest version and look through the release notes; maybe something is mentioned there.