Full system freeze on any config change

Started by rungekutta, January 24, 2021, 09:59:49 PM

January 24, 2021, 09:59:49 PM Last Edit: January 24, 2021, 10:24:20 PM by rungekutta
Hi guys,
My OPNsense box has started to freeze up completely on any config change I make through the GUI. I am not sure what has changed, other than that I installed the node exporter plugin.

The symptom is a stalled GUI and complete system freeze; it doesn't even respond to ping.

Log file around the time shows

[...]
2021-01-24T21:33:01 opnsense[53206] /usr/local/etc/rc.linkup: DEVD Ethernet attached event for lan
2021-01-24T21:33:01 kernel igb1: link state changed to UP
2021-01-24T21:32:58 kernel pflog0: promiscuous mode enabled
2021-01-24T21:32:58 kernel pflog0: promiscuous mode disabled
2021-01-24T21:32:57 opnsense[63288] /usr/local/etc/rc.linkup: DEVD Ethernet detached event for lan
2021-01-24T21:32:57 kernel igb1: link state changed to DOWN
2021-01-24T21:32:55 configctl[1695] event @ 1611520374.79 exec: system event config_changed
2021-01-24T21:32:55 configctl[1695] event @ 1611520374.79 msg: Jan 24 21:32:54 xxx.yyy.org.uk config[79201]: config-event: new_config /conf/backup/config-1611520374.7843.xml


I.e. it seems to enter some re-initialization of all network interfaces after my config change? In any case, it appears to leave OPNsense dead and unresponsive afterwards. On a soft reset (using the hardware reset button) it shuts down OK, and when it comes back up it has taken my config change and works as normal again.
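To make the sequence easier to read, here is a minimal Python sketch that puts the excerpt above into chronological order (the GUI lists newest first) and prints how long after the config_changed event each kernel message appeared. The function names parse and seconds_after_config_change are made up for illustration, and the sample lines are a subset of the excerpt. This only orders the events; it does not show what causes the freeze.

```python
from datetime import datetime

# Sample lines copied from the excerpt above (newest first, as the GUI shows them).
LOG = """\
2021-01-24T21:33:01 kernel igb1: link state changed to UP
2021-01-24T21:32:58 kernel pflog0: promiscuous mode enabled
2021-01-24T21:32:58 kernel pflog0: promiscuous mode disabled
2021-01-24T21:32:57 kernel igb1: link state changed to DOWN
2021-01-24T21:32:55 configctl[1695] event @ 1611520374.79 exec: system event config_changed
"""


def parse(text):
    """Return (timestamp, message) tuples sorted oldest first."""
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, _, message = line.partition(" ")
        events.append((datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), message))
    # The input is newest first, so reverse it before the (stable) sort;
    # lines that share a timestamp then come out in their original order.
    return sorted(reversed(events), key=lambda e: e[0])


def seconds_after_config_change(events):
    """Pair each later message with its delay after the first config_changed event."""
    start = next(t for t, m in events if "config_changed" in m)
    return [(int((t - start).total_seconds()), m) for t, m in events if t > start]


for delay, message in seconds_after_config_change(parse(LOG)):
    print(f"+{delay}s {message}")
```

On the sample lines this shows igb1 going down 2 seconds after the config change and coming back up 4 seconds later, with the pflog0 toggle in between.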

Edit: should have said I am running 20.7.8 and upgraded recently. I saw the same behaviour on 20.7.7 just before I upgraded. Also, I notice now the freeze doesn't always happen. I am not sure yet what the pattern is though.

Where should I start to troubleshoot...?

Hi,

I experienced very similar behavior a couple of weeks ago (you can see my thread on that topic). I observed the same pflog0: promiscuous mode enabled / disabled messages, and the system became totally unresponsive.

I needed to do a cold reset of the ESXi host on which my VM is running in order to get things working again (restarting the VM did not solve the problem).

I've spent some time looking at log files but, except for the pflog0: promiscuous mode enabled message in the dmesg output, found nothing  :-[

The system has been running well since then, but I still see this message from time to time, just a couple of disable / enable pairs.

I must admit I lost a bit of trust in my system since that happened; I don't really like such strange behavior coming out of nowhere...

Let me know if you were able to find some clues...

Cheers

Hi, thanks for your reply. No more clues, but it also hasn't happened for a while. Maybe it's random, maybe it's related to which setting is changed. I agree it's a bit unnerving.

Still having this issue...

There is nothing else in the log files, except those promiscuous mode enabled/disabled messages.

Just asking myself WTF this could be related to...

The only other thing I see is this strange arp message, which usually appears near the promiscuous mode enabled/disabled messages:

Feb  4 04:26:28 mercure kernel: pflog0: promiscuous mode enabled
Feb  4 04:29:19 mercure kernel: pflog0: promiscuous mode disabled
Feb  4 04:29:19 mercure kernel: pflog0: promiscuous mode enabled
Feb  4 12:44:24 mercure kernel: arp: 192.168.10.30 moved from 3e:6e:14:db:09:a0 to 62:f6:45:ef:ad:9d on vmx0
Feb  4 20:17:44 mercure kernel: arp: 192.168.10.32 moved from 3e:6e:14:db:09:a0 to 62:f6:45:ef:ad:9d on vmx0
Feb  5 05:27:26 mercure kernel: pflog0: promiscuous mode disabled
Feb  5 05:27:26 mercure kernel: pflog0: promiscuous mode enabled
Feb  5 05:30:53 mercure kernel: pflog0: promiscuous mode disabled
Feb  5 05:30:53 mercure kernel: pflog0: promiscuous mode enabled
Feb  5 07:12:49 mercure kernel: arp: 192.168.10.30 moved from 3e:6e:14:db:09:a0 to 12:8f:43:36:fc:b3 on vmx0


No idea at all how I could investigate further...
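One possible starting point for the arp messages above is a small Python sketch that extracts the "moved from ... to ..." lines and checks whether the MAC addresses are locally administered (bit 0x02 of the first octet set). Such addresses are assigned by software rather than burned in by a manufacturer, for example by clients that randomize their MAC; whether that is what happens here is only an assumption. The names arp_moves and is_locally_administered are made up for illustration.

```python
import re

# Sample lines copied from the log excerpt above.
LINES = [
    "Feb  4 12:44:24 mercure kernel: arp: 192.168.10.30 moved from 3e:6e:14:db:09:a0 to 62:f6:45:ef:ad:9d on vmx0",
    "Feb  4 20:17:44 mercure kernel: arp: 192.168.10.32 moved from 3e:6e:14:db:09:a0 to 62:f6:45:ef:ad:9d on vmx0",
    "Feb  5 05:27:26 mercure kernel: pflog0: promiscuous mode disabled",
    "Feb  5 07:12:49 mercure kernel: arp: 192.168.10.30 moved from 3e:6e:14:db:09:a0 to 12:8f:43:36:fc:b3 on vmx0",
]

ARP_MOVED = re.compile(r"arp: (\S+) moved from (\S+) to (\S+) on (\S+)")


def is_locally_administered(mac):
    """True when the universal/local bit (0x02) of the first octet is set."""
    return bool(int(mac.split(":")[0], 16) & 0x02)


def arp_moves(lines):
    """Return (ip, old_mac, new_mac, interface) for every 'arp: ... moved' line."""
    moves = []
    for line in lines:
        match = ARP_MOVED.search(line)
        if match:
            moves.append(match.groups())
    return moves


for ip, old, new, iface in arp_moves(LINES):
    flags = [is_locally_administered(mac) for mac in (old, new)]
    print(f"{ip} on {iface}: {old} -> {new}, locally administered: {flags}")
```

All three MAC addresses in the sample have that bit set, so they are not vendor-assigned hardware addresses. The arp lines would then describe the same IP being claimed by different software-assigned MACs, which may or may not be related to the freeze.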

Dear all,

Still having this weird error in my log files.

When it does occur, I can see the following message in the backend log :

2021-02-10T20:57:59 configd.py[906] [92fc4b09-3a9d-4b95-8913-5828f0215d6b] Reloading filter

Any idea what it is?

R.

Digging deeper into this and trying to correlate when it happened, I think it only happens when I'm connected remotely using WireGuard  ???

Is there any chance this is linked to a configuration mistake? Even so, a configuration mistake leading to a full OPNsense system freeze does not make sense...

Cheers,

R.

This drives me nuts... Just found OPNsense frozen again this morning, so it has happened yet again  >:(

When I think about my old Netgear FW with more than 900 days uptime... WTF....

I thought it was perhaps due to WireGuard, but it isn't. I switched on my old OpenVPN server this morning and my connection went down just now.

I'm convinced there is an issue linked to VMware / network interfaces. As explained, the only weird thing is those promiscuous mode enabled / disabled messages occurring hundreds of times before the freeze occurs.

I did love the OPNsense user interface and features, but I fear I will have to look for an alternative; losing my connection every 2 days is simply not acceptable...
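To put a number on "hundreds of times", a minimal Python sketch like the following could count the pflog0 toggles per hour in a syslog-style file, so that bursts can be lined up against the times of the freezes. It assumes the line format shown in the earlier excerpt from host mercure; the name toggles_per_hour is made up for illustration, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Sample lines copied from the earlier excerpt; in practice these would be
# read from a log file instead.
LINES = [
    "Feb  4 04:26:28 mercure kernel: pflog0: promiscuous mode enabled",
    "Feb  4 04:29:19 mercure kernel: pflog0: promiscuous mode disabled",
    "Feb  4 04:29:19 mercure kernel: pflog0: promiscuous mode enabled",
    "Feb  4 12:44:24 mercure kernel: arp: 192.168.10.30 moved from 3e:6e:14:db:09:a0 to 62:f6:45:ef:ad:9d on vmx0",
    "Feb  5 05:27:26 mercure kernel: pflog0: promiscuous mode disabled",
    "Feb  5 05:27:26 mercure kernel: pflog0: promiscuous mode enabled",
    "Feb  5 05:30:53 mercure kernel: pflog0: promiscuous mode disabled",
    "Feb  5 05:30:53 mercure kernel: pflog0: promiscuous mode enabled",
]

# Captures "Mon D HH" (month, day, hour) from lines that report a pflog0 toggle.
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d{2}):\d{2}:\d{2} \S+ kernel: "
    r"pflog0: promiscuous mode (?:enabled|disabled)"
)


def toggles_per_hour(lines):
    """Count pflog0 promiscuous-mode messages, keyed by 'Mon D HH'."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            # Collapse the padding spaces syslog uses for single-digit days.
            counts[" ".join(match.group(1).split())] += 1
    return counts


for hour, count in toggles_per_hour(LINES).items():
    print(f"{hour}:00  {count} toggles")
```

An hour with an unusually high count just before a freeze would support the correlation; a flat count would suggest the messages are only a side effect of the regular filter reloads.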

R.

Maybe it's time to have a look at faulty RAM or network interfaces? Just saying...

From time to time I see this DEVD detach/attach for WAN, mostly directly after rebooting the boxes. No system freeze included though.
kind regards
chemlud

Quote from: Rajstopy on February 12, 2021, 09:48:15 AM
This drives me nuts... Just found OPNsense frozen again this morning [...]

When I think about my old Netgear FW with more than 900 days uptime... WTF....

900 days of uptime? No updates? Crazy shit. Some sort of fire and forget product?

Freeze means only network freeze? As this is a VM are you able to use the console while the network is not reacting anymore?
Are you using some sort of automatic snapshot creation? I have this issue when I try to snapshot the OPNsense VM with memory. I think it should be something in the combination of ESXi and the VM.
,,The S in IoT stands for Security!" :)

Quote
900 days of uptime? No updates? Crazy shit. Some sort of fire and forget product?

I was kidding - to some extent  ;)

First of all thank you very much for your answer, much appreciated.

I guess there is indeed an issue in the combination of ESXi and the VM, as you stated. My freezes are not linked to any snapshot, although I've experienced that snapshot issue as well.

When it freezes, the answer is yes, only the network part. The console remains active but no interface is reachable. All other VMs keep running fine in parallel.

Can you tell me what kind of configuration you use on the VMware side?

R.

Quote from: chemlud on February 12, 2021, 10:05:05 AM
Maybe it's time to have a look at faulty RAM or network interfaces? Just saying...

From time to time I see this DEVD detach/attach for WAN, mostly directly after rebooting the boxes. No system freeze included though.

I ran a memory test and it came back OK.

Is there a way to identify a faulty NIC?

When rebooting the box, this message appears systematically but just once.

R.

I have an ESXi 7 host.

I experienced network freezes when I created a Windows 10 VM. I already had a Windows Server running on the same host. No problems. Problems started when I added this Windows 10 VM. Network was dropping only for OPNsense. Random times. OPNsense was still reacting on console. I removed the Win10 VM and all was back to normal. Didn't have the time to investigate. Maybe you can try to shut down the other VMs one at a time and see whether this improves stability.

Thanks! I only have 2 other VMs, running Debian...

Are you using the VMXNET3 adapter? For all OPNsense interfaces?

Yes

Ok.... Same here.

And do you have open-vm tools installed? I think I don't  ???

R.