OPNsense 22.1 catastrophic failure: "out of swap space", all processes killed

Started by arkanoid, March 05, 2022, 06:57:37 PM

I've been trying to track down a problem that causes my OPNsense box *Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (1 core, 1 thread), 4GB RAM*, running in VMware, to suddenly go on a killing spree and kill all processes (including ssh access), forcing me to hard-reboot it.

The problem is impossible to predict: it happened today at 4:45 AM (GMT+1) when load was relatively low compared to daytime.
Before this event, it had happened twice one month ago. After that I doubled the amount of RAM (2GB -> 4GB), disabled swap (to exclude it from the possible causes), and upgraded OPNsense to the latest version. But the problem is still here.

Please find attached a screenshot of the terminal taken before hard-restarting the virtual machine; it clearly shows the killing spree. This is the only evidence I have of the event, as the logs have no trace of the problem: # grep -r swap /var/log/ returns nothing, and manual exploration of the log files, both via terminal and via web GUI, shows no relevant events before the time of the incident. Only the VGA screenshot shows it (I guess the killing spree kills the logging too?)
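For the record, the FreeBSD OOM killer reports its kills through the kernel log (messages along the lines of "pid N (name), uid U, was killed: out of swap space"), so something like the following is worth running right after the next hard reboot. This is just a generic FreeBSD sketch mirroring the grep above:

# search everything under /var/log for OOM kill messages
grep -ri "was killed" /var/log/
# also check the kernel message buffer of the current boot
dmesg -a | grep -i "was killed"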

The firewall has the swap file disabled (System: Settings: Miscellaneous).
This is /etc/fstab:
# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/gpt/rootfs /               ufs     rw,noatime      1       1
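Just as a sanity check, swap being really off can also be confirmed from the shell with the standard FreeBSD tools (nothing OPNsense-specific):

# list configured swap devices; no entries means swap is truly off
swapinfo -h
swapctl -l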


No IDS, no IPS, just WireGuard running, with many peers connected and exchanging data.

The firewall is externally monitored by:
- hypervisor (VMware)
- zabbix
So I have minute-by-minute memory usage graphs going back weeks from both sources, and they clearly confirm the firewall uses <1GB of RAM the whole time. Please find attached both memory charts for the day of the incident (the large hole in the Zabbix chart is just the Zabbix agent not starting automatically at boot; I started it manually later).
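Since the external monitors keep their data but the local logs die with the box, one workaround (my own improvisation; path and interval are arbitrary) is a cron entry that dumps a process/memory snapshot every minute, so there is at least a local trace of the minutes leading up to the next event:

# hypothetical /etc/crontab entry: one batch-mode top snapshot per minute,
# 15 processes sorted by resident memory, appended to a local log
* * * * * root /usr/bin/top -b -o res 15 >> /var/log/mem-snapshot.log 2>&1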

The single CPU has an average idle time > 40% according to Zabbix and the web GUI (but always 100% according to the hypervisor, which I have yet to understand). Please find the relevant chart attached.

This is what I've found so far that seems linked to the problem, though honestly I have zero clue:
- https://lists.freebsd.org/pipermail/freebsd-current/2019-September/074310.html
  - and this mail in particular: https://lists.freebsd.org/pipermail/freebsd-current/2019-September/074322.html
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241048
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231457

I have no clear idea what's happening here, so I'm just guessing and applying potential fixes. What I'm trying now is:
# sysctl vm.pfault_oom_attempts=10
vm.pfault_oom_attempts: 3 -> 10
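To make the change survive a reboot it can go into a tunable (on OPNsense: System: Settings: Tunables; the plain-FreeBSD equivalent would be /etc/sysctl.conf). As far as I understand from the threads above, the related knobs are:

# number of page-fault retries before the OOM killer is invoked
# (-1 disables this particular OOM path entirely)
vm.pfault_oom_attempts=10
# seconds to wait between retries
vm.pfault_oom_wait=10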


Any ideas? Thanks

Are you only using the base firewall, or do you have other things like ntopng, Suricata, etc. running?

Just the base firewall. As I stated in the post, no IPS, no IDS. All I have is a wg0 interface forwarding data between peers. NetFlow is also disabled.

The more I dive into the problem, the more I fear it is linked to CPU usage.

Looking at the Zabbix (external monitor) logs, I noticed that while average kernel+user CPU usage is below 70%, the system load average has been dangerously close to 1.0 over the last month. That comes down to the difference between system load and CPU load: I had many waiting processes.
This could have triggered the OOM killer and started the killing spree once a "perfect storm" condition arrived, leaving a process waiting longer than the OOM limit allows.
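A quick way to watch this distinction live with stock FreeBSD tools (the interesting bits are the load averages versus the CPU idle figure, plus the count of blocked processes):

# one batch-mode snapshot: load averages and CPU states are in the header
top -S -b | head -n 8
# vmstat "procs" columns: r = runnable, b = blocked, w = runnable but swapped out
vmstat 1 5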

I see only one element against this theory: why kill ALL processes, and not just the ones keeping the system busy? By killing the wireguard-go process alone, the system load would have dropped to near zero, but it went on killing ssh and the web GUI too.

Still scratching my head.

In the meantime, I've installed the experimental wireguard-kmod package, and I'm seeing MASSIVE improvements in CPU usage (and some in memory usage too). I've attached the initial charts, where it's clear that user CPU went to near zero.
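For anyone who wants to try the same on 22.1, this is a rough sketch from memory; the package and module names should be checked against the current OPNsense docs before copy-pasting:

# install the experimental kernel-mode WireGuard alongside the os-wireguard plugin
pkg install wireguard-kmod
# after restarting the WireGuard service, verify the kernel module got loaded
kldstat | grep -i wg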

After some testing, I can confirm that the issue was caused by excessive system load and not by a RAM shortage.

Reading CPU % usage alone is not enough: if the system load average (first line in `top`) reaches the number of processors, it is quite possible that out-of-memory handling gets triggered because a process ends up waiting for too long.
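In other words, compare the load averages against the CPU count, not against the CPU % figure; both values are one sysctl away:

# 1/5/15-minute load averages and the number of CPUs to compare them against
sysctl vm.loadavg hw.ncpu
uptime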