OPNsense going 100% CPU and running out of swap space every night at 3 AM

Started by beisser, March 13, 2023, 08:33:41 AM

Hey guys,

this is the first time I'm posting here, but I could really use some guidance.
I upgraded my virtual OPNsense firewall (running on Proxmox) two days ago from 23.1.2 to 23.1.3, and since then I have been experiencing the same issue every night at 3 AM:
the firewall suddenly goes to 100% CPU and runs out of swap space, killing several processes along the way (among them php, sh and python3.9).
After this the system continues to forward traffic in some capacity, but the GUI becomes extremely sluggish and many functions inside it stop working (such as checking for updates or looking at logs).
The GUI also says that all my plugins are missing.

As for what is going on at 3 AM: the only thing I have scheduled at that time is the disconnect and reconnect of my VDSL PPPoE connection on the OPNsense (it's set to 3 AM so that the ISP doesn't interrupt it during the day). The ddclient update follows immediately after.

This situation persists until I reboot the OPNsense, at which point everything returns to normal as if nothing ever happened.

I could really use some help figuring out what's going on here.
I have been using OPNsense in this configuration for months, and the issue only appeared after the update two days ago.

Edit: I have attached a few screenshots for more context. I can see that configd, ddclient and flowd_aggregate are dead after the issue occurs, until I reboot the VM.

Interesting observation: if I manually restart configd and ddclient without rebooting the firewall, the CPU usage instantly goes to 100% and the memory usage skyrockets. See the attached screenshot.

The second screenshot shows that everything is back to normal after a reboot:
low memory usage and very low CPU usage.

It will stay that way until 3 AM, and then it all repeats.

Check what process is using the CPU. I have been having problems where, when the PPPoE connection goes up and down, Suricata pins the CPU. I'm on a quad-core machine and it pins two of them, so I don't really notice. Restarting Suricata makes it stop (though Suricata does use high CPU for the first few tens of seconds while it loads its rules and sets itself up, then it settles).

If that solves it for you then I think we have found a bug.

Quote from: opnsense-user123 on March 13, 2023, 10:20:59 PM
Check what process is using the CPU. I have been having problems where, when the PPPoE connection goes up and down, Suricata pins the CPU. I'm on a quad-core machine and it pins two of them, so I don't really notice. Restarting Suricata makes it stop (though Suricata does use high CPU for the first few tens of seconds while it loads its rules and sets itself up, then it settles).

If that solves it for you then I think we have found a bug.

I am using just the basic firewall, no IDS/IPS or anything else,
so it can't be Suricata, as it's simply not on the box :)

the only installed plugins are these:

os-ddclient (installed)
os-dmidecode (installed)
os-net-snmp (installed)
os-qemu-guest-agent (installed)
os-theme-cicada (installed)
os-upnp (installed)
os-zabbix-agent (installed)


Edit: I will see if I can find out which process is eating up the resources when it happens again tonight.

The issue is still present; this morning it looked like the attached screenshot.

top doesn't show any particular process eating resources:

last pid: 50791;  load averages:  0.77,  0.88,  0.91                                                       up 0+19:41:10  08:14:14
1474 processes: 1 running, 1470 sleeping, 3 waiting
CPU:  0.2% user,  0.0% nice,  8.2% system,  0.4% interrupt, 91.2% idle
Mem: 5167M Active, 4096B Inact, 1326M Laundry, 1244M Wired, 704M Buf, 31M Free
Swap: 10G Total, 10G Used, K Free, 100% Inuse, 4096B In

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
43369 root          1  20    0    20M  5440K CPU3     3   0:03   1.81% top
6653 root         14  52    0    96M    24M usem     3   0:01   0.09% python3.9
47530 root          1  20    0    17M  1220K select   3   1:53   0.06% miniupnpd
49876 root          1  52    0    13M  2056K select   0   0:00   0.05% logger
23284 root          4  20    0    46M  2544K kqread   2   2:28   0.04% syslog-ng
62628 root          1  29    0    56M    15M nanslp   2   0:00   0.04% php
20246 root          1  24    0    56M    15M nanslp   1   0:00   0.03% php
50310 root          1  52    0    13M  1912K pfault   1   0:00   0.03% logger
99462 root          2  20    0    21M  6212K select   0   0:00   0.03% ntpd
43175 root          1  21    0    56M    15M select   0   0:00   0.03% php
77922 zabbix        1  20    0    21M  1244K accept   1   0:33   0.02% zabbix_agentd
86374 root          1  20    0    23M  5992K select   1   0:10   0.02% python3.9
85938 root          1  20    0    23M  4848K select   0   0:11   0.02% python3.9
78578 zabbix        1  20    0    21M  1112K vmwait   0   0:33   0.02% zabbix_agentd
50067 root          1  52    0    13M  1912K select   0   0:00   0.02% logger
81919 root          1  21    0    56M    15M select   0   0:00   0.02% php
83545 root          1  20    0    21M  2740K kqread   1   0:10   0.01% lighttpd
33910 root          1  20    0    34M  2248K select   2   0:19   0.01% snmpd
78009 zabbix        1  20    0    21M  1112K vmwait   0   0:32   0.01% zabbix_agentd
77614 zabbix        1  20    0    21M  4544K nanslp   0   0:11   0.01% zabbix_agentd
51456 root          1  20    0    56M    15M select   1   0:00   0.01% php
36083 root          1  20    0    18M  3676K select   0   0:00   0.00% sshd
50499 root          1  52    0    13M  1912K select   0   0:00   0.00% logger
36234 root          1  21    0    13M  2048K pfault   2   0:00   0.00% filterlog
10235 root          1  52    0    13M   852K wait     1   0:25   0.00% dhcp6c
67879 root          1  20    0    13M  8192B kqread   1   0:10   0.00% <rtsold>
  391 root          1  20    0    11M   688K select   3   0:07   0.00% devd
69691 root          1  20    0    13M  1568K select   1   0:06   0.00% rtsold
97251 root          1  20    0    40M  8192B nanslp   0   0:06   0.00% <perl>


This time ntpd died, but the whole firewall is nearly unresponsive again, and just like the two nights before, it happened at exactly 3 AM.

Hi!

I'm on OPNsense 23.1.3-amd64 and I just want to add my 10 cents. I've had the same thing happen twice now in the space of the last two weeks, also starting at 03:00 in the morning and consuming all CPU. top showed ~20 zombie processes, with php as the main culprit.

Installed plugins are
* acme
* dds
* hw-probe
* nut
* smart
* theme-cicada
* theme-vicune
* udpbroadcastrelay
* wireguard

The only thing I know of that's running at 03:00 is a WAN IF reset cron entry, which I have now disabled. If it happens again, I'll try and collect some more info.

Quote from: lazyfurrybird on March 14, 2023, 12:05:59 PM
Hi!

I just want to add my 10 cents. I've had the same thing happen twice now in the space of the last two weeks, also starting at 03:00 in the morning and consuming all CPU. top showed ~20 zombie processes, with php as the main culprit.

Installed plugins are
* acme
* dds
* hw-probe
* nut
* smart
* theme-cicada
* theme-vicune
* udpbroadcastrelay
* wireguard

The only thing I know of that's running at 03:00 is a WAN IF reset cron entry, which I have now disabled. If it happens again, I'll try and collect some more info.

Same for me: WAN interface reset scheduled for 3 AM.
I have now moved the reset to 5 AM to see if the issue moves with it.

FreeBSD top(1) supports sorting by various fields. So to find which process eats all the memory (for example), start top, then type

o
size
<ENTER>

or

o
res
<ENTER>

to sort by total memory size or resident memory size, respectively.

If the memory hog is PHP or Python or some other interpreted language, then copy the PID (left column), quit top, and type

ps awwux | grep <PID>

to see what exactly that process is doing.
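If you don't want to drive top interactively, a plain ps/sort pipeline gives a one-shot snapshot of the biggest memory consumers (a generic sketch, not an OPNsense-specific tool; the `-o` column list works on both FreeBSD and Linux ps):

```shell
# One-shot alternative to interactive top: list the five processes with
# the largest resident set size (RSS, in kilobytes), largest first.
ps ax -o rss,pid,command | sort -rn | head -n 5
```

This is handy for capturing the state into a file from a cron job or an SSH one-liner while the box is melting down.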


HTH
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

So the process eating all my CPU was python3.9, and this is the output of the command:

root@OPNsense:~ # ps awwux |grep 227
root      227   0.0  4.9 464000 101644  -  S    22:14     45:23.85 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.9)
root    22718   0.0  0.0  51184      8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    42272   0.0  0.0  51184      8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    62274   0.0  0.0  53300      8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force


I also see about a hundred or so of these in the ps aux output:

root    19433  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    19581  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20116  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    20135  0.0  0.7  56884 15268  -  S    07:18      0:00.16 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20151  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    20298  0.0  0.7  60984 13412  -  S    07:18      0:00.63 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20308  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20326  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20336  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    20423  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20584  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    20843  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    21114  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    21223  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    21328  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    21481  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    21788  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    21838  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    21864  0.0  0.9  60984 18416  -  S    07:18      0:00.37 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    22069  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    22368  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    22481  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    22573  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    22632  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    22718  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    22935  0.0  0.0  51184     8  -  SW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    22991  0.0  0.6  56884 12004  -  S    07:18      0:00.18 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23125  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23131  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    23149  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23187  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23257  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23640  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    23641  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    23695  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23712  0.0  0.0  51184     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root    23757  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root    23836  0.0  0.0  53300     8  -  IW   -          0:00.00 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force


The CPU is eaten by python, the memory by the hundreds of php processes.
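To put a number on that backlog, the stuck workers can simply be counted. A minimal sketch, using three illustrative sample lines in place of the real output (on the firewall you would pipe `ps awwux` into the greps instead):

```shell
# Count queued filter-reload and WAN-IPv6 workers from a ps dump.
# The sample variable stands in for `ps awwux` output on the firewall.
sample='root 20116 0.0 0.0 51184 8 - IW - 0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure
root 20135 0.0 0.7 56884 15268 - S 07:18 0:00.16 /usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force
root 20151 0.0 0.0 51184 8 - IW - 0:00.00 /usr/local/bin/php /usr/local/etc/rc.filter_configure'
printf '%s\n' "$sample" | grep -c 'rc\.filter_configure'   # -> 2
printf '%s\n' "$sample" | grep -c 'rc\.newwanipv6'         # -> 1
```

Logging these two counts once a minute around the reset window would show whether the queue grows steadily or explodes all at once.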

This seems to be related to the scheduled periodic interface reset, since the issue moved to 5 AM together with my new cron setting for the periodic interface reset.

The question is: what to do now?
Having this issue every night is not an acceptable condition.

Looks like a queue of scripts waiting to configure the filter?
Any clue in the general or backend log?

Quote from: Fright on March 15, 2023, 02:57:55 PM
Looks like a queue of scripts waiting to configure the filter?
Any clue in the general or backend log?

Unfortunately, no clue.
My workaround for the moment is a scheduled reboot at 5 AM instead of the periodic interface reset.
This of course is not a permanent solution, but neither is a firewall that runs out of memory/swap with a load of 50-300 every time the interface is reset via cron.

I can only hope that some future update will fix this issue.

Uh, I observe a load spike at 3:30 every night as well...

In addition, every now and then the whole firewall goes crazy: swapping, high load, packet loss... and I can't check anything at that moment anymore; it's kinda dead. The logs look fine.

As a starting point...


# cat /var/cron/tabs/*
# cat /usr/local/etc/cron.d/*
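To narrow that dump down, you can filter the combined crontabs for jobs that fire at 03:00 (minute field 0, hour field 3). The crontab line below is a made-up example, not taken from anyone's actual system; on the firewall you would feed the two `cat` commands above into the awk instead:

```shell
# Filter a crontab dump for entries scheduled at 03:00.
# On the box: cat /var/cron/tabs/* /usr/local/etc/cron.d/* | awk '...'
sample='0 3 * * * root configctl interface reconnect pppoe0
15 4 * * * root /usr/local/sbin/ddclient -daemon=0'
printf '%s\n' "$sample" | awk '$1 == "0" && $2 == "3"'
# -> 0 3 * * * root configctl interface reconnect pppoe0
```

That should surface every job (yours or the system's) that lines up with the nightly spike.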