Hi!
Updated 2 OPNsense to 21.7.3 about an hour ago. No reboot. Had to start syslog_ng manually from the Dashboard afterwards.
Now I get frequent (preconfigured...) warning emails that CPU and MEM usage is very high on both machines with no major throughput.
In Diagnostics - Activity I have
OPNsense 1
47519 root 102 0 6352 5414 CPU1 1 69:35 98.78% /usr/local/bin/python3 /usr/local/opnsense/service/configd_ctl.py -e -t 0.5 system event config_changed (python3.8)
36687 root 52 0 27 3836 RUN 1 0:45 43.36% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid{syslog-ng}
--------------------------
OPNsense 2
92636 root 83 0 32 8724 CPU0 0 1:06 35.60% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid{syslog-ng}
69153 root 103 0 5091 4559 CPU3 3 68:37 100.00% /usr/local/bin/python3 /usr/local/opnsense/service/configd_ctl.py -e -t 0.5 system event config_changed (python3.8)
Anyone?
... did a reboot on both machines, looks better now...
I do see a similar/same issue.
Upgraded to 27.1.3 and can see very high cpu load. python3.8 consumes around 60-90% CPU all the time. I rebooted the VM but it didn't help.
I reverted to the VM snapshot I did before the upgrade an CPU load is back to normal.
I had the same issue after upgrade: high cpu usage by a python process. Was gone after a reboot.
root 48330 99.4 0.3 33936 22620 - R 22:51 2:07.71 /usr/local/bin/python3 /usr/local/opnsense/service/configd_ctl.
I have upgraded but not rebooted my other machine which show the same issue, can I do anything helpful in investigating this?
Edit: I see lots of these in the log:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
syslog-ng has high cpu load now as well.
Edit 2:
On a sidenote: this shows the power of ZFS compression:
root@haanjdj:/var/log/system # zfs get compressratio
NAME PROPERTY VALUE SOURCE
zroot compressratio 8.33x -
zroot/ROOT compressratio 2.22x -
zroot/ROOT/default compressratio 2.22x -
zroot/tmp compressratio 1.66x -
zroot/usr compressratio 1.00x -
zroot/usr/home compressratio 1.00x -
zroot/usr/ports compressratio 1.00x -
zroot/usr/src compressratio 1.00x -
zroot/var compressratio 27.41x -
zroot/var/audit compressratio 1.00x -
zroot/var/crash compressratio 1.00x -
zroot/var/log compressratio 27.42x -
zroot/var/mail compressratio 1.00x -
zroot/var/tmp compressratio 1.00x -
True, python was close to 100% CPU (at one core). Rebooted and now is OK.
I'm seeing the same issue. I rebooted and the CPU usage has dropped but it is still a good bit higher than what it is normally. The memory usage seems normal.
Quote from: dinguz on September 22, 2021, 07:31:46 PM
I had the same issue after upgrade: high cpu usage by a python process. Was gone after a reboot.
root 48330 99.4 0.3 33936 22620 - R 22:51 2:07.71 /usr/local/bin/python3 /usr/local/opnsense/service/configd_ctl.
I have upgraded but not rebooted my other machine which show the same issue, can I do anything helpful in investigating this?
Edit: I see lots of these in the log:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
2021-09-22T20:28:12 configctl[48330] event @ 1632335220.47 msg:
syslog-ng has high cpu load now as well.
If you don't reboot, those will fill up your logs pretty quickly, to the tune of over 40 GB per day of logs in my case.
https://forum.opnsense.org/index.php?topic=24148.0
Franco commented there (and also on Reddit this morning) that "Something in the combination of Python and Syslog-ng update of 21.1.9" started causing this on occasion and that they're looking into it.
I had the same issue (cpu 100%) after the latest automatic update. A reboot fixed the problem.
Same for me. 1 core pegged but a reboot fixed it totally for me.
Reboot did not fix the issue in my case. Post upgrade to 21.7.3 Suricata consumes constantly 50%CPU (used to be 5-10% prior to upgrade). Other processes seem to be fine. I've rolled back and the CPU utilisation went back to normal.
Quote from: pavulinux on September 23, 2021, 08:22:50 AM
Reboot did not fix the issue in my case. Post upgrade to 21.7.3 Suricata consumes constantly 50%CPU (used to be 5-10% prior to upgrade). Other processes seem to be fine. I've rolled back and the CPU utilisation went back to normal.
Err, Suricata was updated from version 5 to 6. This might be unavoidable unless you want to roll back forever.
Cheers,
Franco
confirmed high CPU usage
for me it was syslogNG, dont have suricata installed
reboot helped, cpu usage is not high anymore
though memory usage is abit high
after reboot cpu and mem is normal again
I have been formatting opnsense for a while now, every time major versions are released (summer - January). With 21.7 I did it as well. I say this because I just upgraded to 21.7.3 and have not had any problems with high cpu and memory consumption without having to reboot to get it fixed. The same has happened with other problems that have been reported since opnsense 21.7 was released. Everything has worked fine. So I recommend when major versions are released to format Opnsense.
... I do a fresh install when the underlying OS get's a major update, as expected for 22.1, but not every 6 months. That would be overkill imho...
after reboot cpu and mem seem back to normal. Last fresh install was made in January 2018 :D
Same here. Reboot needed. Also syslog-ng crashed and fixed after reboot.
Quote from: chemlud on September 23, 2021, 01:38:25 PM
... I do a fresh install when the underlying OS get's a major update, as expected for 22.1, but not every 6 months. That would be overkill imho...
With 21.7 I did it more because of the change to zfs.
imho configd_ctl.py ("configd_ctl.py -e -t 0.5 system event config_changed") goes mad on stdin reading when pipe breaks (syslog-ng crash?)
perhaps adding some exception handling would be useful
No high memory or CPU issues on my Protectli vault.
Like i wrote in another thread, my problem with high load is the following, reboot did not fix it.
61047 root 85 0 28M 19M CPU0 0 0:06 98.33% /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.8)
After deactivate
Reporting -> Netflow -> Capture local
it runs normal without high load. Maybe some incompatibility with the new python 3.8?
Quote from: Olli on September 25, 2021, 03:25:04 PM
Like i wrote in another thread, my problem with high load is the following, reboot did not fix it.
After deactivate
Reporting -> Netflow -> Capture local
it runs normal without high load. Maybe some incompatibility with the new python 3.8?
I have had this happen in the past after Netflow database corruption. If you reset the data files (Reporting -> Settings) and re-enable Netflow, does the high load come back?
Quote from: dinguz on September 25, 2021, 04:43:17 PM
I have had this happen in the past after Netflow database corruption. If you reset the data files (Reporting -> Settings) and re-enable Netflow, does the high load come back?
Thanks. That seems to work.
Maybe the update crashed the database :)
same to me did the upgrade yesterday to OPNsense 21.7.3_1 - CPU and Memory are close to full (100%).
Syslog-ng Daemon agent was not started.
Check top .. saying .. that sylog-ng and python 3.8 eats all the ressources on the system.
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
----------------------------------------------------------------------------------------------
74757 root 6 29 0 33M 10M kqread 0 9:57 109.57% syslog-ng
15412 root 1 101 0 1435M 1221M CPU0 0 429:56 97.99% python3.8
90861 root 1 101 0 11M 1916K CPU1 1 15:39 97.89% syslogd
Did not do a reboot till now ..
Seems this firmware release was not properly tested before populate it.
Same issue here. CPU seems to get overworked then the internet comes to a crawl. A reboot fixes it, but this should not be right. I can tell this happens because I use Monit and it alerts me when the CPU is overloaded. I have a simple setup only using Suricata with a few rules and Monit. I don't use netflow, Sensei, bridges, or VPN. I updated to 21.7.3_1 hoping for a fix, but happened again this morning. I don't see anything abnormal in the top logs. It's a mystery.
top shows Suricata averaging at 2%, but python 3.8 pops in at 60-80% every once in a while which correlates to the CPU temps rising.
Also I noticed the temps average 50-55 now. Was 30-35 with 21.7.2
Have you tried to reset RRD and Netflow data (in Reporting: Settings)? I have seen this behavior occasionally with database corruption.
I have RRD disabled and no interfaces selected in netflow. I do not care about insight data. But to test I turned them on and selected the interface. Issue still remains.
did a reboot this morning and now memory-/swap/-cpu consumption is normal.
So - reboot fixes this issue definitively
problem is we shouldn't have to reboot to get OPNsense 'back to normal'. These occurrences are due to a problem and needs to be fixed. We didn't use to see this issue.
https://github.com/opnsense/core/commit/17aec4ed46 hotfix pushed today to address this... thanks to kulikov-a for digging into this
Cheers,
Franco
Now when I do an update to get this fix I get 'could not authenticate the selected mirror' due to a certificate verification error
i think le cert chain fixed already
https://www.ssllabs.com/ssltest/analyze.html?d=pkg.opnsense.org&s=89.149.211.205
or you can just change mirror
Changing mirror worked
any chance that you added cross-signed ISRG Root X1 cert to Trusted? (thats may be the reason)
I'm sorry for the "me too" post that mixes issues but it might warrant it.
I upgraded today from 21.7.2 to 21.7.3 and rebooted the firewall.
Yesterday I had disabled Netflow by leaving " Listening interfaces" and " WAN interfaces" as "Nothing selected".
Once all services had restarted post-boot, I gave it a few minutes to settle down and checked memory and cpu consumption. The cpu was high for a while.
htop gave me one process to check that looked like interesting.
/usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.8)
This had a WCPU value of 98%, constant for some 5 minutes.
A couple of ploints: an fstat showed the open file that the script was using was about 15 MB in size.
A quick look at the script I read: that it will rotate files at 10 MB and it's main method will check and repair it's sqlite db and log 'start watching flowd' to syslog.
It was there:
2021-09-30T17:58:32 /flowd_aggregate.py[54199] vacuum done
2021-09-30T17:58:31 /flowd_aggregate.py[54199] vacuum interface_086400.sqlite
2021-09-30T17:58:31 /flowd_aggregate.py[54199] vacuum interface_003600.sqlite
2021-09-30T17:58:31 /flowd_aggregate.py[54199] vacuum interface_000300.sqlite
2021-09-30T17:58:31 /flowd_aggregate.py[54199] vacuum interface_000030.sqlite
2021-09-30T17:58:15 /flowd_aggregate.py[54199] vacuum dst_port_086400.sqlite
2021-09-30T17:58:15 /flowd_aggregate.py[54199] vacuum dst_port_003600.sqlite
2021-09-30T17:58:15 /flowd_aggregate.py[54199] vacuum dst_port_000300.sqlite
2021-09-30T17:58:07 /flowd_aggregate.py[54199] vacuum src_addr_086400.sqlite
2021-09-30T17:58:07 /flowd_aggregate.py[54199] vacuum src_addr_003600.sqlite
2021-09-30T17:58:07 /flowd_aggregate.py[54199] vacuum src_addr_000300.sqlite
2021-09-30T17:57:48 dhclient[74977] Creating resolv.conf
2021-09-30T17:57:39 /flowd_aggregate.py[54199] vacuum src_addr_details_086400.sqlite
2021-09-30T17:57:08 /flowd_aggregate.py[54199] start watching flowd
2021-09-30T17:56:24 opnsense[53768] plugins_configure newwanip (execute task : webgui_configure_do(,wan))
2021-09-30T17:56:24 opnsense[53768] plugins_configure newwanip (execute task : vxlan_configure_interface())
2021-09-30T17:56:15 opnsense[53768] plugins_configure newwanip (execute task : unbound_configure_do(,wan))
2021-09-30T17:56:15 opnsense[53768] plugins_configure newwanip (execute task : openssh_configure_do(,wan))
2021-09-30T17:56:15 opnsense[53768] plugins_configure newwanip (execute task : opendns_configure_do())
2021-09-30T17:56:14 opnsense[53768] plugins_configure newwanip (execute task : ntpd_configure_do())
2021-09-30T17:56:13 opnsense[53768] /usr/local/etc/rc.newwanip: Curl error occurred: Resolving timed out after 15001 milliseconds
2021-09-30T17:56:12 /flowd_aggregate.py[54199] startup, check database.
I thought it took longer than the 2 mins the logs show but maybe I my recollection is incorrect.
What I see now is that that's the "hoovering" routine that spikes the cpu. At first I thought new python version, needs to maybe convert the data in files or db and was needed as a one-off.
However the script runs on a loop that seems outside the in-script "vacuum_interval = (60*60*8) # 8 hour vacuum cycle"
Question: is it normal that it runs even when Netflow is disabled?
I've not sighup'ed it in case is needed and this is how is meant to be.
Getting back to main topic. with 21.7.3_3, I do see slightly reduced CPU usage and lower temps with less CPU spikes. All 4 cores average in the 40C range now. Using Intel i7-6700. Going to change to i5-6600T next week to reduce temps even further.
probably my post is too verbose but is on topic.
High cpu usage post upgrade to 21.7.3 and I'm trying to point to what might be a problem. Or rather seeking confirmation if is normal to have flowd_aggregate.py running if there is no netflow enabled.
Sorry if is not clear.
@cookiemonster
Quoteif is normal to have flowd_aggregate.py running if there is no netflow enabled
no. should not imho. but script determines your netflow as enabled with local target. try to clear the "Capture local" checkbox and apply
Indeed Fright. Clearing that seems to have stopped the python script from running and reducing the cpu usage.
Thank you.