OPNsense momentarily hangs/spikes CPU

Started by Sam of Ham, August 31, 2024, 02:01:17 PM

Hi all, and thanks in advance for your help.

Following on from recent issues with ZenArmor (Python 3.9 issues, see https://forum.opnsense.org/index.php?topic=42074.0;topicseen and https://forum.opnsense.org/index.php?topic=41932.0;topicseen) which I have now removed until it's proven stable again, I keep having momentary 'hangs' in my OPNsense system that result in, effectively, significant packet loss and lag.

My system's health checks come up OK, and it's a toss-up as to whether something like Discord's ping graph or a ping google.com -t shows any dropouts... But the effects are very much there in anything constant, like a Discord call or an online game.

I AM running OPNsense on an older box, a Qotom Mini PC with a J3160 and 8 GB DDR3 in it, so that could be a factor, but after initially rolling out OPNsense, I was having a great time... I can't work out why it's started to go bad, unless I just happened to get a bad unit or something. (It WAS second hand.)

I'm not the best with Linux or FreeBSD so this is an open call for help... What can I do to diagnose this, and learn as I go, so that I can either repair whatever's unhappy or at least confirm that I need to buy something new?

Thank you all!

You can confirm with top in the console/ssh which process hangs.

If you don't have large lists processing either as Aliases or in Unbound, next culprit would be the reporting DB (and maybe the DNS one.)

If not using the reporting then disable it altogether in Reporting - Settings, else at the very least reset RRD and Netflow data.



You have a fairly decent box, things should run quite smoothly once the issue pegging the CPU is addressed.

Hey! Thanks for your reply!

When you say top - do you mean pfTop via the console? I've SSH'd into it and have been looking at that/firewall log but can't see anything immediately problematic. pfTop is a lot of netstat-style connections so I don't think it's here.

I've reset NetFlow (I was listening on all interfaces [LAN, WAN, TAILSCALE, UNTRUSTED] and changed that to just LAN/WAN). Doesn't look like much change yet, still getting random timeouts in constant ping to google.

I have a Pihole that's my DNS and DHCP server - could there still be some logging issues on the OPNsense box anyways?

The DDR3 Celeron thing seemed to be fine when I got it and it did indeed work alright for a bit, but now it's unhappy, so I wasn't sure if I was just getting good performance for as long as it was 'brand new' so to speak. Glad to hear it seems to be up to spec!


I'm going to go out on a limb here and assume you mean htop?
This is the top result when searching for it https://forum.opnsense.org/index.php?topic=15011.0 and it's from 2019, so perhaps out of date.

Franco mentions using $ pkg install -A autoconf automake libtool and the most recent reply says to [sic] add the mimugmail repo and htop via #pkg install htop but neither command works for me.

Sorry, I know it's frustrating having to spell it out for a noob, but if you could spare a sec to educate me I'd be super grateful!

Tell a lie, I realised (or more accurately, wondered) that the "$ " was unnecessary. Seems to work better now but doesn't contain htop. I get this:

root@OPNsense:~ # pkg install htop
Updating OPNsense repository catalogue...
OPNsense repository is up to date.
All repositories are up to date.
pkg: No packages available to install matching 'htop' have been found in the repositories
root@OPNsense:~ #


Same for top.

Type top into shell. You do not need htop. You do not need to compile anything.

Oh... Cool, great to know!

Here's a snippet of the output. Nothing immediately jumps out at me as problematic?
last pid: 32400;  load averages:  0.48,  0.54,  0.50                                                        up 2+05:37:59  19:59:55
46 processes:  1 running, 45 sleeping
CPU:  0.4% user,  0.0% nice,  0.2% system,  0.1% interrupt, 99.3% idle
Mem: 61M Active, 2255M Inact, 2195M Laundry, 3146M Wired, 581M Buf, 318M Free
Swap: 8192M Total, 579M Used, 7613M Free, 7% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
56942 root          7  20    0  6486M  1840M nanslp   2  53:22   1.57% suricata
73162 root          1  20    0    14M  3288K CPU3     3   0:00   0.09% top
89890 root          1  20    0    27M    13M select   1   0:53   0.01% python3.11
88541 root          1  20    0    26M    12M select   2   0:30   0.01% python3.11
17965 root          1  33    0    13M  1912K wait     1   0:43   0.01% sh
56359 root          1  20    0    19M  9672K select   3   0:00   0.01% sshd-session
8833 root          1  20    0    23M  5972K select   3   0:17   0.01% ntpd
40019 root          1  21    0    52M    35M nanslp   3 110:00   0.01% python3.11
28497 root          1  20    0    23M  3444K select   3   0:04   0.01% mpd5
53741 root          1  20    0    22M  8128K kqread   1   0:35   0.00% lighttpd
84435 root          1  20    0    13M  1908K bpf      1   0:50   0.00% filterlog
35995 root          2  20    0    20M  3496K nanslp   3   0:05   0.00% monit
  270 root          1  20    0    86M    31M accept   0  10:29   0.00% python3.11
10839 root          3  20    0    54M    12M kqread   2   3:32   0.00% syslog-ng
8032 root         12  20    0  1262M    31M uwait    1   2:27   0.00% tailscaled
45415 _flowd        1  20    0    12M  1632K select   1   0:07   0.00% flowd
  266 root          1  52    0    26M    12M wait     3   0:03   0.00% python3.11
16035 root          1  32    0    13M  1864K nanslp   1   0:02   0.00% cron
2869 root          1  24    0    63M    31M accept   1   0:02   0.00% php-cgi
82580 root          1  21    0    63M    31M accept   1   0:02   0.00% php-cgi
15862 root          1  20    0    63M    30M accept   1   0:01   0.00% php-cgi
20014 nobody        1  20    0    12M  1788K sbwait   1   0:01   0.00% samplicate
47672 root          1  20    0    63M    30M accept   3   0:01   0.00% php-cgi
17033 root          1  20    0    12M  1276K piperd   0   0:01   0.00% daemon
54518 root          1  20    0    59M    25M wait     0   0:00   0.00% php-cgi
95682 root          1  20    0    63M    30M accept   3   0:00   0.00% php-cgi
54155 root          1  20    0    59M    25M wait     0   0:00   0.00% php-cgi
20149 root          1  39    0    63M    30M accept   3   0:00   0.00% php-cgi
7768 root          1  20    0    12M  1272K piperd   1   0:00   0.00% daemon
  589 root          1  20    0    11M   824K select   2   0:00   0.00% devd
48252 root          1  25    0    19M  9376K select   3   0:00   0.00% sshd-session
16447 root          1  20    0    19M  8144K select   1   0:00   0.00% sshd
71192 root          1  37    0    13M  3532K pause    2   0:00   0.00% csh
56558 root          1  45    0    13M  2612K wait     2   0:00   0.00% sh
58114 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
57765 root          1  52    0    12M  1548K ttyin    3   0:00   0.00% getty
60472 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
58982 root          1  52    0    12M  1548K ttyin    3   0:00   0.00% getty
61297 root          1  52    0    12M  1544K ttyin    3   0:00   0.00% getty
59071 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
59274 root          1  52    0    12M  1548K ttyin    2   0:00   0.00% getty
60338 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
45172 root          1  20    0    12M  1484K sbwait   2   0:00   0.00% flowd
10492 root          1  52    0    23M  7896K wait     2   0:00   0.00% syslog-ng
32400 root          1  33    0    12M  1800K nanslp   2   0:00   0.00% sleep
19737 root          1  52    0    12M  1872K piperd   1   0:00   0.00% daemon


Every now and then, it freezes for a sec, and something using Python 3.11 jumps to the top:
last pid: 46555;  load averages:  0.37,  0.40,  0.43                                                        up 2+05:45:04  20:07:00
49 processes:  3 running, 46 sleeping
CPU: 16.8% user,  0.0% nice,  2.5% system,  0.1% interrupt, 80.7% idle
Mem: 84M Active, 2256M Inact, 2195M Laundry, 3147M Wired, 581M Buf, 299M Free
Swap: 8192M Total, 579M Used, 7613M Free, 7% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
40019 root          1  24    0    52M    35M CPU3     3 110:11  19.54% python3.11
56942 root          7  20    0  6486M  1840M nanslp   2  53:29   1.53% suricata
10580 root          1  20    0    14M  3292K CPU2     2   0:00   0.08% top
17965 root          1  52    0    13M  1912K wait     3   0:43   0.04% sh
16035 root          1  20    0    13M  1864K nanslp   1   0:02   0.02% cron
56359 root          1  20    0    19M  9672K select   1   0:00   0.02% sshd-session
88541 root          1  20    0    26M    12M select   3   0:30   0.02% python3.11
10839 root          3  20    0    54M    12M kqread   3   3:32   0.02% syslog-ng
89890 root          1  20    0    27M    13M select   2   0:53   0.02% python3.11
8833 root          1  20    0    23M  5972K select   1   0:17   0.01% ntpd
53741 root          1  20    0    22M  8128K kqread   2   0:35   0.00% lighttpd
35995 root          2  20    0    20M  3496K nanslp   0   0:05   0.00% monit
84435 root          1  20    0    13M  1908K bpf      2   0:50   0.00% filterlog
  270 root          1  20    0    86M    31M accept   2  10:29   0.00% python3.11
8032 root         12  20    0  1262M    31M uwait    2   2:27   0.00% tailscaled
45415 _flowd        1  20    0    12M  1632K select   1   0:07   0.00% flowd
28497 root          1  20    0    23M  3444K select   1   0:04   0.00% mpd5
  266 root          1  52    0    26M    12M wait     3   0:03   0.00% python3.11
2869 root          1  24    0    63M    31M accept   1   0:02   0.00% php-cgi
82580 root          1  21    0    63M    31M accept   1   0:02   0.00% php-cgi
15862 root          1  20    0    63M    30M accept   1   0:01   0.00% php-cgi
20014 nobody        1  20    0    12M  1788K sbwait   1   0:01   0.00% samplicate
47672 root          1  20    0    63M    30M accept   3   0:01   0.00% php-cgi
45633 root          1  74    0    58M    39M CPU0     0   0:01   0.00% python3.11
17033 root          1  20    0    12M  1276K piperd   3   0:01   0.00% daemon
54518 root          1  20    0    59M    25M wait     0   0:00   0.00% php-cgi
95682 root          1  20    0    63M    30M accept   3   0:00   0.00% php-cgi
54155 root          1  20    0    59M    25M wait     0   0:00   0.00% php-cgi
20149 root          1  39    0    63M    30M accept   3   0:00   0.00% php-cgi
7768 root          1  20    0    12M  1272K piperd   1   0:00   0.00% daemon
  589 root          1  20    0    11M   824K select   0   0:00   0.00% devd
48252 root          1  25    0    19M  9376K select   3   0:00   0.00% sshd-session
16447 root          1  20    0    19M  8144K select   1   0:00   0.00% sshd
6322 root          1  20    0    13M  3536K pause    1   0:00   0.00% csh
56558 root          1  20    0    13M  2612K wait     3   0:00   0.00% sh
45497 root          1  22    0    13M  2208K wait     1   0:00   0.00% flock
45127 root          1  21    0    13M  1960K piperd   3   0:00   0.00% cron
58114 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
57765 root          1  52    0    12M  1548K ttyin    3   0:00   0.00% getty
60472 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
58982 root          1  52    0    12M  1548K ttyin    3   0:00   0.00% getty
61297 root          1  52    0    12M  1544K ttyin    3   0:00   0.00% getty
59071 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
59274 root          1  52    0    12M  1548K ttyin    2   0:00   0.00% getty
60338 root          1  52    0    12M  1544K ttyin    2   0:00   0.00% getty
45172 root          1  20    0    12M  1484K sbwait   2   0:00   0.00% flowd
10492 root          1  52    0    23M  7896K wait     2   0:00   0.00% syslog-ng
45699 root          1  52    0    12M  1796K nanslp   2   0:00   0.00% sleep
19737 root          1  52    0    12M  1872K piperd   1   0:00   0.00% daemon


Though what that means, I'm not super sure. Could the whole ZenArmor Python debacle have left some problematic Python in place or something?
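
The jump described above can also be picked out of pasted `top` output mechanically. What follows is only a minimal sketch in Python, not anything OPNsense ships: the function name `busy_processes` and the 5% threshold are made up for illustration, and it assumes the default FreeBSD `top` layout, where WCPU is the second-to-last column and COMMAND is a single word. It reads text copied from `top`; it does not run `top` itself.

```python
def busy_processes(top_text, threshold=5.0):
    """Return (pid, wcpu, command) for rows whose WCPU exceeds threshold.

    Assumes the plain-text process table printed by FreeBSD top, with
    WCPU as the second-to-last column (ending in '%') and a one-word
    COMMAND as the last column.
    """
    rows = []
    for line in top_text.splitlines():
        fields = line.split()
        # Process rows start with a numeric PID; skip headers and blanks.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        wcpu_field = fields[-2]
        if not wcpu_field.endswith("%"):
            continue
        wcpu = float(wcpu_field.rstrip("%"))
        if wcpu > threshold:
            rows.append((int(fields[0]), wcpu, fields[-1]))
    # Busiest process first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Two rows copied from the second snapshot in this post.
sample = """\
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
40019 root          1  24    0    52M    35M CPU3     3 110:11  19.54% python3.11
56942 root          7  20    0  6486M  1840M nanslp   2  53:29   1.53% suricata
"""
print(busy_processes(sample))  # [(40019, 19.54, 'python3.11')]
```

The PID it reports is the one worth looking up next, since `top` on its own only shows `python3.11` and not which script is running.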

You can see the commandline via "ps www 40019" or whatever PID is hogging your CPU.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Okay, good call. 40019 appears to be netflow. What does "Time" mean here? I'd assume time running, but the 'time' for that PID isn't changing in seconds. Is it time hung?

  PID TT  STAT      TIME COMMAND
40019  -  Rs   114:04.44 /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.11)
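
On the TIME question: in `ps` and `top`, TIME is the CPU time the process has accumulated (user plus system), not wall-clock time since it started, so it only advances while the process is actually running on a CPU. A rough worked example in Python, using the two `top` snapshots posted earlier for PID 40019 (TIME 110:00 at 19:59:55 and 110:11 at 20:07:00), is sketched below. The helper names are hypothetical, and reading the column as minutes:seconds is an assumption based on those snapshots.

```python
def cpu_seconds(time_field):
    """Convert a TIME value such as '110:11' or '114:04.44' to seconds."""
    minutes, seconds = time_field.split(":")
    return int(minutes) * 60 + float(seconds)


def clock_seconds(hhmmss):
    """Convert a wall-clock 'HH:MM:SS' stamp to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the two top snapshots for PID 40019.
cpu_used = cpu_seconds("110:11") - cpu_seconds("110:00")         # CPU seconds consumed
elapsed = clock_seconds("20:07:00") - clock_seconds("19:59:55")  # wall-clock seconds

print(f"{cpu_used:.0f} s of CPU over {elapsed} s, "
      f"about {100 * cpu_used / elapsed:.1f}% of one core on average")
# 11 s of CPU over 425 s, about 2.6% of one core on average
```

Read that way, a TIME value that does not tick every second is what you would expect from a process that sleeps most of the time and works in short bursts.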


I've reset NetFlow data and turned off the interfaces (leaving both saying "nothing selected") hoping that that will free things up a bit. Can you guys recommend anything more? Again, apologies for the newbieism here - trying to learn as best I can as I go, both from you guys and lots of searching!

40019 now no longer shows up in my `top` list but I still get varied pings to google and fairly regular spikes of 200-600 in Discord. I'm lost! I don't know what changed between good performance time and now, except for ZenArmor's stuff!
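
One way to put numbers on the "varied pings" is to save the ping output to a file and count timeouts and slow replies. The Python sketch below is illustrative only: `summarise_ping` is a hypothetical helper, the 200 ms cut-off is an assumption taken from the low end of the Discord spikes mentioned above, it expects Windows-style `ping` lines, and the sample lines are invented for the example rather than measured on this network.

```python
import re

# Matches the latency in Windows ping replies, e.g. "time=14ms" or "time<1ms".
REPLY = re.compile(r"time[=<](\d+)ms")


def summarise_ping(log_text, spike_ms=200):
    """Count replies, timeouts, and replies at or above spike_ms."""
    times = []
    timeouts = 0
    for line in log_text.splitlines():
        match = REPLY.search(line)
        if match:
            times.append(int(match.group(1)))
        elif "timed out" in line.lower():
            timeouts += 1
    spikes = [t for t in times if t >= spike_ms]
    return {
        "replies": len(times),
        "timeouts": timeouts,
        "spikes": len(spikes),
        "worst_ms": max(times) if times else None,
    }


# Invented sample lines (192.0.2.1 is a documentation address).
sample_log = """\
Reply from 192.0.2.1: bytes=32 time=14ms TTL=117
Request timed out.
Reply from 192.0.2.1: bytes=32 time=412ms TTL=117
Reply from 192.0.2.1: bytes=32 time=15ms TTL=117
"""
print(summarise_ping(sample_log))
```

Comparing a summary from before and after a change (for example, disabling NetFlow) makes it easier to tell whether that change helped than watching the ping window scroll.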

newsense already told you to at least repair, or better, reset the netflow database. With high CPU load after an update, this is always the first thing to try.

September 01, 2024, 04:21:20 PM #12 Last Edit: September 01, 2024, 05:22:22 PM by Sam of Ham
Okay, well, I did reset the Netflow database. I'll try the repair and see if that helps. Forgive me but I would have assumed resetting the database and/or just disabling the thing could have shown at least a temporary change, thus indicating that this service is actually the problem. All good - will see if the repair works, and report back! Thanks!

So far, after a reset and repair of the database and a reboot (later on - did that today), there seems to be no major change. My ping graph in Discord looks like the Frasier titlecard. I'm stumped! Any further ideas?

Again, and as always, big thanks to everyone.

It's all the same exercise again. Find out what's spiking your CPU. And that netflow feature should be disabled (untick any interfaces enabled there, reset the db, maybe even reboot after that). (No idea what's the cool factor behind these services without an Enable/Disable checkbox. NTPD comes to mind as well.)