High CPU Usage since upgrade from 19.1.10 to 19.7

Started by cguilford, July 17, 2019, 07:46:24 PM

Even with "netflow off", CPU usage is still higher than on 19.1! I summarized my trials with the different OPNsense versions, and I can also confirm that with "netflow off" the GUI reacts faster.

See the attachment for the summary.

July 21, 2019, 04:19:35 AM #16 Last Edit: July 21, 2019, 04:37:37 AM by jazz
Literally just upgraded to 19.7 in the last hour, and the first thing I noticed was that CPU usage has gone through the roof to 99% at idle with minimal traffic. Previously on 19.1 under the same load I would see next to zero CPU usage.


last pid: 71370;  load averages:  1.40,  1.37,  1.42                                                                           up 0+00:36:46  12:21:45
49 processes:  2 running, 47 sleeping
CPU: 50.2% user,  0.0% nice,  0.2% system,  1.2% interrupt, 48.5% idle
Mem: 142M Active, 123M Inact, 296M Wired, 170M Buf, 1369M Free
Swap:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
35139 root          1 103    0 34996K 29804K CPU1    1  30:40  99.89% python3.7



I think I have alleviated the problem somewhat by resetting NetFlow data. After I did that, CPU usage seems to have dropped back to normal, with only the occasional spike from Python.

Also seeing high CPU utilization after upgrading from 19.1.10 to 19.7. As shown in the thread, it appears to be Python/Netflow related.

PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root       155 ki31     0K    64K CPU2    2   8:16  99.15% [idle{idle: cpu2}]
   11 root       155 ki31     0K    64K CPU0    0   9:19  89.21% [idle{idle: cpu0}]
   11 root       155 ki31     0K    64K RUN     3   8:34  85.92% [idle{idle: cpu3}]
   11 root       155 ki31     0K    64K RUN     1   8:12  72.94% [idle{idle: cpu1}]
52874 root        52    0 19736K 14632K piperd  3   0:01  44.24% /usr/local/bin/python3 /usr/local/opnsense/scripts/filte


PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
83957 root        84    0 28848K 25344K CPU2    2   3:34  96.98% /usr/local/bin/python3 /usr/local/opnsense/scripts/netfl
   11 root       155 ki31     0K    64K RUN     1   9:54  66.54% [idle{idle: cpu1}]
   11 root       155 ki31     0K    64K CPU3    3  10:10  63.93% [idle{idle: cpu3}]
   11 root       155 ki31     0K    64K CPU0    0  11:08  51.51% [idle{idle: cpu0}]
   11 root       155 ki31     0K    64K RUN     2   9:57  42.72% [idle{idle: cpu2}]
   19 root       -16    -     0K    16K -       0   0:17  11.31% [rand_harvestq]
   12 root       -60    -     0K   544K WAIT    1   0:03   1.03% [intr{swi4: clock (0)}]
    0 root       -92    -     0K   592K -       0   0:02   0.36% [kernel{dummynet}]
36090 root        52    0 51688K 41524K accept  0   0:04   0.34% /usr/local/bin/php-cgi
40440 root        20    0  1034M  4536K CPU1    1   0:00   0.07% top -aSCHIP


I'll try resetting Netflow data and report back. I've also noticed that the web interface is noticeably laggy after the 19.7 upgrade, again probably due to the CPU utilization. This is on a bare metal install, Celeron J3455 quad core, 16GB RAM, and a 120GB SSD. Usually a very snappy system.
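
For anyone comparing outputs like the ones above, here is a rough sketch of how the pasted `top` lines could be filtered to show only the real CPU consumers, skipping the `[idle...]` threads. It assumes the column layout of the two listings above (PID, USERNAME, PRI, NICE, SIZE, RES, STATE, C, TIME, CPU, COMMAND, with no THR column); the names `LINE_RE`, `busy_processes` and `SAMPLE` are made up for illustration and are not part of OPNsense.

```python
import re

# Assumed layout (taken from the listings above, no THR column):
# PID USERNAME PRI NICE SIZE RES STATE C TIME CPU% COMMAND
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<user>\S+)\s+(?P<pri>-?\d+)\s+(?P<nice>\S+)\s+"
    r"(?P<size>\S+)\s+(?P<res>\S+)\s+(?P<state>\S+)\s+(?P<cpu_id>\d+)\s+"
    r"(?P<time>\S+)\s+(?P<cpu>[\d.]+)%\s+(?P<command>.+)$"
)


def busy_processes(top_output, threshold=10.0):
    """Return (cpu_percent, pid, command) tuples at or above threshold,
    highest first, ignoring the kernel idle threads."""
    rows = []
    for line in top_output.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # header line or a layout this sketch does not know
        command = match.group("command").strip()
        if command.startswith("[idle"):
            continue  # idle threads are not real load
        cpu = float(match.group("cpu"))
        if cpu >= threshold:
            rows.append((cpu, int(match.group("pid")), command))
    return sorted(rows, reverse=True)


# Lines copied from the second listing above (paths are truncated by top).
SAMPLE = """\
PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
83957 root        84    0 28848K 25344K CPU2    2   3:34  96.98% /usr/local/bin/python3 /usr/local/opnsense/scripts/netfl
   11 root       155 ki31     0K    64K RUN     1   9:54  66.54% [idle{idle: cpu1}]
   19 root       -16    -     0K    16K -       0   0:17  11.31% [rand_harvestq]
   12 root       -60    -     0K   544K WAIT    1   0:03   1.03% [intr{swi4: clock (0)}]
36090 root        52    0 51688K 41524K accept  0   0:04   0.34% /usr/local/bin/php-cgi
"""

if __name__ == "__main__":
    for cpu, pid, command in busy_processes(SAMPLE):
        print(f"{cpu:6.2f}%  pid {pid:<6} {command}")
```

Saving the `top` output to a file and passing its contents to `busy_processes` should make it easier to compare before and after resetting the Netflow data. The 10% threshold is an arbitrary default.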

I've been running my firewall without local netflow capture enabled since yesterday and the CPU is normal.  So then I re-enabled it and let it run for a few hours to get the attached RRD graph.  Definitely using more CPU with local netflow enabled.

Still not as bad as before I reset the Netflow data, but definitely more than 19.1.

Same situation for me. The firewall stops working and the only solution is a local reboot.


Just following up on my previous post to provide some extra input. I first tried just repairing the netflow data; this had no impact on perceived performance, and CPU utilization remained high. I then completely reset the RRD graphs and netflow data and rebooted the device.

Unfortunately, even with these steps I've seen no improvement in page load performance. I can understand that this new version may need more processing power for NetFlow. What doesn't make sense to me is why page loads as a whole are noticeably laggy and slow compared to 19.1.

There are 3 patches listed at https://github.com/opnsense/core/issues/3587, with instructions. Install those and see if it helps. They made a very noticeable difference in my system's performance.

All 3 patches made their way into 19.7.1. It's not perfect and will receive more fine tuning eventually, but for now we will need to focus on other priorities even though the level of CPU use is not what it used to be in 19.1.

Using pure Python 3 instead of Python 2 C bindings does have different levels of processor usage. The main issue is that Python 2 C bindings are already buggy with Clang, unmaintained and about to be deprecated via end of life of Python 2.

Thanks to everybody helping to diagnose this. <3


Cheers,
Franco

September 03, 2019, 09:37:04 PM #24 Last Edit: September 03, 2019, 09:48:57 PM by ThuTex
Franco: I updated from the 19.1 series to 19.7.3 and also noticed the CPU load,
which is now almost constantly at 95%, seemingly due to suricata and netflow.
(with suricata often logging Error reading data from iface 'pppoe0': (55u) No buffer space available )

Both suricata and netflow were already running on 19.1, where I had maybe a 10% load (so the CPU load jumped extremely high, even in low-traffic situations).
I don't know what buffer space would be needed, but there is enough free disk space and memory as well as swap space, so that cannot be the issue.

Since turning off suricata and netflow is not an option, I was wondering whether it is possible to downgrade back to 19.1.
(I would rather stay on an outdated firewall than disable functions or use, and thus pay for, a lot more electricity, since this is a 24/7 appliance.)

I currently kill the involved processes (suricata, netflow, syslog-ng) and then have relatively stable, normal CPU usage for a while, but it seems to return to high usage after some time for no clear reason.