CPU Performance after 23.7 upgrade

Started by JustMeHere, August 05, 2023, 02:40:40 AM

Yep, I uploaded the new snapshot build today. You still need to restart the watcher afterwards:

# pluginctl -s dpinger restart :watcher:


Cheers,
Franco

In my case the patch did not reduce CPU usage at all. What did help was changing $poll from 1 to 5, which brought CPU usage back to the level it was at with 23.1.
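For anyone wanting to try the same workaround: the poll interval sits in the watcher script itself (/usr/local/opnsense/scripts/routes/gateway_watcher.php), but be aware this is an unsupported local edit and the next update will overwrite it. Assuming the default line still reads "$poll = 1;", something along these lines does it:

# sed -i '' 's/$poll = 1;/$poll = 5;/' /usr/local/opnsense/scripts/routes/gateway_watcher.php   # adjust the pattern if the line differs in your version
# pluginctl -s dpinger restart :watcher: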

It really doesn't do much, as I said initially. It also depends on the hardware... or on a link flapping constantly and forcing it to trigger. But nobody has posted a gateway log of such events. These logs are new in 23.7, by the way, so you can see exactly what happened to a gateway and when.

We can make the delays configurable some day but not before ironing out the real issues first.
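If someone does see a flap, something like this should pull the relevant entries out of the gateway log for posting here:

# grep -h ALERT /var/log/gateways/gateways_*.log | tail -n 50   # -h drops the file names, keep the newest lines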


Cheers,
Franco

Sorry, I somehow missed your request for entries from the log file. I am not sure if this helps; it is from /var/log/gateways/gateways_2023082{4,6}.log (the two newest files):

<165>1 2023-08-24T13:18:04+00:00 scrambler.wenks.ch dpinger 2426 - [meta sequenceId="1"] ALERT: WAN1IP4GW (Addr: 8.8.8.8 Alarm: down -> none RTT: 2.7 ms RTTd: 1.3 ms Loss: 0.0 %)
<165>1 2023-08-24T13:18:04+00:00 scrambler.wenks.ch dpinger 2426 - [meta sequenceId="2"] ALERT: WAN2IP4GW (Addr: 8.8.4.4 Alarm: down -> none RTT: 15.1 ms RTTd: 6.8 ms Loss: 0.0 %)
<165>1 2023-08-24T13:18:04+00:00 scrambler.wenks.ch dpinger 2426 - [meta sequenceId="3"] ALERT: WAN1IP6GW (Addr: 2606:4700:4700::1111 Alarm: down -> none RTT: 2.4 ms RTTd: 1.5 ms Loss: 0.0 %)
<165>1 2023-08-24T13:18:04+00:00 scrambler.wenks.ch dpinger 2426 - [meta sequenceId="4"] ALERT: WAN2IP6GW (Addr: 2620:fe::9 Alarm: down -> none RTT: 2.5 ms RTTd: 1.5 ms Loss: 0.0 %)
<165>1 2023-08-24T18:18:52+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="1"] ALERT: WAN1IP4GW (Addr: 8.8.8.8 Alarm: down -> none RTT: 2.3 ms RTTd: 1.3 ms Loss: 0.0 %)
<165>1 2023-08-24T18:18:52+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="2"] ALERT: WAN2IP4GW (Addr: 8.8.4.4 Alarm: down -> none RTT: 12.5 ms RTTd: 4.9 ms Loss: 0.0 %)
<165>1 2023-08-24T18:18:52+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="3"] ALERT: WAN1IP6GW (Addr: 2606:4700:4700::1111 Alarm: down -> none RTT: 2.1 ms RTTd: 1.5 ms Loss: 0.0 %)
<165>1 2023-08-24T18:18:52+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="4"] ALERT: WAN2IP6GW (Addr: 2620:fe::9 Alarm: down -> none RTT: 2.1 ms RTTd: 1.3 ms Loss: 0.0 %)
<12>1 2023-08-26T15:24:50+00:00 scrambler.wenks.ch dpinger 50861 - [meta sequenceId="1"] exiting on signal 15
<12>1 2023-08-26T15:24:50+00:00 scrambler.wenks.ch dpinger 60121 - [meta sequenceId="2"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 8.8.8.8  bind_addr 62.12.173.5  identifier "WAN1IP4GW "
<165>1 2023-08-26T15:24:51+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="3"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2023-08-26T15:25:05+00:00 scrambler.wenks.ch dpinger 57225 - [meta sequenceId="4"] exiting on signal 15
<12>1 2023-08-26T15:25:06+00:00 scrambler.wenks.ch dpinger 82049 - [meta sequenceId="5"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 8.8.4.4  bind_addr 62.2.85.182  identifier "WAN2IP4GW "
<165>1 2023-08-26T15:25:08+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="6"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2023-08-26T15:26:26+00:00 scrambler.wenks.ch dpinger 48560 - [meta sequenceId="1"] exiting on signal 15
<12>1 2023-08-26T15:26:26+00:00 scrambler.wenks.ch dpinger 68631 - [meta sequenceId="2"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 2606:4700:4700::1111  bind_addr 2001:8a8:1005:1::5  identifier "WAN1IP6GW "
<165>1 2023-08-26T15:26:26+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="3"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2023-08-26T15:26:39+00:00 scrambler.wenks.ch dpinger 52964 - [meta sequenceId="4"] exiting on signal 15
<12>1 2023-08-26T15:26:40+00:00 scrambler.wenks.ch dpinger 59088 - [meta sequenceId="5"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 2620:fe::9  bind_addr 2001:8a8:1005:2::182  identifier "WAN2IP6GW "
<165>1 2023-08-26T15:26:40+00:00 scrambler.wenks.ch dpinger 72524 - [meta sequenceId="6"] Reloaded gateway watcher configuration on SIGHUP
<165>1 2023-08-26T16:16:36+00:00 scrambler.wenks.ch dpinger 56541 - [meta sequenceId="1"] ALERT: WAN1IP4GW (Addr: 8.8.8.8 Alarm: down -> none RTT: 2.3 ms RTTd: 2.0 ms Loss: 0.0 %)
<165>1 2023-08-26T16:16:36+00:00 scrambler.wenks.ch dpinger 56541 - [meta sequenceId="2"] ALERT: WAN2IP4GW (Addr: 8.8.4.4 Alarm: down -> none RTT: 13.4 ms RTTd: 6.5 ms Loss: 0.0 %)
<165>1 2023-08-26T16:16:36+00:00 scrambler.wenks.ch dpinger 56541 - [meta sequenceId="3"] ALERT: WAN1IP6GW (Addr: 2606:4700:4700::1111 Alarm: down -> none RTT: 1.9 ms RTTd: 2.1 ms Loss: 0.0 %)
<165>1 2023-08-26T16:16:36+00:00 scrambler.wenks.ch dpinger 56541 - [meta sequenceId="4"] ALERT: WAN2IP6GW (Addr: 2620:fe::9 Alarm: down -> none RTT: 2.1 ms RTTd: 2.1 ms Loss: 0.0 %)


That looks normal. Thanks so far.

I've slipped a small change into 23.7.3 to be able to optimise this further. If you are on 23.7.3 you can try the next patch:

https://github.com/opnsense/core/commit/bb951f2eb

# opnsense-patch bb951f2eb
# pluginctl -s dpinger restart :watcher:

Note that this breaks if applied to 23.7.2 or earlier...
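If it misbehaves, reapplying the same patch should reverse it again (opnsense-patch reverts a patch it finds already applied), followed by another watcher restart:

# opnsense-patch bb951f2eb
# pluginctl -s dpinger restart :watcher: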


Cheers,
Franco

It helped to reduce CPU usage. In my case it is down by about a third of the way from the 23.7.0 level back towards the 23.1.x level (both with the default $poll = 1).

That's roughly what I saw here as well... for each second of wall time I had 0.3 seconds of overall processing time, and that patch should take 0.1 of that away. I still have to look at the class - that seems to be where the other 0.2 seconds are spent. Maybe we can get it down by another 0.1. Otherwise we really have to change the default poll interval.
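If anyone wants to compare numbers, the accumulated CPU time of the watcher relative to how long it has been running gives a rough per-second figure, for example:

# ps -ax -o time,etime,command | grep '[g]ateway_watcher'   # TIME divided by ELAPSED is the per-second cost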


Cheers,
Franco

Hello Franco

On my systems, since the update to OPNsense 23.7.4 last night, CPU usage is back down to the level it was at with 23.1.x.

Thank you very much for these improvements!


Best regards,
Fabian

I am seeing high CPU use and temperatures after upgrading to 23.7.9. Netflow seems to be the culprit. Resetting Netflow data and rebuilding the database have not helped, nor has rebooting the box.
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
15527 root          1 100    0    30M    20M CPU3     3  18:40  93.46% /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.9)
88671 root         31  20    0   865M   203M uwait    2   1:13   3.42% /usr/local/AdGuardHome/AdGuardHome -s run
26850 root          1  52    0    58M    28M accept   2   0:02   1.76% /usr/local/bin/php-cgi
56115 root          1  20    0    13M  2548K bpf      3   0:09   1.27% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
32301 root          5  20    0    45M    12M kqread   0   0:19   0.88% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
18061 root          1  52    0    59M    28M accept   0   0:06   0.59% /usr/local/bin/php-cgi
23558 root          1  20    0    59M    28M accept   2   0:04   0.49% /usr/local/bin/php-cgi
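
For reference, this is how I am checking how much flow data the aggregator has to work through; both paths are my guess at the default locations, so adjust as needed:

# ls -lh /var/log/flowd.log*    # raw flowd capture files (path assumed)
# ls -lh /var/netflow/*.sqlite  # aggregated netflow databases (path assumed)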

Has anyone else seen this?
OPNsense 24.7.7-amd64 on APU2E4 using ZFS

After a couple of days' running, it still seems netflow is causing significantly higher CPU use than before the upgrade. I have turned off IPv6 completely (following these instructions) to see if that would help; it made no difference.
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
15527 root          1  80    0    54M    36M CPU2     2  22.8H  27.78% /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.9)
51856 root          1  21    0    58M    28M accept   3   0:05   1.27% /usr/local/bin/php-cgi
66423 root          1  52    0    58M    28M accept   3   0:02   1.17% /usr/local/bin/php-cgi
83323 root          1  20    0    58M    28M accept   2   0:05   0.29% /usr/local/bin/php-cgi
88671 root         31  20    0  1130M   477M uwait    0  38:39   0.10% /usr/local/AdGuardHome/AdGuardHome -s run
18068 root          1  52    0    58M    29M accept   1   0:06   0.10% /usr/local/bin/php-cgi
14297 root          1  25    0    13M  3792K pause    3   0:00   0.10% /bin/csh
  241 root          7  52    0   104M    42M accept   2  19:32   0.00% /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.9)
20283 root          1  20    0    88M    60M nanslp   1  11:59   0.00% /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
32301 root          4  20    0    49M    12M kqread   3   8:51   0.00% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
73875 root          1  20    0    13M  2644K bpf      3   4:38   0.00% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
79273 root          1  25    0    13M  3016K wait     3   3:14   0.00% /bin/sh /var/db/rrd/updaterrd.sh
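
One thing I may try next is attaching to the aggregator to see whether the time goes into database I/O or into Python itself (PID taken from the listing above, substitute your own):

# truss -p 15527      # sample its system calls for a few seconds, then Ctrl+C
# procstat -kk 15527  # kernel-side view of where its threads are waiting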

Does anyone have suggestions as to what I could attempt, other than turning netflow off?
OPNsense 24.7.7-amd64 on APU2E4 using ZFS

Same issue here. Since the upgrade to 23.7.10_1-amd64, CPU usage is much higher than before, and my Monit system monitor regularly complains about CPU usage over 75%.

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
46114 root          1  85    0    48M    31M CPU1     1  33.0H  97.91% /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.9)
24479 root          1  28    0    69M    33M nanslp   0   0:01  11.18% /usr/local/bin/php-cgi
  257 root          1  52    0   135M    40M accept   0  39:54   4.87% /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.9)
67545 root         13  20    0  2869M  1074M nanslp   1 790:15   3.19% /usr/local/bin/suricata -D --netmap --pidfile /var/run/suricata.pid -c /usr/local/etc/suricata/suricata.yaml
19874 root         12  20  -20   906M    43M uwait    2 124:11   2.45% ipdrstreamer
22851 mongodb      28  52    0  2876M  1043M uwait    2 242:15   2.34% /usr/local/bin/mongod --logappend --setParameter=disabledSecureAllocatorDomains=* --config /usr/local/etc/mongodb.conf --fork
19514 root         12  20  -20  2997M   332M nanslp   1 471:32   2.03% eastpect: Eastpect Instance 0 (eastpect)
55207 root          3  20    0    50M  9976K kqread   3  12:06   0.46% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
15107 root          1  20    0    14M  3660K CPU3     3   0:02   0.43% top
44068 root          1  20    0    60M    31M accept   0   0:06   0.31% /usr/local/bin/php-cgi
37823 root          2  20  -20   946M    56M nanslp   2  18:10   0.28% /usr/local/zenarmor//bin/eastpect -D
80154 root          1  20    0    13M  1796K bpf      0   2:04   0.18% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
80838 root          1  28    0    13M  1784K nanslp   0   0:19   0.10% /usr/sbin/cron -s
16942 root          1  20    0    21M  7204K kqread   1   1:04   0.09% /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf
69045 root          1  20    0    25M    14M select   1   1:58   0.08% /usr/local/bin/python3 /usr/local/opnsense/scripts/dhcp/unbound_watcher.py --domain home (python3.9)
82347 root          1  20    0    23M    11M select   1   3:23   0.05% /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
98727 root          1  20    0    18M  7344K select   0   0:00   0.05% sshd: root@pts/0 (sshd)
82850 root          1  20    0    23M    11M select   1   3:04   0.04% /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
  258 squid         1  20    0   174M    21M kqread   3   1:39   0.03% (squid-1) --kid squid-1 -f /usr/local/etc/squid/squid.conf (squid)
54088 root          1  21    0    13M  1932K wait     1   1:25   0.02% /bin/sh /var/db/rrd/updaterrd.sh
39938 root          1  20    0    21M  5868K select   2   0:28   0.01% /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf


Does anyone have any suggestions?

Greetings Manuel