Weird CPU utilization since 24.7.2

Started by lostcontrol, August 29, 2024, 09:08:02 PM

August 29, 2024, 09:08:02 PM Last Edit: August 29, 2024, 09:09:36 PM by lostcontrol
Hi all,

I tried my luck on Reddit but got no answer so I'm posting here.

Since upgrading to 24.7.2, I see weird values (abnormally high) for CPU utilization. The previous releases seemed to be fine in this regard (but the RRD collection was broken).

I see ~40% constant usage now (see screenshot in attachment) whereas vmstat gives me less than 20%!?

root@opnsense:~ # vmstat 60
procs    memory    page                      disks       faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr ada0 pas0   in   sy   cs us sy id
0  0 39 518G 504M 2.3k   4   0   1 2.0k  122    0    0  239 1.1k  927  9  3 87
0  0 39 518G 503M 2.3k   0   0   1 2.0k  171   10    0  306 1.1k 1.1k 10  3 86
1  0 39 518G 502M 2.2k   0   0   1 2.0k  171    7    0  225 1.0k  979 11  2 85
0  0 39 518G 502M 2.2k   0   0   1 2.0k  171   10    0  236 1.0k 1.0k 10  3 86
0  0 39 518G 502M 2.2k   0   0   1 2.0k  171    7    0  986 1.0k 2.6k 11  3 84
0  0 39 518G 502M 2.3k   0   0   1 2.0k  171   10    0  315 1.1k 1.1k 11  3 85
0  0 39 518G 501M 2.2k   0   0   1 2.0k  171    7    0  229 1.0k  969 11  2 85
0  0 39 518G 501M 2.3k   0   0   1 2.0k  171   11    0  309 1.1k 1.1k 11  2 85
1  0 39 518G 500M 2.2k   0   0   1 2.0k  171    8    0  379 1.2k 1.2k 13  3 83
2  0 39 518G 499M 2.6k   0   0   1 2.3k  172   10    0  335 1.3k 1.3k 13  3 83
1  0 39 518G 490M 2.4k   0   0   1 2.2k  173    6    0  489 1.1k 1.3k  8  3 87
0  0 39 518G 490M 2.2k   0   0   1 2.0k  172   10    0  284 1.0k 1.1k  9  2 87


Anyone noticed a similar behavior?

Thank you.

EDIT: Here is the Reddit post

Fun fact, I saw your thread on Reddit and linked to it here (sorry for not responding, I don't have an account there):
https://forum.opnsense.org/index.php?topic=41759.45

I've also noticed increased CPU utilization, and I'm wondering if the increased temperature readings are a third-order effect of whatever the overall underlying cause is.

I personally have been unable to identify the reason for it.

When you upgrade to 24.7.3 ... is it still the same, or does it revert to pre-24.7 behavior?

Quote from: irrenarzt on August 30, 2024, 12:54:55 AM
I personally have been unable to identify the reason for it.

Personally, you haven't even bothered looking at the top output and focused instead on rants about CPUs burning at 70C.  ::)

Dunno, if something is using your CPU, have a look at what it is. Ditto for the OP here:
https://man.freebsd.org/cgi/man.cgi?top(1)
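
If it's not obvious from the interactive view, a batch-mode run sorted by CPU should make it stand out; something along these lines (flags per the top(1) man page, adjust delay and count as needed):

root@opnsense:~ # top -b -o cpu -s 60 -d 2 20

The second display should reflect activity over the 60-second sampling window rather than whatever happens to be running the instant you look.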


I still have the theory that this is also connected to the ICMPv6 issue in 24.7.1 and 24.7.2 which was reverted in 24.7.3.

Partially, yes. Practically there might be a case for "sysctl -a" running hotter than usual with FreeBSD 14. Maybe for the right reasons, but I think we'll never know.


Cheers,
Franco

The attached screenshot shows my "States" with the *EXACT* same configuration over the whole time of the plot.

You can clearly see the change when I upgraded from 24.1.10 to 24.7.1 on 08.08.2024 ... and the pattern also changes after the upgrade to 24.7.2 on 21.08.2024.

And guess what, now that I upgraded to 24.7.3 just yesterday, it is back to old behaviour (not easy to see yet in the screenshot, but if I adjust the axis scaling it can be clearly seen).

So my assumption is that this change in states also comes with a change in CPU utilization.

PS: and also my weird Android Wifi disconnects that I was chasing over the last weeks have miraculously vanished since I upgraded to 24.7.3 yesterday.  ;D

Do you happen to use any firewall aliases with large table entries loaded?

I discovered that by disabling Maxmind Geoblock, I'm able to reduce my temperatures and utilization to pre-24.7 levels. Since I made no alteration to any of mine between 24.1 and 24.7, I think python 3.11 introduced a bug.
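
In case it helps with the comparison: to see how large the loaded tables actually are, a quick loop like this should print the entry count per pf table (crude, but enough to spot a huge GeoIP alias):

root@opnsense:~ # for t in $(pfctl -sT); do printf '%s: %s\n' "$t" "$(pfctl -t "$t" -T show | wc -l)"; done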

September 02, 2024, 08:19:02 AM #8 Last Edit: September 02, 2024, 08:20:38 AM by franco
Mind you, the CPU health graph doesn't even run "sysctl -a" and never has.

And Python 3.11 was introduced in 24.1.7.

All of this points to FreeBSD 14.1 if you ask me.


Cheers,
Franco

Quote from: irrenarzt on September 02, 2024, 12:10:03 AM
Do you happen to use any firewall aliases with large table entries loaded?

I discovered that by disabling Maxmind Geoblock, I'm able to reduce my temperatures and utilization to pre-24.7 levels. Since I made no alteration to any of mine between 24.1 and 24.7, I think python 3.11 introduced a bug.

Yes, I *do* use Maxmind Geoblock with country aliases that result in large table entries.

But then again, it all went back to pre-24.7 behaviour when upgrading to 24.7.3 as you can see in my state graph.

24.1.10 -> 24.7.1 on 2024-08-08
24.7.1 -> 24.7.2 on 2024-08-21
24.7.2 -> 24.7.3 on 2024-08-29

You can see this correlates perfectly with the graph. I'm pretty sure this is due to the FreeBSD ICMP issue and not Python.
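
If you want to cross-check the graph against the live counter, the state table summary from pfctl should show it directly (exact wording may differ slightly between versions):

root@opnsense:~ # pfctl -si | grep -A 4 "State Table"

That prints the current entry count plus the searches/inserts/removals rates, which is handy to note down before and after an upgrade.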

To be 100% fair the OP posted the CPU graph until you made it about states. The FreeBSD SA problem is not in dispute and not relevant here either.

https://forum.opnsense.org/index.php?topic=42066.msg211020#msg211020

In the linked OpenBSD commit you can clearly read:

This allows pf to match Address Resolution, Neighbor Unreachability
Detection and Duplicate Address Detection packets to the corresponding
states without the need to create new ones or match unrelated ones.

And FreeBSD has slowly come around on their stance on why that may be important. Prior to that we already disabled state tracking for ND packets again in 24.7.3.


Cheers,
Franco

Yes, I don't want to derail this thread any further. My assumption was that such a large increase in states is most likely also correlated with an increase in CPU consumption. As I have seen a reduction in states with 24.7.3 in my setup, I just wanted to suggest checking the OP's CPU issue against 24.7.3 as well, to verify or falsify whether it has also been resolved.

But my assumption may be wrong and I won't hijack this any further.

Regarding the OP's issue:

I did check the RRD just for funzies (I went from .1 to .3). And yes, there is an increase of CPU utilization visible in the RRD, but the difference is only about 4% for "System" when comparing 24.7.1 vs 24.7.3. No change for "User".

These are negligible values. Such an increase is what I would expect anyway, since we moved to a newer version of FreeBSD and a lot has changed there.
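
If anyone wants to pull the raw numbers out of the RRD for their own comparison, rrdtool can dump them; something like this should work (the exact filename is my assumption, check what updaterrd.sh writes on your system):

root@opnsense:~ # rrdtool fetch /var/db/rrd/system-processor.rrd AVERAGE -s -4w -e now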

However, the OP shows much higher utilization values, specifically for user space. Check top and find out which process is doing it. It's very hard to say what is causing this without knowing your setup.

Like, what are you running?
1. Any huge alias lists?
2. Any routing protocols?
3. IPv4 or IPv6 (static, DHCP, etc.)?
4. Etc.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Sorry for not answering earlier. I was pretty busy.

The weird thing is that like vmstat, top does not show more activity than before 24.7.2. Here is the output of top with a refresh rate of 60 seconds:


last pid:   674;  load averages:  0.52,  0.47,  0.43                                            up 4+02:18:50  21:08:22
91 processes:  1 running, 90 sleeping
CPU:  9.8% user,  0.0% nice,  3.9% system,  0.1% interrupt, 86.3% idle
Mem: 78M Active, 538M Inact, 2122M Wired, 104K Buf, 998M Free
ARC: 1580M Total, 847M MFU, 557M MRU, 9363K Anon, 22M Header, 142M Other
     1281M Compressed, 3081M Uncompressed, 2.41:1 Ratio
Swap: 8192M Total, 8192M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
49492 root          1  20    0    51M    38M nanslp   0  51:56  11.08% python3.11
19870 root          4  24    0   201M   147M kqread   1  30:53   2.32% python3.11
26229 root          1  20    0    79M    52M nanslp   1  50:16   0.86% php
16926 unbound       2  20    0   101M    76M kqread   1   1:34   0.21% unbound
94552 root          1  20    0    12M  2236K select   1   3:16   0.04% udpbroadcastrelay
87910 _flowd        1  20    0    12M  2716K select   0   1:23   0.03% flowd
96971 root          1  20    0    12M  2284K select   0   1:11   0.02% powerd
48469 root          4  68    0    13M  2680K uwait    0   0:10   0.02% dpinger
30572 root          7  20    0    45M    22M select   0   0:35   0.02% kea-dhcp4
57112 root          1  20    0    23M  8176K select   0   0:10   0.02% ntpd
94018 root          1  20    0    26M    14M select   1   0:14   0.02% python3.11
94113 root          1  20    0    27M    15M select   1   0:14   0.01% python3.11
6027 root          1  20    0    19M  9012K select   1   0:00   0.01% sshd-session
26673 root          4  68    0    13M  2644K uwait    0   0:46   0.01% dpinger
45218 nobody        1  20    0    12M  2176K sbwait   0   0:31   0.01% samplicate
23713 root          2  20    0    46M    15M kqread   0   0:21   0.01% syslog-ng
58815 root          1  20    0    12M  2348K select   0   0:02   0.01% igmpproxy
82211 root          1  20    0    14M  3748K CPU0     0   0:00   0.00% top
44911 dhcpd         1  20    0    24M    12M select   1   0:03   0.00% dhcpd
59765 root          2  20    0    22M    11M nanslp   1   0:11   0.00% monit
97725 root          1  20    0    22M    11M kqread   0   0:05   0.00% lighttpd


In my case, I see a slight increase of CPU temperature. I checked my states and indeed, they went up after 24.7.2 and slightly down again with 24.7.3 but still higher than with 24.7.0 and 24.7.1 (see attached screenshot)!?

I have 165115 alias table entries, most from the IPv6 bogon networks (internal) alias. I didn't change this between 24.1 and 24.7, though I don't remember the count under 24.1.

I use both IPv4 and IPv6. I didn't notice any issue with IPv6, except that I must manually reload my WAN interface to get the correct IPv6 gateway!? This was working fine before 24.7.

I disabled IDS because I thought it might be causing the high CPU load but this did not change anything.

But I think I just found the problem :) I looked at /var/db/rrd/updaterrd.sh and saw that the CPU stats come from cpustats. I ran that program in a loop, and clearly it only reports the instantaneous CPU load. For RRD this will not work, as the value depends on whatever happens to be running at the moment of collection and does not take into account what happened during the last minute. vmstat 60 or top -s 60 show the correct behavior.


root@opnsense:~ # sh -c "while [ true ] ; do cpustats ; sleep 2 ; done"
1.6:0.0:2.3:0.0:96.1
49.6:0.0:2.0:0.0:48.4
0.0:0.0:1.6:0.0:98.4
0.0:0.0:1.6:0.4:98.0
6.7:0.0:2.7:0.4:90.2
0.8:0.0:2.0:0.0:97.2
0.0:0.0:2.8:0.0:97.2
0.8:0.0:5.9:0.0:93.3
0.0:0.0:1.9:0.0:98.1
0.8:0.0:4.7:0.0:94.5
1.5:0.0:5.8:0.0:92.7
12.5:0.0:4.3:0.4:82.8
0.4:0.0:1.2:0.0:98.4
0.4:0.0:3.9:0.0:95.7
0.4:0.0:6.3:0.0:93.4
0.0:0.0:1.5:0.0:98.5
0.8:0.0:3.0:0.0:96.2
21.3:0.0:18.9:0.0:59.8
41.5:0.0:20.2:0.0:38.4
0.4:0.0:1.6:0.0:98.0
0.8:0.0:1.6:0.0:97.7
11.2:0.0:4.2:0.0:84.6
49.2:0.0:2.7:0.4:47.7
12.9:0.0:6.6:0.4:80.1
0.8:0.0:1.6:0.0:97.7
0.0:0.0:3.1:0.0:96.9
0.4:0.0:3.1:0.0:96.5
0.4:0.0:4.3:0.0:95.3
13.3:0.0:3.5:0.0:83.2



root@opnsense:~ # sh -c "while [ true ] ; do cpustats ; sleep 30 ; done"
44.5:0.0:8.7:0.0:46.9
0.4:0.0:2.7:0.0:96.9
0.4:0.0:1.1:0.0:98.5
0.4:0.0:1.9:0.4:97.3
0.4:0.0:1.6:0.0:98.0
45.3:0.0:3.5:0.0:51.2
0.0:0.0:2.0:0.0:98.0


I don't know how it was done before this script was moved to a cron job, but something must have been different.
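
Just to illustrate what I mean by averaging over the interval: a rough sketch using the kern.cp_time counters (user/nice/sys/intr/idle ticks since boot, if I read the cpustats output format right) would be to sample twice and take the delta:

root@opnsense:~ # a=$(sysctl -n kern.cp_time); sleep 60; b=$(sysctl -n kern.cp_time)
root@opnsense:~ # echo "$a $b" | awk '{ t = 0; for (i = 1; i <= 5; i++) { d[i] = $(i+5) - $i; t += d[i] } for (i = 1; i <= 5; i++) printf "%.1f%s", 100*d[i]/t, (i < 5 ? ":" : "\n") }'

That gives the same user:nice:sys:intr:idle percentages, but averaged over the whole minute instead of a single snapshot, which matches what vmstat 60 and top -s 60 report.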

Should I open a ticket on GitHub for this? Thank you

Cyril