Hi all,
I tried my luck on Reddit but got no answer so I'm posting here.
Since upgrading to 24.7.2, I see weird values (abnormally high) for CPU utilization. The previous releases seemed to be fine in this regard (but the RRD collection was broken).
I now see a constant ~40% usage (see screenshot in attachment), whereas vmstat reports less than 20%!?
root@opnsense:~ # vmstat 60
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr ada0 pas0 in sy cs us sy id
0 0 39 518G 504M 2.3k 4 0 1 2.0k 122 0 0 239 1.1k 927 9 3 87
0 0 39 518G 503M 2.3k 0 0 1 2.0k 171 10 0 306 1.1k 1.1k 10 3 86
1 0 39 518G 502M 2.2k 0 0 1 2.0k 171 7 0 225 1.0k 979 11 2 85
0 0 39 518G 502M 2.2k 0 0 1 2.0k 171 10 0 236 1.0k 1.0k 10 3 86
0 0 39 518G 502M 2.2k 0 0 1 2.0k 171 7 0 986 1.0k 2.6k 11 3 84
0 0 39 518G 502M 2.3k 0 0 1 2.0k 171 10 0 315 1.1k 1.1k 11 3 85
0 0 39 518G 501M 2.2k 0 0 1 2.0k 171 7 0 229 1.0k 969 11 2 85
0 0 39 518G 501M 2.3k 0 0 1 2.0k 171 11 0 309 1.1k 1.1k 11 2 85
1 0 39 518G 500M 2.2k 0 0 1 2.0k 171 8 0 379 1.2k 1.2k 13 3 83
2 0 39 518G 499M 2.6k 0 0 1 2.3k 172 10 0 335 1.3k 1.3k 13 3 83
1 0 39 518G 490M 2.4k 0 0 1 2.2k 173 6 0 489 1.1k 1.3k 8 3 87
0 0 39 518G 490M 2.2k 0 0 1 2.0k 172 10 0 284 1.0k 1.1k 9 2 87
Has anyone noticed similar behavior?
Thank you.
EDIT: Here is the Reddit post (https://www.reddit.com/r/opnsense/comments/1f13bkk/weird_cpu_utilization_since_2472/)
Fun fact, I saw your thread on Reddit and linked to it here (sorry for not responding, I don't have an account there):
https://forum.opnsense.org/index.php?topic=41759.45
I've also noticed increased CPU utilization, and I'm wondering if the increased temperature readings are a third-order effect of whatever the underlying cause is.
I personally have been unable to identify the reason for it.
When you upgrade to 24.7.3 ... is it still the same, or does it revert to pre-24.7 behavior?
Quote from: irrenarzt on August 30, 2024, 12:54:55 AM
I personally have been unable to identify the reason for it.
Personally, you haven't even bothered looking at the top output and instead focused on rants about CPUs burning at 70C. ::)
Dunno, if something is using your CPU, have a look at what it is. Ditto for the OP here:
https://man.freebsd.org/cgi/man.cgi?top(1)
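For example, one way to do that with the flags documented there (adjust to taste):

top -S -s 60 -d 2 -o cpu

-S includes kernel/system processes, -o cpu sorts by CPU usage, and -s 60 -d 2 prints two displays 60 seconds apart, so the second one reflects the whole interval instead of a momentary snapshot.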
I still have the theory that this is also connected to the ICMPv6 issue in 24.7.1 and 24.7.2 which was reverted in 24.7.3.
Partially, yes. Practically there might be a case for "sysctl -a" running hotter than usual with FreeBSD 14. Maybe for the right reasons, but I think we'll never know.
Cheers,
Franco
The attached screenshot shows my "States" with the *EXACT* same configuration over the whole time span of the plot.
You can clearly see the change when I upgraded from 24.1.10 to 24.7.1 on 08.08.2024 ... and the pattern also changes after the upgrade to 24.7.2 on 21.08.2024.
And guess what, now that I upgraded to 24.7.3 just yesterday, it is back to the old behaviour (not easy to see in the screenshot yet, but if I adjust the axis scaling it can be seen clearly).
So my assumption is that this change in states also comes with a change in CPU utilization.
PS: The weird Android Wi-Fi disconnects I had been chasing over the last weeks have also miraculously vanished since I upgraded to 24.7.3 yesterday. ;D
Do you happen to use any firewall aliases with large table entries loaded?
I discovered that by disabling Maxmind Geoblock I'm able to reduce my temperatures and utilization to pre-24.7 levels. Since I made no alteration to any of my aliases between 24.1 and 24.7, I think Python 3.11 introduced a bug.
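If anyone wants to check how large their loaded tables actually are, something along these lines should work from a shell (plain pfctl per FreeBSD's pfctl(8), nothing OPNsense-specific assumed):

for t in $(pfctl -sT); do echo "$t: $(pfctl -t "$t" -T show | wc -l) entries"; done

That lists every pf table together with its current entry count, which makes an oversized Geoblock alias easy to spot.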
Mind you, the CPU health graph doesn't even run "sysctl -a" and never has.
And Python 3.11 was introduced in 24.1.7.
All of this points to FreeBSD 14.1 if you ask me.
Cheers,
Franco
Quote from: irrenarzt on September 02, 2024, 12:10:03 AM
Do you happen to use any firewall aliases with large table entries loaded?
I discovered that by disabling Maxmind Geoblock I'm able to reduce my temperatures and utilization to pre-24.7 levels. Since I made no alteration to any of my aliases between 24.1 and 24.7, I think Python 3.11 introduced a bug.
Yes, I *do* use Maxmind Geoblock with country aliases that result in large table entries.
But then again, it all went back to pre-24.7 behaviour when upgrading to 24.7.3 as you can see in my state graph.
24.1.10 -> 24.7.1 on 2024-08-08
24.7.1 -> 24.7.2 on 2024-08-21
24.7.2 -> 24.7.3 on 2024-08-29
You can see this correlates perfectly with the graph. I'm pretty sure this is due to the FreeBSD ICMP issue and not Python.
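For anyone who wants to watch the raw number behind that graph, something like this should print the current pf state count once a minute (output format per FreeBSD's pfctl(8)):

while true; do date; pfctl -si | grep "current entries"; sleep 60; done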
To be 100% fair the OP posted the CPU graph until you made it about states. The FreeBSD SA problem is not in dispute and not relevant here either.
https://forum.opnsense.org/index.php?topic=42066.msg211020#msg211020
In the linked OpenBSD commit you can clearly read:
This allows pf to match Address Resolution, Neighbor Unreachability
Detection and Duplicate Address Detection packets to the corresponding
states without the need to create new ones or match unrelated ones.
And FreeBSD has slowly come around on their stance on why that may be important. Prior to that we already disabled state tracking for ND packets again in 24.7.3.
Cheers,
Franco
Yes, I don't want to derail this thread any further; my assumption was that such a large increase in states is most likely also correlated with an increase in CPU consumption. As I have seen a reduction in states with 24.7.3 in my setup, I just wanted to suggest checking the OP's CPU issue against 24.7.3 as well, to verify or falsify whether it has also been resolved.
But my assumption may be wrong and I won't hijack this any further.
Regarding the OP's issue:
I did check the RRD just for funzies (I went from .1 to .3). And yes, there is an increase in CPU utilization visible in the RRD, but the difference is only about 4% for "System" when comparing 24.7.1 vs 24.7.3, and no change for "User".
These are negligible values; some increase is what I would expect, since we moved to a newer version of FreeBSD and a lot of stuff is happening under the hood.
However, the OP shows much higher utilization values, specifically in user space. Check top and find out which process is doing it. It's very hard to say what is causing this without knowing your setup.
Like, what are you running?
1. Any huge alias lists?
2. Any routing protocols?
3. IPv4 or IPv6 (static, DHCP, etc.)?
4. Etc.
Regards,
S.
Sorry for not answering earlier. I was pretty busy.
The weird thing is that, like vmstat, top does not show more activity than before 24.7.2. Here is the output of top with a refresh interval of 60 seconds:
last pid: 674; load averages: 0.52, 0.47, 0.43 up 4+02:18:50 21:08:22
91 processes: 1 running, 90 sleeping
CPU: 9.8% user, 0.0% nice, 3.9% system, 0.1% interrupt, 86.3% idle
Mem: 78M Active, 538M Inact, 2122M Wired, 104K Buf, 998M Free
ARC: 1580M Total, 847M MFU, 557M MRU, 9363K Anon, 22M Header, 142M Other
1281M Compressed, 3081M Uncompressed, 2.41:1 Ratio
Swap: 8192M Total, 8192M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
49492 root 1 20 0 51M 38M nanslp 0 51:56 11.08% python3.11
19870 root 4 24 0 201M 147M kqread 1 30:53 2.32% python3.11
26229 root 1 20 0 79M 52M nanslp 1 50:16 0.86% php
16926 unbound 2 20 0 101M 76M kqread 1 1:34 0.21% unbound
94552 root 1 20 0 12M 2236K select 1 3:16 0.04% udpbroadcastrelay
87910 _flowd 1 20 0 12M 2716K select 0 1:23 0.03% flowd
96971 root 1 20 0 12M 2284K select 0 1:11 0.02% powerd
48469 root 4 68 0 13M 2680K uwait 0 0:10 0.02% dpinger
30572 root 7 20 0 45M 22M select 0 0:35 0.02% kea-dhcp4
57112 root 1 20 0 23M 8176K select 0 0:10 0.02% ntpd
94018 root 1 20 0 26M 14M select 1 0:14 0.02% python3.11
94113 root 1 20 0 27M 15M select 1 0:14 0.01% python3.11
6027 root 1 20 0 19M 9012K select 1 0:00 0.01% sshd-session
26673 root 4 68 0 13M 2644K uwait 0 0:46 0.01% dpinger
45218 nobody 1 20 0 12M 2176K sbwait 0 0:31 0.01% samplicate
23713 root 2 20 0 46M 15M kqread 0 0:21 0.01% syslog-ng
58815 root 1 20 0 12M 2348K select 0 0:02 0.01% igmpproxy
82211 root 1 20 0 14M 3748K CPU0 0 0:00 0.00% top
44911 dhcpd 1 20 0 24M 12M select 1 0:03 0.00% dhcpd
59765 root 2 20 0 22M 11M nanslp 1 0:11 0.00% monit
97725 root 1 20 0 22M 11M kqread 0 0:05 0.00% lighttpd
In my case, I see a slight increase in CPU temperature. I checked my states and indeed they went up after 24.7.2 and came down slightly again with 24.7.3, but they are still higher than with 24.7.0 and 24.7.1 (see attached screenshot)!?
I have 165115 alias table entries, most of them from the internal IPv6 bogon networks alias. I didn't change this between 24.1 and 24.7, though I don't remember the exact count under 24.1.
I use both IPv4 and IPv6. I didn't notice any issue with IPv6, except that I have to reload my WAN interface manually to get the correct IPv6 gateway!? This was working fine before 24.7.
I disabled IDS because I thought it might be causing the high CPU load but this did not change anything.
But I think I just found the problem :) I looked at /var/db/rrd/updaterrd.sh and saw that the CPU stats come from cpustats. I ran that program in a loop, and clearly it just reports the instantaneous CPU load. For RRD this will not work, as the value depends on whatever happens to be running at the moment of collection and does not account for what happened during the last minute. vmstat 60 or top -s 60 show the correct behavior.
root@opnsense:~ # sh -c "while [ true ] ; do cpustats ; sleep 2 ; done"
1.6:0.0:2.3:0.0:96.1
49.6:0.0:2.0:0.0:48.4
0.0:0.0:1.6:0.0:98.4
0.0:0.0:1.6:0.4:98.0
6.7:0.0:2.7:0.4:90.2
0.8:0.0:2.0:0.0:97.2
0.0:0.0:2.8:0.0:97.2
0.8:0.0:5.9:0.0:93.3
0.0:0.0:1.9:0.0:98.1
0.8:0.0:4.7:0.0:94.5
1.5:0.0:5.8:0.0:92.7
12.5:0.0:4.3:0.4:82.8
0.4:0.0:1.2:0.0:98.4
0.4:0.0:3.9:0.0:95.7
0.4:0.0:6.3:0.0:93.4
0.0:0.0:1.5:0.0:98.5
0.8:0.0:3.0:0.0:96.2
21.3:0.0:18.9:0.0:59.8
41.5:0.0:20.2:0.0:38.4
0.4:0.0:1.6:0.0:98.0
0.8:0.0:1.6:0.0:97.7
11.2:0.0:4.2:0.0:84.6
49.2:0.0:2.7:0.4:47.7
12.9:0.0:6.6:0.4:80.1
0.8:0.0:1.6:0.0:97.7
0.0:0.0:3.1:0.0:96.9
0.4:0.0:3.1:0.0:96.5
0.4:0.0:4.3:0.0:95.3
13.3:0.0:3.5:0.0:83.2
root@opnsense:~ # sh -c "while [ true ] ; do cpustats ; sleep 30 ; done"
44.5:0.0:8.7:0.0:46.9
0.4:0.0:2.7:0.0:96.9
0.4:0.0:1.1:0.0:98.5
0.4:0.0:1.9:0.4:97.3
0.4:0.0:1.6:0.0:98.0
45.3:0.0:3.5:0.0:51.2
0.0:0.0:2.0:0.0:98.0
I don't know how this was done before the script was moved to a cron job, but something must have been different.
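Just to illustrate the difference, here is a minimal sketch of an interval-averaged measurement; it assumes nothing beyond FreeBSD's kern.cp_time sysctl (cumulative user/nice/system/interrupt/idle tick counters) and plain sh, and is not meant as the actual fix:

#!/bin/sh
# Sketch only: average CPU usage over an interval by diffing the cumulative
# kern.cp_time tick counters (user nice system interrupt idle).
INTERVAL=${1:-60}

set -- $(sysctl -n kern.cp_time)
u1=$1 n1=$2 s1=$3 i1=$4 id1=$5

sleep "$INTERVAL"

set -- $(sysctl -n kern.cp_time)
du=$(($1 - u1)) dn=$(($2 - n1)) ds=$(($3 - s1)) di=$(($4 - i1)) did=$(($5 - id1))
total=$((du + dn + ds + di + did))

# Same user:nice:system:interrupt:idle layout as the cpustats output above,
# but averaged over the whole interval instead of a single snapshot.
echo "$((100 * du / total)):$((100 * dn / total)):$((100 * ds / total)):$((100 * di / total)):$((100 * did / total))"

Run once per collection interval, that would produce values comparable to what vmstat 60 reports rather than a point-in-time sample.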
Should I open a ticket on GitHub for this? Thank you.
Cyril