High CPU-load

Started by ThomasE, December 02, 2024, 11:36:23 AM

Hi,

Our system is an Intel(R) Xeon(R) E-2378 CPU @ 2.60GHz (8 cores, 8 threads) with 64GB of RAM and four Intel 10Gbit NICs. We have about 300 VLAN interfaces and a symmetric 2GBit/s line to the internet, which will hopefully become 4GBit/s in the near future. Almost all traffic is internet traffic and thus limited by our external connection. We're running a captive portal with a few hundred connected clients and the usual DHCP, unbound DNS, NTP - all of which shouldn't need large amounts of CPU power.

During normal operation this setup works just fine (load ~5), but as soon as we do something out of the ordinary - for example starting updates on a large number of devices simultaneously - our system can't handle it any more. The load goes up to over 100, VPN gives up completely and everything else becomes very, very slow. Since internal traffic is almost negligible, throughput hardly exceeds 2GBit/s even then, so we're seriously concerned about what happens when we increase our bandwidth as planned.

We've already worked through some performance guides and have implemented the following changes:


  • machdep.hyperthreading_allowed = 0
  • net.isr.maxthreads = -1
  • net.pf.source_nodes_hashsize
  • kern.ipc.maxsockbuf = 16777216

However, none of those seems to improve things significantly.

Current RAM usage never exceeds 3GB, which is a bit odd IMHO. While I'm aware that 64GB may well be quite a bit more than needed, 3GB on the other hand seems pretty low considering our rather big environment.

Do we really need better hardware or what other things are worth looking at to improve performance?

Regards
Thomas

You should probably look at "top" to find the process that is causing this. I doubt that the plain routing would cause that high load. It could be some kind of secondary cause, like Zenarmor or suricata or probably even just logging of default firewall rules.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

Quote from: meyergru on December 02, 2024, 11:48:56 AM
You should probably look at "top" to find the process that is causing this. I doubt that the plain routing would cause that high load. It could be some kind of secondary cause, like Zenarmor or suricata or probably even just logging of default firewall rules.
There is indeed one process that's quite noticeable:

/usr/local/sbin/lighttpd -f /var/etc/lighttpd-cp-zone-0.conf

It has an aggregated CPU time of 15 hours (uptime: 5 days) and uses between 20% and 40% CPU all the time. Looks like Captive Portal to me. While this is more than I expected, I would assume it remains fairly constant and doesn't increase as traffic goes up...
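
For scale, the two figures above can be turned into a long-run average. This is only a rough sketch of the arithmetic, assuming the uptime is exactly 5 days and that the percentage shown is relative to one core (100% = one core); the function name is made up for illustration.

```python
def average_core_share(cpu_hours: float, uptime_days: float) -> float:
    """Average fraction of a single core used over the whole uptime."""
    return cpu_hours / (uptime_days * 24.0)


# 15 hours of accumulated CPU time over 5 days (= 120 hours) of uptime
share = average_core_share(cpu_hours=15, uptime_days=5)
print(f"average: {share:.1%} of one core")
print(f"share of all 8 cores: {share / 8:.2%}")
```

The long-run average comes out lower than the 20% to 40% seen live, which would fit a process that is busier during the day than at night, but an average alone cannot confirm that.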

I meant processes whose usage goes up while you are in that kind of situation. I would think that excessive logging, and processes that consume logs such as Zenarmor or CrowdSec, will then take more CPU cycles. Thus, reducing logging might fix it.

Quote from: meyergru on December 02, 2024, 01:10:18 PM
I meant processes whose usage goes up while you are in that kind of situation. I would think that excessive logging, and processes that consume logs such as Zenarmor or CrowdSec, will then take more CPU cycles. Thus, reducing logging might fix it.
We don't use Zenarmor, CrowdSec or anything known to take a lot of CPU... As I said, a handful of people connecting via OpenVPN (or trying to do so) plus the usual stuff (DHCP, DNS, NTP) - that's it. Firewall Logging is currently disabled and only used for debug purposes. Got to be missing something, but I don't know where to look... :-(

We need to wait for the next batch of updates to watch the system under heavy load. Besides [h]top - is there anything we should specifically look at?


You will most likely see the culprit when the situation arises.

Ok, here we go. We just ran into the issue for a few minutes. I attached the output of htop during the event (1.jpg). I just realized the sort column is poorly chosen, but maybe this gives a hint anyway...

Any ideas?

OT: How can I include attached pictures inside the posting?

There seem to be no processes using up all the CPU. Lighttpd is fine. I do not know what that configd process does.

I only see that you use netflow and a captive portal, maybe you should disable them to see if that fixes the problem. With Netflow, I have seen database corruptions that repeatedly hogged the CPU. There is a button to reset the netflow data, or you can disable that altogether.

Quote from: ThomasE on December 11, 2024, 10:52:31 AM
Ok, here we go. We just ran into the issue for a few minutes. I attached the output of htop during the event (1.jpg). I just realized the sort column is poorly chosen, but maybe this gives a hint anyway...

Any ideas?

OT: How can I include attached pictures inside the posting?

Upload the picture to the forum and copy its link, then edit your post, click on Insert image and paste the link into it.

Netflow can be CPU-heavy in some cases, as @meyergru mentions. Try disabling it, as well as any other additional services (shaper, captive portal, etc.).

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Quote from: meyergru on December 11, 2024, 11:50:03 AM
There seem to be no processes using up all the CPU. Lighttpd is fine. I do not know what that configd process does.
The configd is a random occurrence. I assume that a colleague modified the configuration just when I was making that screenshot. I've been watching it for some time now - it's at 0.0% CPU all the time, so I think this can be safely ignored.

The fact that there are no processes using up all that CPU is what puzzles me most. When looking at the per-core CPU usage shown in htop, all eight cores are at values between 40% and 100%. Rarely, if ever, does one of them drop below that range. On the other hand, the only process that's continually using more than 1% CPU (lighttpd -f /var/etc/lighttpd-cp-zone-0.conf) uses between 20% and 40% CPU. The number of processes shown as Running is 2 most of the time (htop and lighttpd) and never exceeds 4 - yet the current load average is around 7. Every few seconds I can see a number of other processes showing up:

iftop --nNb -i vlan0.x.y -s2 -t

There're quite a few of them - obviously, because we have a lot of VLANs - but from what I can tell, they can only account for very short spikes in CPU usage - not what we currently observe. This picture shows some of the iftop processes which I'd consider "typical".

[Attachment not available: htop screenshot showing some of the iftop processes]

Most of the time three or four of those processes can be seen; sometimes there are none, sometimes there are up to 30.

Quote from: meyergru on December 11, 2024, 11:50:03 AM
I only see that you use netflow and a captive portal, maybe you should disable them to see if that fixes the problem. With Netflow, I have seen database corruptions that repeatedly hogged the CPU. There is a button to reset the netflow data, or you can disable that altogether.
I've disabled netflow for the time being as we don't really need it though it's certainly nice to have. No change.

Captive Portal is right on top of our list of "suspects", though I'm still unsure. We have a few hundred concurrent sessions. Yes, that's quite a bit, but then again it's not that much, I think. What might be part of the problem is that our CP has a very high general "availability": There are almost 100 buildings scattered throughout the whole city allowing unauthenticated access to the WiFi that leads to our CP. I would assume that not too many people randomly try to actively connect to an open WiFi just because "it's there", but I'm not sure what their smartphones are doing in the background.

You've already done a lot of performance tuning! Given your setup, here are a few additional suggestions:

Check for Software Updates: Ensure that all your software, including the OS and OPNsense, is up to date. Sometimes performance improvements are included in updates.

Optimize DNS and NTP Settings: Fine-tune your DNS and NTP configurations to ensure they're not causing unnecessary load.

Monitor CPU and Memory Usage: Use tools like htop or top to monitor real-time CPU and memory usage. This can help identify any processes that are consuming more resources than expected.

Consider Load Balancing: If possible, distribute the load across multiple servers to prevent any single server from becoming a bottleneck.

Evaluate Network Configuration: Double-check your network settings to ensure there are no misconfigurations causing unnecessary traffic or delays.

If these steps don't help, it might be worth considering hardware upgrades or consulting with a performance specialist to identify any underlying issues.

Good luck, and I hope this helps!

8 Ice Lake cores, and Lighttpd is eating 50%? Whew. Definitely the captive portal, but that's still a lot of CPU. Pure (and uninformed) speculation, but could network queueing be bottling up one core or particular cores? "netstat -Q" only gives that info when you have RSS enabled - I'm not sure how to determine it otherwise. htop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.

Quote from: pfry on January 05, 2025, 01:10:21 AM
8 Ice Lake cores, and Lighttpd is eating 50%? Whew. Definitely the captive portal, but that's still a lot of CPU.
Sadly, it's definitely not the CP. We've set up an even better machine with the exact same configuration in a clean lab environment, so there're no CP or any other clients doing anything. We placed two test machines into two different VLANs and ran iperf for testing. Within seconds the load average reported by top goes up beyond 20.

Quote
Pure (and uninformed) speculation, but could network queueing be bottling up one core or particular cores? "netstat -Q" only gives that info when you have RSS enabled - I'm not sure how to determine it otherwise.
From what I can tell, the traffic handling is evenly spread across all CPUs; netstat -Q seems to confirm that:

Workstreams:
WSID CPU   Name         Len WMark    Disp'd  HDisp'd   QDrops   Queued   Handled
   0   0   ip           135  1000         0    45706      201  7240161   7285548
   0   0   igmp           0     0         0        0        0        0         0
   0   0   rtsock         0     0         0        0        0        0         0
   0   0   arp            0     0      2019        0        0        0      2019
   0   0   ether          0     0 123022625        0        0        0 123022625
   0   0   ip6            0     0         0        0        0        0         0
   0   0   ip_direct      0     0         0        0        0        0         0
   0   0   ip6_direct     0     0         0        0        0        0         0
   1   1   ip             2  1000         0    41053     1569  8672745   8713708
   1   1   igmp           0     0         0        0        0        0         0
   1   1   rtsock         0     0         0        0        0        0         0
   1   1   arp            0     0         0        0        0        0         0
   1   1   ether          0     0 128714595        0        0        0 128714595
   1   1   ip6            0     0         0        0        0        0         0
   1   1   ip_direct      0     0         0        0        0        0         0
   1   1   ip6_direct     0     0         0        0        0        0         0
   2   2   ip           286  1000         0   141216     2463  8082614   8223246
   2   2   igmp           0     0         0        0        0        0         0
   2   2   rtsock         0     0         0        0        0        0         0
   2   2   arp            0     0         0        0        0        0         0
   2   2   ether          0     0 132776645        0        0        0 132776645
   2   2   ip6            0     0         0        0        0        0         0
   2   2   ip_direct      0     0         0        0        0        0         0
   2   2   ip6_direct     0     0         0        0        0        0         0

(It goes on like that for all other cores...)
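
To put a number on "evenly spread", the Handled and QDrops columns can be summed per CPU. The following is a minimal sketch, assuming the output has the ten whitespace-separated columns shown above; the sample contains only the non-zero ip and ether rows of the three CPUs listed, and the function name is made up for illustration.

```python
from collections import defaultdict

# Non-zero rows copied from the netstat -Q output above (CPUs 0 to 2 only).
SAMPLE = """\
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip       135  1000        0    45706      201  7240161  7285548
   0   0   ether      0     0 123022625        0        0        0 123022625
   1   1   ip         2  1000        0    41053     1569  8672745  8713708
   1   1   ether      0     0 128714595        0        0        0 128714595
   2   2   ip       286  1000        0   141216     2463  8082614  8223246
   2   2   ether      0     0 132776645        0        0        0 132776645
"""


def per_cpu_totals(text):
    """Sum the Handled and QDrops columns per CPU over all protocols."""
    handled = defaultdict(int)
    qdrops = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 10 fields and start with a numeric WSID;
        # this skips the header and any other text.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        cpu = int(fields[1])
        qdrops[cpu] += int(fields[7])   # QDrops column
        handled[cpu] += int(fields[9])  # Handled column
    return dict(handled), dict(qdrops)


handled, qdrops = per_cpu_totals(SAMPLE)
mean = sum(handled.values()) / len(handled)
for cpu in sorted(handled):
    ratio = handled[cpu] / mean
    print(f"cpu {cpu}: handled={handled[cpu]} ({ratio:.2f}x mean), qdrops={qdrops[cpu]}")
```

On these three CPUs the handled counts are within a few percent of each other, so the sample supports the "evenly spread" reading. The non-zero QDrops on the ip rows suggest that the queue limit was reached at times, which may be worth watching under load.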

Quote
htop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
The interrupt stat as shown by top is indeed interesting - it's between 80% and 90%! Then there's less than 1% user, a bit more than 1% system and the remaining ~15% is shown as idle.
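
For logging this during the next load event, the CPU summary line of top can be parsed with a few lines of script. This is a sketch only: the sample line is invented to match the figures described above (it is not captured output), it assumes the usual FreeBSD top layout of "CPU: x% user, x% nice, x% system, x% interrupt, x% idle", and both function names are made up for illustration.

```python
import re


def parse_top_cpu_line(line):
    """Return a dict such as {'user': 0.8, ...} from a top 'CPU:' summary line."""
    return {name: float(pct) for pct, name in re.findall(r"([\d.]+)%\s+(\w+)", line)}


def dominant_state(states):
    """Name of the non-idle state with the largest share."""
    busy = {name: pct for name, pct in states.items() if name != "idle"}
    return max(busy, key=busy.get)


# Invented sample line that mirrors the percentages described in the post.
sample = "CPU:  0.8% user,  0.0% nice,  1.3% system, 84.0% interrupt, 13.9% idle"
states = parse_top_cpu_line(sample)
print(states)
print("dominant state:", dominant_state(states))
```

Time spent in the interrupt state is not charged to any userland process, which would be consistent with htop showing busy cores but no heavy process. On FreeBSD, top -SH lists kernel and interrupt threads individually and vmstat -i shows interrupt counts per source, which might help narrow it down further.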

We've already played a bit with the tunables including, but not limited to:

dev.ixl.0.iflib.override_nrxqs=32
dev.ixl.0.iflib.override_ntxqs=32
dev.ixl.1.iflib.override_nrxqs=32
dev.ixl.1.iflib.override_ntxqs=32
machdep.hyperthreading_allowed=0
net.inet.ip.fw.dyn_buckets=16777216
net.inet.ip.fw.dyn_max=16777216
net.inet.rss.enabled=1
net.isr.maxthreads=-1
net.pf.source_nodes_hashsize=1048576
I do admit that we don't fully (or sometimes at all) understand what those optimizations do, but we encountered them while reading various guides and tried them. However, setting them seemed to have very little effect at best. Updating the NIC drivers for our Intel card reduced total throughput from around 8Gbit/s to about 6Gbit/s, along with what seems to be a slightly lower load average, but we didn't do any precise measurements there. We also tried using a Broadcom NIC instead of Intel - no change.
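
When many tunables are changed at once, it can help to group them by sysctl subtree so each group can be reverted and tested separately. Below is a minimal sketch using the list above; the grouping depth of two name components and the function name are arbitrary choices for illustration.

```python
# Tunables copied from the list above.
TUNABLES = """\
dev.ixl.0.iflib.override_nrxqs=32
dev.ixl.0.iflib.override_ntxqs=32
dev.ixl.1.iflib.override_nrxqs=32
dev.ixl.1.iflib.override_ntxqs=32
machdep.hyperthreading_allowed=0
net.inet.ip.fw.dyn_buckets=16777216
net.inet.ip.fw.dyn_max=16777216
net.inet.rss.enabled=1
net.isr.maxthreads=-1
net.pf.source_nodes_hashsize=1048576
"""


def group_by_subtree(text, depth=2):
    """Group name=value lines by the first `depth` components of the name."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, value = (part.strip() for part in line.split("=", 1))
        key = ".".join(name.split(".")[:depth])
        groups.setdefault(key, {})[name] = value
    return groups


groups = group_by_subtree(TUNABLES)
for subtree, settings in groups.items():
    print(subtree)
    for name, value in settings.items():
        print(f"  {name} = {value}")
```

As far as I can tell, the net.inet.ip.fw.* names belong to ipfw rather than pf, so they would not be expected to change plain pf filtering performance; treat that as an assumption to verify rather than a finding.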

Quote from: ThomasE on January 10, 2025, 02:20:35 PM
Sadly, it's definitely not the CP.
[...]

What's the top process on your new machine?

Crud. I have the hardware to test, but no bench space and no software. It'll be a while before I can test higher than 1Gb, and this issue interests me.

While I'm posting useless text, ixl... are you up to date? x710 NVM updater (generic) (Minimal changes from 9.52.) (Updated firmware is critical for DPDK, but less so in other applications. I was testing DANOS/Vyatta and VPP, so I got into the habit.)