High CPU-load

Started by ThomasE, December 02, 2024, 11:36:23 AM

June 04, 2025, 10:13:03 AM #30 Last Edit: June 04, 2025, 10:29:58 AM by ThomasE
Quote from: meyergru on June 04, 2025, 12:49:10 AMSo basically what happens is that 300 VLANs - which presumably connect to a similar number of client machines - use VPN connections, capped at 4 GBit/s total. When all of those act in parallel, the problem occurs.
The problem already occurs with no traffic at all. With the exception of one 1 Gbit/s interface used solely for administration and accessing the GUI, all other interfaces were physically disconnected. (They were, of course, still enabled in the configuration.) There were some VPN servers configured and activated, but they weren't being used. To be precise, we have two legacy OpenVPN servers, one "new" OpenVPN instance for testing purposes and one WireGuard instance, also for testing. Apart from that, everything else is simple routing/NAT. The firewall table is at 2% of its capacity (18875/1000000 entries).
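(Side note for anyone reading along: these numbers can be cross-checked from a root shell with something like the following - just a rough sketch, the loaded table names differ per setup:

# configured pf hard limits; table-entries should read 1000000
pfctl -sm

# list the loaded tables and count the entries in each one
for t in $(pfctl -s Tables); do printf '%s: ' "$t"; pfctl -t "$t" -T show | wc -l; done
)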

QuoteEven then, there is no single user process that the load can be attributed to. Thus, I would guess that the VPN calculations are the culprit. Those may be delegated to the FreeBSD kernel. The interrupt load could also point to that (maybe induced by context switches).
While the first sentence is entirely true, there shouldn't be any VPN calculations at all, as VPN wasn't even being used and won't ever be used to a greater extent. Even in production, there are at most 10 OpenVPN connections (road warrior).
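(For reference: where such load ends up - kernel threads vs. interrupts - can be seen with stock FreeBSD tools, roughly like this:

# per-CPU usage including kernel threads, idle threads hidden
top -SHPz

# interrupt counts and rates per device
vmstat -i
)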

QuoteDepending on how the VPN infrastructure is built (site-2-site vs. client-2-site), you probably could use different VPN technologies (e.g. Wireguard) or employ different ciphers, which may lead to better parallelism, if my guess should turn out to be the underlying problem.
I do agree that a significant number of established VPN connections might indeed be an issue, but this is not the case.

Initially, you said that a large number of updates causes the problem, not that it occurs out of the blue. When you deploy a new instance, this boils down to a high initial load with OpenVPN, because of the certificate handshakes.

Anyway, if the problem occurs even without specific traffic spikes, it seems to come down to the sheer number of VLANs involved. I would argue that it is quite unusual to have that many. Probably the routing table hashing mechanism in FreeBSD is not optimized for that (or can be tuned by enlarging the memory for it). As I said, I am by no means a FreeBSD expert, but I saw that you can even set the routing algorithm, see "sysctl net.route.algo.inet.algo".
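Roughly like this (untested on my side, and if I remember correctly the sibling algo_list node shows the available choices):

# current IPv4 FIB lookup algorithm
sysctl net.route.algo.inet.algo

# available algorithms (something like radix4, radix4_lockless, bsearch4)
sysctl net.route.algo.inet.algo_list

# switch at runtime, e.g. back to the classic radix trie
sysctl net.route.algo.inet.algo=radix4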
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

Quote from: meyergru on June 04, 2025, 10:40:23 AMInitially, you said that a large number of updates causes the problem, not that it occurs out of the blue. When you deploy a new instance, this boils down to a high initial load with OpenVPN, because of the certificate handshakes.
Correct, that's where the whole thing started: We installed OPNsense on our somewhat older server hardware (8 cores, 16 threads, 128GB RAM). For the most part this worked just fine, but we had some issues during traffic spikes. After our attempts to solve the problem via tuning failed, we switched to the best server hardware available to us: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads), 384 GB RAM, you get the idea. This was meant to be a temporary solution as this hardware seemed way too much for that purpose and was intended for running 20+ virtual machines instead of just one firewall - and we'd need two of those machines for redundancy.

In order to rule out any hardware or driver issues, we decided to get the appliance - which performs much worse than our old server. :-(

QuoteAnyway, if the problem occurs even without specific traffic spikes, it seems to come down to the sheer number of VLANs involved. I would argue that it is quite unusual to have that many.
I agree with you that this is indeed somewhat unusual, but that's what we've got... ;-)

QuoteProbably the routing table hashing mechanism in FreeBSD is not optimized for that (or can be tuned by enlarging the memory for it). As I said, I am by no means a FreeBSD expert, but I saw that you can even set the routing algorithm, see "sysctl net.route.algo.inet.algo".
My knowledge of FreeBSD is even more limited, but this looks like a good starting point for some more research... :)

Thanks
Thomas

As a customer with an official Deciso appliance, I would move this discussion from the community forum to an equally official support case.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: ThomasE on June 04, 2025, 01:37:15 PMIn order to rule out any hardware or driver issues, we decided to get the appliance - which performs much worse than our old server. :-(

It is PC hardware, an embedded Zen 1 Epyc. (Aside: what is the network hardware? Intel E810s? It would be interesting to look at "dmesg | grep ice" (specifically at the installed DDP package) and "pciconf -lcv ice0" (I'd look at all ice devices for the PCIe version and lane count), but those should be irrelevant to your issue. The E810 seems a bit finicky - getting it set up in a happy fashion can take some effort.)
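In concrete terms, assuming the ports show up as ice0...iceN, something like:

# DDP package and firmware messages logged by the ice(4) driver at boot
dmesg | grep -i ice

# PCIe capabilities (link speed and lane count) of the first port
pciconf -lcv ice0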

The available routing algorithms don't look bad, and your pf table is small (bogonsv6 isn't even loaded). 300 VLANs is quite a few, but I can't imagine the VLAN code would be an issue. I don't know of any interface-related jobs in OPNsense/FreeBSD that might eat CPU just from iterating over a long list. I wonder if Deciso can replicate your issue.
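If anyone wants to sanity-check the scale involved, a quick sketch with plain FreeBSD userland (assuming the VLAN interfaces carry "vlan" in their names, which is the OPNsense default):

# number of IPv4 routes in the kernel routing table
netstat -rn -f inet | wc -l

# number of configured VLAN interfaces
ifconfig -l | tr ' ' '\n' | grep -c vlan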