OPNsense Forum

English Forums => Hardware and Performance => Topic started by: ThomasE on December 02, 2024, 11:36:23 AM

Title: High CPU-load
Post by: ThomasE on December 02, 2024, 11:36:23 AM
Hi,

our system is an Intel(R) Xeon(R) E-2378 CPU @ 2.60GHz (8 cores, 8 threads) with 64GB of RAM and four Intel 10Gbit NICs. We have about 300 VLAN interfaces and a symmetric 2GBit/s internet connection, which will hopefully become 4GBit/s in the near future. Almost all traffic is internet traffic and thus limited by our external connection. We're running a captive portal with a few hundred connected clients plus the usual DHCP, Unbound DNS and NTP - none of which should need large amounts of CPU power.

During normal operation this setup works just fine (load ~5), but as soon as we do something out of the ordinary - for example starting updates on a large number of devices simultaneously - our system can't handle it any more. The load goes up to over 100, VPN gives up completely and everything else becomes very, very slow. Since the throughput hardly exceeds 2GBit/s (internal traffic is almost negligible), we're seriously concerned about what will happen when we increase our bandwidth as planned.

We've already worked through some performance guides and have implemented the following changes:


However, none of them seems to improve things significantly.

Current RAM usage never exceeds 3GB, which is a bit odd IMHO. While I'm aware that 64GB may well be quite a bit more than needed, 3GB on the other hand seems pretty low considering our rather big environment.

Do we really need better hardware, or what else is worth looking at to improve performance?

Regards
Thomas
Title: Re: High CPU-load
Post by: meyergru on December 02, 2024, 11:48:56 AM
You should probably look at "top" to find the process that is causing this. I doubt that plain routing would cause that high a load. It could be some kind of secondary cause, like Zenarmor or Suricata, or perhaps even just logging of default firewall rules.
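If it helps: what I would roughly run (untested, from memory) to get a per-CPU view that includes kernel threads is

top -SHPz

where -S includes system processes, -H shows individual threads, -P adds a per-CPU summary and -z hides the idle process. If the load sits in the kernel (interrupt or netisr threads) rather than in a userland daemon, that should make it visible.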
Title: Re: High CPU-load
Post by: ThomasE on December 02, 2024, 12:48:17 PM
Quote from: meyergru on December 02, 2024, 11:48:56 AM
You should probably look at "top" to find the process that is causing this. I doubt that plain routing would cause that high a load. It could be some kind of secondary cause, like Zenarmor or Suricata, or perhaps even just logging of default firewall rules.
There is indeed one process that's quite noticeable:

/usr/local/sbin/lighttpd -f /var/etc/lighttpd-cp-zone-0.conf

It has an aggregated CPU time of 15 hours (uptime: 5 days) and uses up between 20% and 40% all the time. Looks like Captive Portal to me. While this is more than I expected, I would assume it remains somewhat constant and doesn't increase as traffic goes up...
Title: Re: High CPU-load
Post by: meyergru on December 02, 2024, 01:10:18 PM
I meant processes whose usage goes up while you are in that kind of situation. I would think that excessive logging, plus some process that consumes the logs (like Zenarmor or CrowdSec), will then take more CPU cycles. Thus, reducing logging might fix it.
Title: Re: High CPU-load
Post by: ThomasE on December 02, 2024, 02:23:23 PM
Quote from: meyergru on December 02, 2024, 01:10:18 PM
I meant processes whose usage goes up while you are in that kind of situation. I would think that excessive logging, plus some process that consumes the logs (like Zenarmor or CrowdSec), will then take more CPU cycles. Thus, reducing logging might fix it.
We don't use Zenarmor, CrowdSec or anything else known to take a lot of CPU... As I said, a handful of people connecting via OpenVPN (or trying to do so) plus the usual stuff (DHCP, DNS, NTP) - that's it. Firewall logging is currently disabled and only enabled for debugging purposes. I've got to be missing something, but I don't know where to look... :-(

We need to wait for the next batch of updates to watch the system under heavy load. Besides [h]top - is there anything we should specifically look at?

Title: Re: High CPU-load
Post by: meyergru on December 02, 2024, 03:09:28 PM
You will most likely see the culprit when the situation arises.
Title: Re: High CPU-load
Post by: ThomasE on December 11, 2024, 10:52:31 AM
Ok, here we go. We just ran into the issue for a few minutes. I attached the output of htop during the event (1.jpg). I just realized the sort column is poorly chosen, but maybe this gives a hint anyway...

Any ideas?

OT: How can I include attached pictures inside the posting?
Title: Re: High CPU-load
Post by: meyergru on December 11, 2024, 11:50:03 AM
There seem to be no processes using up all the CPU. Lighttpd is fine. I do not know what that configd process does.

I only see that you use Netflow and a captive portal; maybe you should disable them to see if that fixes the problem. With Netflow, I have seen database corruptions that repeatedly hogged the CPU. There is a button to reset the Netflow data, or you can disable it altogether.
Title: Re: High CPU-load
Post by: Seimus on December 11, 2024, 11:55:58 AM
Quote from: ThomasE on December 11, 2024, 10:52:31 AM
Ok, here we go. We just ran into the issue for a few minutes. I attached the output of htop during the event (1.jpg). I just realized the sort column is poorly chosen, but maybe this gives a hint anyway...

Any ideas?

OT: How can I include attached pictures inside the posting?

Copy the link once you upload the picture to the forum > edit your post > click on Insert image > paste the link into it.

Netflow can be CPU-heavy in some cases, as @meyergru mentions. Try disabling it, as well as any other additional services (shaper, captive portal, etc.).

Regards,
S.
Title: Re: High CPU-load
Post by: ThomasE on December 17, 2024, 11:19:28 AM
Quote from: meyergru on December 11, 2024, 11:50:03 AMThere seem to be no processes using up all the CPU. Lighttpd is fine. I do not know what that configd process does.
The configd process is a random occurrence. I assume that a colleague modified the configuration just when I was taking that screenshot. I've been watching it for some time now - it's at 0.0% CPU all the time, so I think it can be safely ignored.

The fact that there are no processes using up all that CPU is what puzzles me most. Looking at the per-core CPU usage as shown in htop, all eight cores sit between 40% and 100%, and rarely, if ever, does one of them drop below that. On the other hand, the only process that's continually using more than 1% CPU (lighttpd -f /var/etc/lighttpd-cp-zone-0.conf) uses between 20% and 40% CPU. The number of processes shown as Running is 2 most of the time (htop and lighttpd) and never exceeds 4 - yet the current load average is around 7. Every few seconds I can see a number of other processes showing up:

iftop --nNb -i vlan0.x.y -s2 -t

There are quite a few of them - obviously, because we have a lot of VLANs - but from what I can tell, they can only account for very short spikes in CPU usage, not what we currently observe. This picture shows some of the iftop processes which I'd consider "typical".

Bildschirmfoto_2024-12-17_11-06-42.png

Most of the time around three or four of those processes can be seen; sometimes there are none, sometimes up to 30.

Quote from: meyergru on December 11, 2024, 11:50:03 AMI only see that you use netflow and a captive portal, maybe you should disable them to see if that fixes the problem. With Netflow, I have seen database corruptions that repeatedly hogged the CPU. There is a button to reset the netflow data, or you can disable that altogether.
I've disabled Netflow for the time being, as we don't really need it (though it's certainly nice to have). No change.

Bildschirmfoto_2024-12-17_09-52-00.png

Captive Portal is right on top of our list of "suspects", though I'm still unsure. We have a few hundred concurrent sessions. Yes, that's quite a bit, but then again it's not that much, I think. What might be part of the problem is that our CP has a very high general "availability": There are almost 100 buildings scattered throughout the whole city allowing unauthenticated access to the WiFi that leads to our CP. I would assume that not too many people randomly try to actively connect to an open WiFi just because "it's there", but I'm not sure what their smartphones are doing in the background.
Title: Re: High CPU-load
Post by: MakaHomes on January 04, 2025, 06:41:39 PM
You've already done a lot of performance tuning! Given your setup, here are a few additional suggestions:

Check for Software Updates: Ensure that all your software, including the OS and OPNsense, is up to date. Sometimes performance improvements are included in updates.

Optimize DNS and NTP Settings: Fine-tune your DNS and NTP configurations to ensure they're not causing unnecessary load.

Monitor CPU and Memory Usage: Use tools like htop or top to monitor real-time CPU and memory usage. This can help identify any processes that are consuming more resources than expected.

Consider Load Balancing: If possible, distribute the load across multiple servers to prevent any single server from becoming a bottleneck.

Evaluate Network Configuration: Double-check your network settings to ensure there are no misconfigurations causing unnecessary traffic or delays.

If these steps don't help, it might be worth considering hardware upgrades or consulting with a performance specialist to identify any underlying issues.

Good luck, and I hope this helps!
Title: Re: High CPU-load
Post by: pfry on January 05, 2025, 01:10:21 AM
8 Ice Lake cores, and Lighttpd is eating 50%? Whew. Definitely the captive portal, but that's still a lot of CPU. Pure (and uninformed) speculation, but could network queueing be bottling up one core or particular cores? "netstat -Q" only gives that info when you have RSS enabled - I'm not sure how to determine it otherwise. htop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
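For reference, roughly what I would check first (assuming these sysctls exist on your build):

sysctl net.inet.rss.enabled net.inet.rss.bits
sysctl net.isr.dispatch net.isr.maxthreads net.isr.bindthreads
netstat -Q | head -30

The first line shows whether RSS is active and how many bits (and thus buckets) it uses, the second how netisr dispatches and binds its threads, and netstat -Q then shows how the work is actually spread across the workstreams.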
Title: Re: High CPU-load
Post by: ThomasE on January 10, 2025, 02:20:35 PM
Quote from: pfry on January 05, 2025, 01:10:21 AM8 Ice Lake cores, and Lighttpd is eating 50%? Whew. Definitely the captive portal, but that's still a lot of CPU.
Sadly, it's definitely not the CP. We've set up an even better machine with the exact same configuration in a clean lab environment, so there are no CP clients or anything else doing anything. We placed two test machines into two different VLANs and ran iperf for testing. Within seconds the load average reported by top goes beyond 20.

QuotePure (and uninformed) speculation, but could network queueing be bottling up one core or particular cores? "netstat -Q" only gives that info when you have RSS enabled - I'm not sure how to determine it otherwise.
From what I can tell, the traffic handling is spread evenly across all CPUs; netstat -Q seems to confirm that:

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip       135  1000        0    45706      201  7240161  7285548
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     0        0        0        0        0        0
   0   0   arp        0     0     2019        0        0        0     2019
   0   0   ether      0     0 123022625        0        0        0 123022625
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ip_direct     0     0        0        0        0        0        0
   0   0   ip6_direct     0     0        0        0        0        0        0
   1   1   ip         2  1000        0    41053     1569  8672745  8713708
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     0        0        0        0        0        0
   1   1   arp        0     0        0        0        0        0        0
   1   1   ether      0     0 128714595        0        0        0 128714595
   1   1   ip6        0     0        0        0        0        0        0
   1   1   ip_direct     0     0        0        0        0        0        0
   1   1   ip6_direct     0     0        0        0        0        0        0
   2   2   ip       286  1000        0   141216     2463  8082614  8223246
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     0        0        0        0        0        0
   2   2   ether      0     0 132776645        0        0        0 132776645
   2   2   ip6        0     0        0        0        0        0        0
   2   2   ip_direct     0     0        0        0        0        0        0
   2   2   ip6_direct     0     0        0        0        0        0        0

(It goes on like that for all other cores...)

Quotehtop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
The interrupt stat as shown by top is indeed interesting - it's between 80% and 90%! Then there's less than 1% user, a bit more than 1% system and the remaining ~15% is shown as idle.

We've already played a bit with the tunables including, but not limited to:

dev.ixl.0.iflib.override_nrxqs=32
dev.ixl.0.iflib.override_ntxqs=32
dev.ixl.1.iflib.override_nrxqs=32
dev.ixl.1.iflib.override_ntxqs=32
machdep.hyperthreading_allowed=0
net.inet.ip.fw.dyn_buckets=16777216
net.inet.ip.fw.dyn_max=16777216
net.inet.rss.enabled=1
net.isr.maxthreads=-1
net.pf.source_nodes_hashsize=1048576
I do admit that we don't fully (or sometimes at all) understand what those optimizations do, but we encountered them while reading various guides and tried them. However, setting them seemed to have very, very little effect at best. Updating the NIC drivers on our Intel card actually reduced total throughput from around 8Gbit/s to about 6Gbit/s, along with what seems to be a slightly lower load average, but we didn't do any precise measurements there. We also tried using a Broadcom NIC instead of Intel - no change.
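To be fair, we still have to verify that all of those actually took effect after a reboot; my plan is to simply read them back, something along the lines of (assuming the sysctl tree exposes the iflib overrides on our build):

sysctl net.inet.rss.enabled net.isr.maxthreads net.pf.source_nodes_hashsize
sysctl dev.ixl.0.iflib.override_nrxqs dev.ixl.0.iflib.override_ntxqs

(the dev.ixl.* ones obviously only make sense while the Intel card is installed).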
Title: Re: High CPU-load
Post by: pfry on January 10, 2025, 05:24:15 PM
Quote from: ThomasE on January 10, 2025, 02:20:35 PMSadly, it's definitely not the CP.
[...]

What's the top process on your new machine?

Crud. I have the hardware to test, but no bench space and no software. It'll be a while before I can test higher than 1Gb, and this issue interests me.

While I'm posting useless text, ixl... are you up to date? x710 NVM updater (generic) (https://www.intel.com/content/www/us/en/download/18190/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-700-series.html) (Minimal changes from 9.52.) (Updated firmware is critical for DPDK, but less so in other applications. I was testing DANOS/Vyatta and VPP, so I got into the habit.)
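Before flashing anything, it's worth checking what you currently run; something like this should show the firmware/NVM version the driver reports at attach (exact wording depends on the driver version):

dmesg | grep -i ixl
pciconf -lcv ixl0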
Title: Re: High CPU-load
Post by: EricPerl on January 13, 2025, 01:56:51 AM
Quote from: ThomasE on January 10, 2025, 02:20:35 PM...
Quotehtop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
The interrupt stat as shown by top is indeed interesting - it's between 80% and 90%! Then there's less than 1% user, a bit more than 1% system and the remaining ~15% is shown as idle.
...

I have no clue how to dig deeper but that looks concerning.
Title: Re: High CPU-load
Post by: EricPerl on January 14, 2025, 02:43:40 AM
If you have not done so yet, 'vmstat -i' and 'systat -vmstat' seem to be the next step wrt finding the device triggering the interrupts.
Title: Re: High CPU-load
Post by: ThomasE on January 14, 2025, 09:50:34 AM
Quote from: EricPerl on January 14, 2025, 02:43:40 AMIf you have not done so yet, 'vmstat -i' and 'systat -vmstat' seem to be the next step wrt finding the device triggering the interrupts.
Ok, so here we go...

vmstat -i
interrupt                          total       rate
cpu0:timer                     346081709        999
cpu1:timer                       5389274         16
cpu2:timer                       5441949         16
cpu3:timer                       5441140         16
cpu4:timer                       5466498         16
cpu5:timer                       5543676         16
cpu6:timer                       5480325         16
cpu7:timer                       5598821         16
cpu8:timer                       5212744         15
cpu9:timer                       5198477         15
cpu10:timer                      5221979         15
cpu11:timer                      5162333         15
cpu12:timer                      5261906         15
cpu13:timer                      5317179         15
cpu14:timer                      5368505         15
cpu15:timer                      7457476         22
cpu16:timer                      5158599         15
cpu17:timer                      5146936         15
cpu18:timer                      5188516         15
cpu19:timer                      5163081         15
cpu20:timer                      5173798         15
cpu21:timer                      5262110         15
cpu22:timer                      5264156         15
cpu23:timer                      5328887         15
cpu24:timer                      5351694         15
cpu25:timer                      5328853         15
cpu26:timer                      5348923         15
cpu27:timer                      5352703         15
cpu28:timer                      5390198         16
cpu29:timer                      5447410         16
cpu30:timer                      5463452         16
cpu31:timer                      7490578         22
irq112: ahci0                         52          0
irq113: xhci0                    5394722         16
irq115: igb0:rxq0                  89822          0
irq116: igb0:rxq1                 278626          1
irq117: igb0:rxq2                   3747          0
irq118: igb0:rxq3                   1343          0
irq119: igb0:rxq4                   9842          0
irq120: igb0:rxq5                    228          0
irq121: igb0:rxq6                    543          0
irq122: igb0:rxq7                    711          0
irq123: igb0:aq                        2          0
irq331: bxe0:sp                   346413          1
irq332: bxe0:fp00               18789253         54
irq333: bxe0:fp01               18676967         54
irq334: bxe0:fp02               17816764         51
irq335: bxe0:fp03               17740225         51
irq336: bxe1:sp                   347275          1
irq337: bxe1:fp00               21428716         62
irq338: bxe1:fp01               21227816         61
irq339: bxe1:fp02               20153853         58
irq340: bxe1:fp03               20304599         59
Total                          678115404       1957

And this is the systat under load.

    4 users    Load 26.02 17.48  8.97                  Jan 14 09:48:32
   Mem usage:   2%Phy  2%Kmem                           VN PAGER   SWAP PAGER
Mem:      REAL           VIRTUAL                        in   out     in   out
       Tot   Share     Tot    Share     Free   count
Act  2180M  98768K    518G     148M     364G   pages
All  2198M    113M    518G     262M                       ioflt  Interrupts
Proc:                                                 207 cow    272k total
  r   p   d    s   w   Csw  Trp  Sys  Int  Sof  Flt   465 zfod   1126 cpu0:timer
             170      651K   1K   2K 232K  53K   1K       ozfod  1127 cpu1:timer
                                                         %ozfod  1127 cpu2:timer
 0.8%Sys  77.6%Intr  0.1%User  0.0%Nice 21.5%Idle         daefr  1127 cpu3:timer
|    |    |    |    |    |    |    |    |    |    |   241 prcfr  1067 cpu4:timer
+++++++++++++++++++++++++++++++++++++++               855 totfr  1048 cpu5:timer
                                           dtbuf          react  1045 cpu6:timer
Namei     Name-cache   Dir-cache   6280561 maxvn          pdwak  1073 cpu7:timer
   Calls    hits   %    hits   %    441406 numvn       50 pdpgs  1083 cpu8:timer
    2847    2843 100                357588 frevn          intrn  1077 cpu9:timer
                                                    6438M wire   1103 cpu10:time
Disks   da0   cd0 pass0 pass1 pass2 pass3            134M act    1025 cpu11:time
KB/t  40.74  0.00  0.00  0.00  0.00  0.00           2251M inact  1104 cpu12:time
tps      21     0     0     0     0     0               0 laund  1086 cpu13:time
MB/s   0.82  0.00  0.00  0.00  0.00  0.00            364G free   1077 cpu14:time
%busy    59     0     0     0     0     0             57K buf    1075 cpu15:time
                                                                 1110 cpu16:time
                                                                 1081 cpu17:time
                                                                 1080 cpu18:time
                                                                 1092 cpu19:time
                                                                 1075 cpu20:time
                                                                 1062 cpu21:time
                                                                 1037 cpu22:time
                                                                 1085 cpu23:time
                                                                 1101 cpu24:time
                                                                 1072 cpu25:time
                                                                 1072 cpu26:time
                                                                 1074 cpu27:time
                                                                 1070 cpu28:time
                                                                 1115 cpu29:time
                                                                 1077 cpu30:time
                                                                 1085 cpu31:time
                                                                      ahci0 112
                                                                   68 xhci0 113
                                                                      igb0:rxq0
                                                                   28 igb0:rxq1
                                                                      igb0:rxq2
                                                                      igb0:rxq3
                                                                      igb0:rxq4
                                                                      igb0:rxq5
                                                                      igb0:rxq6
                                                                      igb0:rxq7
                                                                      igb0:aq
                                                                    1 bxe0:sp
                                                                28378 bxe0:fp00
                                                                22496 bxe0:fp01
                                                                24798 bxe0:fp02
                                                                35004 bxe0:fp03
                                                                    1 bxe1:sp
                                                                29363 bxe1:fp00
                                                                29687 bxe1:fp01
                                                                29310 bxe1:fp02
                                                                38388 bxe1:fp03
Title: Re: High CPU-load
Post by: pfry on January 14, 2025, 04:43:39 PM
Cute! I have to note those...

Are you logging to a USB flash device? Or am I misreading that? If so, it might be worth reducing storage chatter to see what happens.
Title: Re: High CPU-load
Post by: EricPerl on January 14, 2025, 11:29:34 PM
I'm quite outside of my area of expertise here but:

vmstat -i is cumulative since the system is up.
Yes it looks like some USB controller got busy but it's not during systat.
In this output, what strikes me is the uneven cpu0:timer compared to the others.

The 2nd output is live (refreshed every X secs).
The 2 BXE devices seem pretty busy. Broadcom NICs?
Some level of busy should be expected under load but that much?

Some of the optimization work might have been counterproductive....

Also, it might be worth looking at the details of the slots used on the MB: PCI gen, lanes, exclusions...
I don't have OPN on bare metal and these low-level tools tend to be pretty distro specific...


Title: Re: High CPU-load
Post by: ThomasE on January 15, 2025, 08:33:46 AM
QuoteCute! I have to note those...

Are you logging to a USB flash device? Or am I misreading that? If so, might be worth reducing storage chatter and see what happens.
There's no USB device attached and we're only logging critical errors as everything above that is guaranteed to severely overload the system. ;-)

QuoteI'm quite outside of my area of expertise here but:
So am I, so welcome to the club. ;-)

Quotevmstat -i is cumulative since the system is up.
Yes it looks like some USB controller got busy but it's not during systat.
In this output, what strikes me is the uneven cpu0:timer compared to the other.
I must admit that I didn't even notice that. Maybe the system defaults to cpu0 and only does round robin (or whatever) on the other cores once that one is busy?

QuoteThe 2nd output is live (refreshed every X secs).
The 2 BXE devices seem pretty busy. Broadcom NICs?
Some level of busy should be expected under load but that much?
Yes, Broadcom NICs. Intel NICs behave pretty much the same. At that time I was using

iperf3 -c 10.199.0.150 -p 5201 -P 128 -t 120
to push around 8Gbit/s of traffic through those interfaces, so yes, that is quite a bit. However - and that is where we get to the original question - I think this machine should be able to handle this with ease, especially if it's the only thing going on there...

QuoteSome of the optimization work might have been counterproductive....
Acknowledged, though as far as I can tell, none of it seemed to have any noticeable effect at all.

QuoteAlso, it might be worth looking at the details of the slots used on the MB: PCI gen, lanes, exclusions...
I will have a look though I admit that I don't have a clue what exactly to look for. Maybe - though not very likely - I will know once I see it. :)
Title: Re: High CPU-load
Post by: pfry on January 15, 2025, 03:58:26 PM
Quote from: ThomasE on January 15, 2025, 08:33:46 AMI will have a look though I admit that I don't have a clue what exactly to look for. Maybe - though not very likely - I will know once I see it. :)

Any Intel later than Ivy Bridge (2012) would be all v3+. I wouldn't expect PCI-e sharing or lane limits to be an issue, but it can't hurt to look. What motherboard model, with what cards?
Title: Re: High CPU-load
Post by: EricPerl on January 15, 2025, 09:48:48 PM
In addition to MB and NIC models, it's important to mention which device (not just the NICs) is connected into each slot.

The interrupt level seemed well distributed during the test (at least for the interval captured).
I'd keep an eye on load distribution to try and understand the discrepancy on the cumulative view.

You seem to have a spare. Any chance to install Proxmox and a virtualized OPN?
The drivers under Debian might be better.
This said, this could wait until we look at hardware configuration.
Title: Re: High CPU-load
Post by: gstrauss on January 23, 2025, 07:17:24 AM
Something must be wrong for lighttpd to be using that much CPU and your system not to be under heavy request load.  Is there any information in the lighttpd error log?  Is there a high request load visible in the lighttpd access log?  Can you get an strace or truss of the lighttpd process and share it?  That CPU usage is aberrant.  I am a lighttpd developer but I do not have an opnsense test system.  If you have a debugger installed on the system, a few stack traces might also be useful.  `echo bt full | gdb -p $lighttpd_pid`
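On FreeBSD, something roughly like this should capture a few seconds of syscall activity from the captive portal instance (hit Ctrl-C after a short while; $lighttpd_pid is of course a placeholder for the actual PID):

truss -f -o /tmp/lighttpd.truss -p $lighttpd_pid

Even a short capture taken while CPU usage is high would show whether lighttpd is spinning on a particular syscall or simply handling a flood of requests.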
Title: Re: High CPU-load
Post by: EricPerl on January 23, 2025, 08:46:56 PM
Per reply #12, the CPU load appears to come from processing interrupts. CP and lighttpd were ruled out.
Title: Re: High CPU-load
Post by: gstrauss on January 24, 2025, 10:57:19 AM
Replies #2, #6, and #9 showed htop with the lighttpd process having taken a large amount of CPU time, so I wonder if that is contributing.

On the off chance that there is some interaction with openssl KTLS on your large system, which might also have TLS acceleration hardware used by openssl drivers, please *test* with KTLS disabled in lighttpd, as lighttpd mod_openssl uses KTLS by default on Linux and FreeBSD, when available.  lighttpd.conf: `ssl.openssl.ssl-conf-cmd += ("Options" => "-KTLS")`, or you can *test* disabling KTLS system-wide on FreeBSD `sysctl kern.ipc.tls.enable=0`
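To check whether KTLS is actually in play on your box, dumping the relevant sysctl subtree should be enough (the exact counters available depend on the FreeBSD version):

sysctl kern.ipc.tls

If kern.ipc.tls.enable is 0, or the counters stay at zero while the captive portal is busy, then KTLS is unlikely to be a factor here.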
Title: Re: High CPU-load
Post by: ThomasE on March 13, 2025, 08:59:12 AM
Hi,

thanks for all the input that you gave. For now, I think we mitigated (rather than solved) the problem simply by throwing more hardware at it. We now have an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads) with 384GB of RAM. I'd assume that this setup is severely overpowered. Although we still have a load average of around 15 during normal operation, the system doesn't go down under pressure anymore, which was our primary goal.

The current solution is only temporary as we're planning to get a DEC4280 Appliance in the near future. Should the problem persist after that, I'll come back... ;-)

Thomas
:)
Title: Re: High CPU-load
Post by: ThomasE on June 03, 2025, 01:44:03 PM
We finally got our DEC4280 appliance and gave it a try. After installing all available updates, we imported our original configuration (with some slight changes to match the new device names). Bootup took around 15 minutes - a bit longer than usual, but that's ok. Even with just one network interface connected for administration, the GUI was extremely slow; the whole system was close to being inoperable. A simple change to a firewall rule could take as long as a minute to apply. At this time there was absolutely no traffic being routed, no captive portal clients trying to connect - there was nothing at all!

So we have the best appliance available but the system won't even run our configuration without any network load? I'm aware that our setup is quite big, but is it really that much beyond what OPNsense can possibly handle? After all, it's not the network traffic that causes issues, and we aren't even thinking about things like intrusion detection - the only thing we've got is a lot of interfaces...
Title: Re: High CPU-load
Post by: EricPerl on June 03, 2025, 07:03:51 PM
Define "a lot".
What is resource consumption like? Which processes consume?
Title: Re: High CPU-load
Post by: Cyclone3d on June 03, 2025, 11:49:25 PM
Quote from: EricPerl on June 03, 2025, 07:03:51 PMDefine "a lot".
What is resource consumption like? Which processes consume?

I registered early just to reply to this post.
"a lot" is defined in the OP (original post on page 1) - 300 VPN interfaces.
Title: Re: High CPU-load
Post by: meyergru on June 04, 2025, 12:49:10 AM
So basically what happens is that 300 VLANs - which presumably connect to a similar number of client machines - use VPN connections, capped at 4 GBit/s total. When all of those act in parallel, the problem occurs.

Even then, there is no single user process that the load can be attributed to. Thus, I would guess that the VPN calculations are the culprit. Those may be delegated to the FreeBSD kernel. The interrupt load could also point to that (maybe induced by context switches).

IDK enough about FreeBSD, but given how some other things are implemented there, I suspect that there may be multithreading issues. What I mean by that is that while the CPU would be potent enough to handle 4 Gbit/s of VPN traffic over a single connection, it may not be able to handle 300 separate VPN connections of 12 MBit/s each - probably because they are also inherently confined to one kernel encryption thread.

On a side note, space-vs-time optimizations can sometimes speed up one task by using internal multithreading while at the same time being detrimental when multiple tasks are executed at once - think of limited cache size, for example.
Thus, even if the Deciso DEC4280 appliance is sufficient for 1 stream of 4 GBit/s VPN traffic, it probably is not for 300 streams at once, but you should ask Deciso about it.

There are options to expedite VPN traffic by using hardware encryption, under System: Settings: Miscellaneous, like Intel QAT, but that would require the system to support it. What you did to mitigate the problem effectively does the same thing (giving more CPU power to the VPN tasks).

Depending on how the VPN infrastructure is built (site-2-site vs. client-2-site), you probably could use different VPN technologies (e.g. Wireguard) or employ different ciphers, which may lead to better parallelism, if my guess should turn out to be the underlying problem.
Title: Re: High CPU-load
Post by: ThomasE on June 04, 2025, 10:13:03 AM
Quote from: meyergru on June 04, 2025, 12:49:10 AMSo basically what happens is that 300 VLANs - which presumably connect to a similar number of client machines - use VPN connections, capped at 4 GBit/s total. When all of those act in parallel, the problem occurs.
The problem already occurs with no traffic at all. With the exception of one 1Gbit/s interface solely used for administration and accessing the GUI, all other interfaces were physically disconnected. (They were - of course - enabled in the configuration.) There were some VPN servers configured and activated, but they weren't being used. To be precise, we have two legacy OpenVPN servers, one "new" OpenVPN instance for testing purposes and one WireGuard instance, also for testing. Apart from that, everything else is simple routing/NAT. The firewall tables are at 2% of capacity (18875/1000000 entries).

QuoteEven then, there is no single user process that the load can be attributed to. Thus, I would guess that the VPN calculations are the culprit. Those may be delegated to the FreeBSD kernel. The interrupt load could also point to that (maybe induced by context switches).
While the first sentence is entirely true, there shouldn't be any VPN calculations at all, as VPN wasn't even in use and won't ever be used to a great extent. Even in production, there are at most 10 OpenVPN connections (road warrior).

QuoteDepending on how the VPN infrastructure is built (site-2-site vs. client-2-site), you probably could use different VPN technologies (e.g. Wireguard) or employ different ciphers, which may lead to better parallelism, if my guess should turn out to be the underlying problem.
I do agree that a significant number of established VPN connections might indeed be an issue, but this is not the case.
Title: Re: High CPU-load
Post by: meyergru on June 04, 2025, 10:40:23 AM
Initially, you said that a large number of updates causes the problem, not that it occurs out of the blue. When you deploy a new instance, this boils down to a high initial load with OpenVPN, because of the certificate handshakes.

Anyway, if the problem occurs even without specific traffic spikes, it seems to be down to the sheer number of VLANs involved. I would argue that it is quite unusual to have that many. Probably the routing table hashing mechanism in FreeBSD is not optimized for that (or can be tuned by enlarging the memory for it). As I said, I am by no means a FreeBSD expert, but I saw that you can even set the routing algorithm, see "sysctl net.route.algo.inet.algo".
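If you want to experiment (I have not tried this myself, so take it as a sketch): the active algorithm and the available choices should be visible via

sysctl net.route.algo.inet.algo
sysctl net.route.algo.inet.algo_list

and you can switch at runtime with something like "sysctl net.route.algo.inet.algo=radix4_lockless". Whether any of them copes better with several hundred interfaces, I cannot say.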
Title: Re: High CPU-load
Post by: ThomasE on June 04, 2025, 01:37:15 PM
Quote from: meyergru on June 04, 2025, 10:40:23 AMInitially, you said that a large number of updates causes the problem, not that it occurs out of the blue. When you deploy a new instance, this boils down to a high initial load with OpenVPN, because of the certificate handshakes.
Correct, that's where the whole thing started: We installed OPNsense on our somewhat older server hardware (8 cores, 16 threads, 128GB RAM). For the most part this worked just fine, but we had some issues during traffic spikes. After our attempts to solve the problem via tuning failed, we switched to the best server hardware available to us: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads), 384 GB RAM, you get the idea. This was meant to be a temporary solution as this hardware seemed way too much for that purpose and was intended for running 20+ virtual machines instead of just one firewall - and we'd need two of those machines for redundancy.

In order to rule out any hardware or driver issues, we decided to get the appliance - which performs much worse than our old server. :-(

QuoteAnyway, if the problem occurs even without specific traffic spikes, it seems to be the pure number of VLANs involved. I would argue that it is quite unsual to have that many.
I agree with you that this is indeed somewhat unusual, but that's what we've got... ;-)

QuoteProbably the routing table hashing mechanism in FreeBSD is not optimized for that (or can be tuned by enlarging the memory for it). As I said, I am by no means a FreeBSD expert, but I saw that you can even set the routing algorithm, see "sysctl net.route.algo.inet.algo".
My knowledge about FreeBSD is even more limited, but this looks like a good starting point for some more research... :)

Thanks
Thomas
Title: Re: High CPU-load
Post by: Patrick M. Hausen on June 04, 2025, 01:39:17 PM
As a customer with an official Deciso appliance I would move this discussion from the community forum to an equally official support case.
Title: Re: High CPU-load
Post by: pfry on June 04, 2025, 10:46:04 PM
Quote from: ThomasE on June 04, 2025, 01:37:15 PMIn order to rule out any hardware or driver issues, we decided to get the appliance - which performs much worse than our old server. :-(

It is PC hardware, with an embedded Zen 1 Epyc. (Aside: what is the network hardware? Intel E810s? It would be interesting to look at "dmesg | grep ice" (specifically to see the installed DDP package) and "pciconf -lcv ice0" (e.g. - I'd look at all ice devices for the PCI-e version and lanes), but those should be irrelevant to your issue. The E810 seems a bit finicky - getting it set up in a happy fashion can take effort.)

The available routing algorithms don't look bad, and your pf table is small (bogonsv6 isn't even loaded). 300 VLANs is quite a few, but I can't imagine the VLAN code would be an issue. I don't know of any interface-related jobs in OPNsense/FreeBSD that might eat CPU just from iterating over a long list. I wonder if Deciso can replicate your issue.
Title: Re: High CPU-load
Post by: ThomasE on June 10, 2025, 03:44:57 PM
Quote from: Patrick M. Hausen on June 04, 2025, 01:39:17 PMAs a customer with an official Deciso appliance I would move this discussion from the community forum to an equally official support case.
We're in constant communication with our hardware vendor (who in turn talks with Deciso) and we finally got some answers:

OPNsense will eventually run into performance issues once the number of interfaces reaches three digits. That's the core of the problem, and it fits perfectly with what we observed. It has nothing to do with NICs or any other hardware/driver issues, the amount of traffic, open CP sessions or anything else. Using more powerful hardware will mitigate the problem - likely only up to a point - but there's no way it can actually solve it. It's a software design issue - OPNsense is not optimized for a high number of interfaces. We're now in need of a network and firewall redesign. Nothing we can't handle, but obviously it would've been great if we had known this right from the beginning.

Maybe someone will take this as an opportunity to add this information to the documentation. Currently, the only hint - that a "high number of users or interface assignments may be less practical" - is found in a footnote of the product description of the appliances. It doesn't say what "high number" means - it could be 10, 100 or 1,000 depending on who you ask - and the word "impractical" doesn't mean that the whole system will collapse. A hint that with decent hardware up to 100 interfaces is likely fine, and that beyond that performance issues are to be expected, would help. :)

Title: Re: High CPU-load
Post by: Kets_One on June 10, 2025, 05:31:40 PM
@thomasE

Great that you finally got to the bottom of this.
Did you return the dec4280 to Deciso or did you find a job it can handle?
Title: Re: High CPU-load
Post by: ThomasE on June 11, 2025, 09:25:11 AM
Quote from: Kets_One on June 10, 2025, 05:31:40 PM@thomasE

Great that you finally got to the bottom of this.
Did you return the dec4280 to Deciso or did you find a job it can handle?
Returning the appliance wouldn't have solved our problem as we'd still have OPNsense running on a single, quite capable server that should be put to better use elsewhere. Under heavy network load we're already experiencing some issues and as bandwidth increases, this will get worse eventually. I'm not even talking about the consequences of building a CARP cluster for redundancy. ;-)

We're currently working on a network redesign, migrating a significant part of the interfaces to switches and routing them via a transfer network to the OPNsense. That way we can reduce the number of interfaces on the OPNsense from around 300 to about 20. Fortunately, our switches are able to handle this - we've checked that already. ;-) Lots of work ahead, but at least it will simplify setting up CARP...