Hi,
our system is an Intel(R) Xeon(R) E-2378 CPU @ 2.60GHz (8 cores, 8 threads) with 64GB of RAM and four Intel 10Gbit NICs. We have about 300 VLAN interfaces and a symmetric 2GBit/s line to the internet, which will hopefully become 4GBit/s in the near future. Almost all traffic is internet traffic and thus limited by our external connection. We're running a captive portal with a few hundred connected clients and the usual DHCP, unbound DNS and NTP - none of which should need large amounts of CPU power.
During normal operation this setup works just fine (load ~5), but as soon as we do something out of the ordinary - for example starting updates on a large number of devices simultaneously - our system can't handle it any more. The load goes up to over 100, VPN gives up completely and everything else becomes very, very slow. Since internal traffic is almost negligible, throughput hardly ever exceeds 2GBit/s, so we're seriously concerned about what will happen when we increase our bandwidth as planned.
We've already worked through some performance guides and have implemented the following changes:
- machdep.hyperthreading_allowed = 0
- net.isr.maxthreads = -1
- net.pf.source_nodes_hashsize
- kern.ipc.maxsockbuf = 16777216
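(For reference, we applied these via the usual System > Settings > Tunables page; as a rough shell-level sketch: kern.ipc.maxsockbuf can be changed at runtime, while the other three are boot-time tunables that only take effect after a reboot.)
sysctl kern.ipc.maxsockbuf=16777216   # runtime sysctl, effective immediately
# machdep.hyperthreading_allowed, net.isr.maxthreads and net.pf.source_nodes_hashsize
# are loader tunables: set them in the GUI (or /boot/loader.conf.local) and reboot.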
However, none of those seems to improve things significantly.
Current RAM usage never exceeds 3GB, which is a bit odd IMHO. While I'm aware that 64GB may well be quite a bit more than needed, 3GB on the other hand seems pretty low considering our rather big environment.
Do we really need better hardware or what other things are worth looking at to improve performance?
Regards
Thomas
You should probably look at "top" to find the process that is causing this. I doubt that the plain routing would cause that high load. It could be some kind of secondary cause, like Zenarmor or suricata or probably even just logging of default firewall rules.
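A quick sketch of what I usually run for that (nothing OPNsense-specific, plain FreeBSD top):
top -P -S -H -I
-P shows per-CPU usage, -S includes system (kernel) processes, -H shows threads and -I hides idle processes, so both userland hogs and busy kernel threads become visible.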
Quote from: meyergru on December 02, 2024, 11:48:56 AM
You should probably look at "top" to find the process that is causing this. I doubt that the plain routing would cause that high load. It could be some kind of secondary cause, like Zenarmor or suricata or probably even just logging of default firewall rules.
There is indeed one process that's quite noticeable:
/usr/local/sbin/lighttpd -f /var/etc/lighttpd-cp-zone-0.conf
It has an aggregated CPU time of 15 hours (uptime: 5 days) and constantly uses between 20% and 40% CPU. Looks like Captive Portal to me. While this is more than I expected, I would assume it remains somewhat constant and doesn't increase as traffic goes up...
I meant processes going up in usage while you have that kind of situation. I would think that excessive logging and processes that consume logs, like Zenarmor or CrowdSec, will then take more CPU cycles. Thus, reducing logging might fix it.
Quote from: meyergru on December 02, 2024, 01:10:18 PM
I meant processes going up in usage while you have that kind of situation. I would think that excessive logging and processes that consume logs, like Zenarmor or CrowdSec, will then take more CPU cycles. Thus, reducing logging might fix it.
We don't use Zenarmor, CrowdSec or anything known to take a lot of CPU... As I said, a handful of people connecting via OpenVPN (or trying to do so) plus the usual stuff (DHCP, DNS, NTP) - that's it. Firewall Logging is currently disabled and only used for debug purposes. Got to be missing something, but I don't know where to look... :-(
We need to wait for the next batch of updates to watch the system under heavy load. Besides [h]top - is there anything we should specifically look at?
You will most likely see the culprit when the situation arises.
Ok, here we go. We just ran into the issue for a few minutes. I attached the output of htop during the event (1.jpg). I just realized the sort column is poorly chosen, but maybe this gives a hint anyway...
Any ideas?
OT: How can I include attached pictures inside the posting?
There seem to be no processes using up all the CPU. Lighttpd is fine. I do not know what that configd process does.
I only see that you use netflow and a captive portal, maybe you should disable them to see if that fixes the problem. With Netflow, I have seen database corruptions that repeatedly hogged the CPU. There is a button to reset the netflow data, or you can disable that altogether.
Quote from: ThomasE on December 11, 2024, 10:52:31 AM
Ok, here we go. We just ran into the issue for a few minutes. I attached the output of htop during the event (1.jpg). I just realized the sort column is poorly chosen, but maybe this gives a hint anyway...
Any ideas?
OT: How can I include attached pictures inside the posting?
Copy the link once you upload the picture to forum > edit your post > click on Insert image > paste the link into it.
Netflow can be CPU heavy in some cases, as @meyergru mentions. Try to disable it, as well as any other additional services (shaper, captive portal, etc.).
Regards,
S.
Quote from: meyergru on December 11, 2024, 11:50:03 AM
There seem to be no processes using up all the CPU. Lighttpd is fine. I do not know what that configd process does.
The configd is a random occurrence. I assume that a colleague modified the configuration just when I was making that screenshot. I've been watching it for some time now - it's at 0.0% CPU all the time, so I think this can be safely ignored.
The fact that there are no processes using up all that CPU is what puzzles me most. Looking at the per-core CPU usage as shown in htop, all eight cores sit between 40% and 100%, and rarely, if ever, does one of them drop below that. On the other hand, the only process that's continually using more than 1% CPU (lighttpd -f /var/etc/lighttpd-cp-zone-0.conf) uses between 20% and 40%. The number of processes shown as "Running" is 2 most of the time (htop and lighttpd) and never exceeds 4 - yet the current load average is around 7. Every few seconds I can see a number of other processes showing up:
iftop --nNb -i vlan0.x.y -s2 -t
There're quite a few of them - obviously, because we have a lot of VLANs - but from what I can tell, they can only account for very short spikes in CPU usage - not what we currently observe. This picture shows some of the iftop processes which I'd consider "typical".
Bildschirmfoto_2024-12-17_11-06-42.png
Most of the time around three or four of those processes can be seen; sometimes there are none, sometimes there are up to 30.
Quote from: meyergru on December 11, 2024, 11:50:03 AM
I only see that you use netflow and a captive portal, maybe you should disable them to see if that fixes the problem. With Netflow, I have seen database corruptions that repeatedly hogged the CPU. There is a button to reset the netflow data, or you can disable that altogether.
I've disabled netflow for the time being as we don't really need it, though it's certainly nice to have. No change.
Bildschirmfoto_2024-12-17_09-52-00.png
Captive Portal is right on top of our list of "suspects", though I'm still unsure. We have a few hundred concurrent sessions. Yes, that's quite a bit, but then again it's not that much, I think. What might be part of the problem is that our CP has a very high general "availability": there are almost 100 buildings scattered throughout the whole city allowing unauthenticated access to the WiFi that leads to our CP. I would assume that not too many people randomly try to actively connect to an open WiFi just because "it's there", but I'm not sure what their smartphones are doing in the background.
You've already done a lot of performance tuning! Given your setup, here are a few additional suggestions:
Check for Software Updates: Ensure that all your software, including the OS and OPNsense, is up to date. Sometimes performance improvements are included in updates.
Optimize DNS and NTP Settings: Fine-tune your DNS and NTP configurations to ensure they're not causing unnecessary load.
Monitor CPU and Memory Usage: Use tools like htop or top to monitor real-time CPU and memory usage. This can help identify any processes that are consuming more resources than expected.
Consider Load Balancing: If possible, distribute the load across multiple servers to prevent any single server from becoming a bottleneck.
Evaluate Network Configuration: Double-check your network settings to ensure there are no misconfigurations causing unnecessary traffic or delays.
If these steps don't help, it might be worth considering hardware upgrades or consulting with a performance specialist to identify any underlying issues.
Good luck, and I hope this helps!
8 Ice Lake cores, and Lighttpd is eating 50%? Whew. Definitely the captive portal, but that's still a lot of CPU. Pure (and uninformed) speculation, but could network queueing be bottling up one core or particular cores? "netstat -Q" only gives that info when you have RSS enabled - I'm not sure how to determine it otherwise. htop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
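A rough way to check at least the per-core split (plain top, nothing fancy):
top -P -S
That breaks each core down into user/nice/system/interrupt/idle, so it should show whether interrupt time is spread out or stuck on one or two cores.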
Quote from: pfry on January 05, 2025, 01:10:21 AM
8 Ice Lake cores, and Lighttpd is eating 50%? Whew. Definitely the captive portal, but that's still a lot of CPU.
Sadly, it's definitely not the CP. We've set up an even better machine with the exact same configuration in a clean lab environment, so there is no CP and no other clients doing anything. We placed two test machines into two different VLANs and ran iperf for testing. Within seconds the load average reported by top goes up beyond 20.
Quote
Pure (and uninformed) speculation, but could network queueing be bottling up one core or particular cores? "netstat -Q" only gives that info when you have RSS enabled - I'm not sure how to determine it otherwise.
From what I can tell, the traffic handling is spread equally over all CPUs; netstat -Q seems to confirm that:
Workstreams:
WSID CPU Name Len WMark Disp'd HDisp'd QDrops Queued Handled
0 0 ip 135 1000 0 45706 201 7240161 7285548
0 0 igmp 0 0 0 0 0 0 0
0 0 rtsock 0 0 0 0 0 0 0
0 0 arp 0 0 2019 0 0 0 2019
0 0 ether 0 0 123022625 0 0 0 123022625
0 0 ip6 0 0 0 0 0 0 0
0 0 ip_direct 0 0 0 0 0 0 0
0 0 ip6_direct 0 0 0 0 0 0 0
1 1 ip 2 1000 0 41053 1569 8672745 8713708
1 1 igmp 0 0 0 0 0 0 0
1 1 rtsock 0 0 0 0 0 0 0
1 1 arp 0 0 0 0 0 0 0
1 1 ether 0 0 128714595 0 0 0 128714595
1 1 ip6 0 0 0 0 0 0 0
1 1 ip_direct 0 0 0 0 0 0 0
1 1 ip6_direct 0 0 0 0 0 0 0
2 2 ip 286 1000 0 141216 2463 8082614 8223246
2 2 igmp 0 0 0 0 0 0 0
2 2 rtsock 0 0 0 0 0 0 0
2 2 arp 0 0 0 0 0 0 0
2 2 ether 0 0 132776645 0 0 0 132776645
2 2 ip6 0 0 0 0 0 0 0
2 2 ip_direct 0 0 0 0 0 0 0
2 2 ip6_direct 0 0 0 0 0 0 0
(It goes on like that for all other cores...)
Quote
htop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
The interrupt stat as shown by top is indeed interesting - it's between 80% and 90%! Then there's less than 1% user, a bit more than 1% system and the remaining ~15% is shown as idle.
We've already played a bit with the tunables including, but not limited to:
dev.ixl.0.iflib.override_nrxqs=32
dev.ixl.0.iflib.override_ntxqs=32
dev.ixl.1.iflib.override_nrxqs=32
dev.ixl.1.iflib.override_ntxqs=32
machdep.hyperthreading_allowed=0
net.inet.ip.fw.dyn_buckets=16777216
net.inet.ip.fw.dyn_max=16777216
net.inet.rss.enabled=1
net.isr.maxthreads=-1
net.pf.source_nodes_hashsize=1048576
I do admit that we don't fully (or sometimes at all) understand what those optimizations do, but we encountered them while reading various guides and tried them. However, setting those seemed to have very, very little effect at best. Updating NIC drivers on our Intel card led to a somewhat reduced total throughput, going down from around 8Gbit/s to about 6Gbit/s, along with what seems to be a slightly lower load average, but we didn't do any precise measurements there. We also tried using a Broadcom NIC instead of Intel - no change.
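Side note, in case it helps anyone trying the same: the RSS and iflib settings above are boot-time tunables, so it's worth reading them back after a reboot to make sure they actually took effect, e.g.:
sysctl net.inet.rss.enabled net.isr.maxthreads
sysctl dev.ixl.0.iflib.override_nrxqs dev.ixl.0.iflib.override_ntxqs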
Quote from: ThomasE on January 10, 2025, 02:20:35 PM
Sadly, it's definitely not the CP.
[...]
What's the top process on your new machine?
Crud. I have the hardware to test, but no bench space and no software. It'll be a while before I can test higher than 1Gb, and this issue interests me.
While I'm posting useless text, ixl... are you up to date? x710 NVM updater (generic) (https://www.intel.com/content/www/us/en/download/18190/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-700-series.html) (Minimal changes from 9.52.) (Updated firmware is critical for DPDK, but less so in other applications. I was testing DANOS/Vyatta and VPP, so I got into the habit.)
Quote from: ThomasE on January 10, 2025, 02:20:35 PM...
Quote
htop doesn't seem to have top's "interrupt" stat (unless I'm misreading it), which may or may not be helpful. I just don't have a heavily loaded device to look at.
The interrupt stat as shown by top is indeed interesting - it's between 80% and 90%! Then there's less than 1% user, a bit more than 1% system and the remaining ~15% is shown as idle.
...
I have no clue how to dig deeper but that looks concerning.
If you have not done so yet, 'vmstat -i' and 'systat -vmstat' seem to be the next step wrt finding the device triggering the interrupts.
Quote from: EricPerl on January 14, 2025, 02:43:40 AM
If you have not done so yet, 'vmstat -i' and 'systat -vmstat' seem to be the next step wrt finding the device triggering the interrupts.
Ok, so here we go...
vmstat -i
interrupt total rate
cpu0:timer 346081709 999
cpu1:timer 5389274 16
cpu2:timer 5441949 16
cpu3:timer 5441140 16
cpu4:timer 5466498 16
cpu5:timer 5543676 16
cpu6:timer 5480325 16
cpu7:timer 5598821 16
cpu8:timer 5212744 15
cpu9:timer 5198477 15
cpu10:timer 5221979 15
cpu11:timer 5162333 15
cpu12:timer 5261906 15
cpu13:timer 5317179 15
cpu14:timer 5368505 15
cpu15:timer 7457476 22
cpu16:timer 5158599 15
cpu17:timer 5146936 15
cpu18:timer 5188516 15
cpu19:timer 5163081 15
cpu20:timer 5173798 15
cpu21:timer 5262110 15
cpu22:timer 5264156 15
cpu23:timer 5328887 15
cpu24:timer 5351694 15
cpu25:timer 5328853 15
cpu26:timer 5348923 15
cpu27:timer 5352703 15
cpu28:timer 5390198 16
cpu29:timer 5447410 16
cpu30:timer 5463452 16
cpu31:timer 7490578 22
irq112: ahci0 52 0
irq113: xhci0 5394722 16
irq115: igb0:rxq0 89822 0
irq116: igb0:rxq1 278626 1
irq117: igb0:rxq2 3747 0
irq118: igb0:rxq3 1343 0
irq119: igb0:rxq4 9842 0
irq120: igb0:rxq5 228 0
irq121: igb0:rxq6 543 0
irq122: igb0:rxq7 711 0
irq123: igb0:aq 2 0
irq331: bxe0:sp 346413 1
irq332: bxe0:fp00 18789253 54
irq333: bxe0:fp01 18676967 54
irq334: bxe0:fp02 17816764 51
irq335: bxe0:fp03 17740225 51
irq336: bxe1:sp 347275 1
irq337: bxe1:fp00 21428716 62
irq338: bxe1:fp01 21227816 61
irq339: bxe1:fp02 20153853 58
irq340: bxe1:fp03 20304599 59
Total 678115404 1957
And this is the systat under load.
4 users Load 26.02 17.48 8.97 Jan 14 09:48:32
Mem usage: 2%Phy 2%Kmem VN PAGER SWAP PAGER
Mem: REAL VIRTUAL in out in out
Tot Share Tot Share Free count
Act 2180M 98768K 518G 148M 364G pages
All 2198M 113M 518G 262M ioflt Interrupts
Proc: 207 cow 272k total
r p d s w Csw Trp Sys Int Sof Flt 465 zfod 1126 cpu0:timer
170 651K 1K 2K 232K 53K 1K ozfod 1127 cpu1:timer
%ozfod 1127 cpu2:timer
0.8%Sys 77.6%Intr 0.1%User 0.0%Nice 21.5%Idle daefr 1127 cpu3:timer
| | | | | | | | | | | 241 prcfr 1067 cpu4:timer
+++++++++++++++++++++++++++++++++++++++ 855 totfr 1048 cpu5:timer
dtbuf react 1045 cpu6:timer
Namei Name-cache Dir-cache 6280561 maxvn pdwak 1073 cpu7:timer
Calls hits % hits % 441406 numvn 50 pdpgs 1083 cpu8:timer
2847 2843 100 357588 frevn intrn 1077 cpu9:timer
6438M wire 1103 cpu10:time
Disks da0 cd0 pass0 pass1 pass2 pass3 134M act 1025 cpu11:time
KB/t 40.74 0.00 0.00 0.00 0.00 0.00 2251M inact 1104 cpu12:time
tps 21 0 0 0 0 0 0 laund 1086 cpu13:time
MB/s 0.82 0.00 0.00 0.00 0.00 0.00 364G free 1077 cpu14:time
%busy 59 0 0 0 0 0 57K buf 1075 cpu15:time
1110 cpu16:time
1081 cpu17:time
1080 cpu18:time
1092 cpu19:time
1075 cpu20:time
1062 cpu21:time
1037 cpu22:time
1085 cpu23:time
1101 cpu24:time
1072 cpu25:time
1072 cpu26:time
1074 cpu27:time
1070 cpu28:time
1115 cpu29:time
1077 cpu30:time
1085 cpu31:time
ahci0 112
68 xhci0 113
igb0:rxq0
28 igb0:rxq1
igb0:rxq2
igb0:rxq3
igb0:rxq4
igb0:rxq5
igb0:rxq6
igb0:rxq7
igb0:aq
1 bxe0:sp
28378 bxe0:fp00
22496 bxe0:fp01
24798 bxe0:fp02
35004 bxe0:fp03
1 bxe1:sp
29363 bxe1:fp00
29687 bxe1:fp01
29310 bxe1:fp02
38388 bxe1:fp03
Cute! I have to note those...
Are you logging to a USB flash device? Or am I misreading that? If so, might be worth reducing storage chatter and see what happens.
I'm quite outside of my area of expertise here but:
vmstat -i is cumulative since the system is up.
Yes it looks like some USB controller got busy but it's not during systat.
In this output, what strikes me is the uneven cpu0:timer compared to the others.
The 2nd output is live (refreshed every X secs).
The 2 BXE devices seem pretty busy. Broadcom NICs?
Some level of busy should be expected under load but that much?
Some of the optimization work might have been counterproductive....
Also, it might be worth looking at the details of the slots used on the MB: PCI gen, lanes, exclusions...
I don't have OPN on bare metal and these low-level tools tend to be pretty distro specific...
Quote
Cute! I have to note those...
Are you logging to a USB flash device? Or am I misreading that? If so, might be worth reducing storage chatter and see what happens.
There's no USB device attached and we're only logging critical errors as everything above that is guaranteed to severely overload the system. ;-)
Quote
I'm quite outside of my area of expertise here but:
So am I, so welcome to the club. ;-)
Quote
vmstat -i is cumulative since the system is up.
Yes it looks like some USB controller got busy but it's not during systat.
In this output, what strikes me is the uneven cpu0:timer compared to the others.
I must admit that I didn't even notice that. Maybe the system defaults to cpu0 and only does round robin (or whatever) on the other cores if that one is busy?
Quote
The 2nd output is live (refreshed every X secs).
The 2 BXE devices seem pretty busy. Broadcom NICs?
Some level of busy should be expected under load but that much?
Yes, Broadcom NICs. Intel NICs behave pretty much the same. At that time I was using
iperf3 -c 10.199.0.150 -p 5201 -P 128 -t 120
to push around 8Gbit/s of traffic through those interfaces, so yes, that is quite a bit. However, and that is where we get back to the original question, I think this machine should be able to handle that with ease - especially if it's the only thing going on there...
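(For anyone reproducing this: the other test machine just runs the standard iperf3 server side, i.e. iperf3 -s -p 5201.)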
Quote
Some of the optimization work might have been counterproductive....
Acknowledged, though as far as I can tell, none of it seemed to have any noticeable effect at all.
Quote
Also, it might be worth looking at the details of the slots used on the MB: PCI gen, lanes, exclusions...
I will have a look though I admit that I don't have a clue what exactly to look for. Maybe - though not very likely - I will know once I see it. :)
Quote from: ThomasE on January 15, 2025, 08:33:46 AM
I will have a look though I admit that I don't have a clue what exactly to look for. Maybe - though not very likely - I will know once I see it. :)
Any Intel later than Ivy Bridge (2012) would be all v3+. I wouldn't expect PCI-e sharing or lane limits to be an issue, but it can't hurt to look. What motherboard model, with what cards?
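If you'd rather check from software than from the manual, a quick sketch (stock FreeBSD, run on the firewall itself):
pciconf -lvc
The "PCI-Express" capability line of each NIC shows the negotiated link; e.g. something like "link x4(x8)" would mean the card trained at four lanes although it supports eight.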
In addition to MB and NIC models, it's important to mention which device (not just the NICs) is connected into each slot.
The interrupt level seemed well distributed during the test (at least for the interval captured).
I'd keep an eye on load distribution to try and understand the discrepancy on the cumulative view.
You seem to have a spare. Any chance to install Proxmox and a virtualized OPN?
The drivers under Debian might be better.
This said, this could wait until we look at hardware configuration.
Something must be wrong for lighttpd to be using that much CPU and your system not to be under heavy request load. Is there any information in the lighttpd error log? Is there a high request load visible in the lighttpd access log? Can you get an strace or truss of the lighttpd process and share it? That CPU usage is aberrant. I am a lighttpd developer but I do not have an opnsense test system. If you have a debugger installed on the system, a few stack traces might also be useful. `echo bt full | gdb -p $lighttpd_pid`
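If truss is easier to come by than a debugger, something along these lines should do (replace $lighttpd_pid with the actual PID, let it run for a minute under load, then stop it with Ctrl-C):
truss -f -o /tmp/lighttpd.truss -p $lighttpd_pid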
Per reply #12, the CPU load appears to come from processing interrupts. CP and lighttpd were ruled out.
Reply #2, #6, and #9 showed htop with the lighttpd process having taken a large amount of CPU time, so I wonder if that is contributing.
On the off chance that there is some interaction with openssl KTLS on your large system, which might also have TLS acceleration hardware used by openssl drivers, please *test* with KTLS disabled in lighttpd, as lighttpd mod_openssl uses KTLS by default on Linux and FreeBSD, when available. lighttpd.conf: `ssl.openssl.ssl-conf-cmd += ("Options" => "-KTLS")`, or you can *test* disabling KTLS system-wide on FreeBSD `sysctl kern.ipc.tls.enable=0`
Hi,
thanks for all the input you gave. For now, I think we mitigated (rather than solved) the problem simply by throwing more hardware at it. We now have an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads) with 384GB of RAM. I'd assume that this setup is severely overpowered. Although we still have a load average around 15 during normal operations, the system doesn't go down under pressure anymore, which was our primary goal.
The current solution is only temporary as we're planning to get a DEC4280 Appliance in the near future. Should the problem persist after that, I'll come back... ;-)
Thomas
:)