Messages - ThomasE

#1
Hardware and Performance / Re: High CPU-load
June 04, 2025, 01:37:15 PM
Quote from: meyergru on June 04, 2025, 10:40:23 AM
Initially, you said that a large number of updates causes the problem, not that it occurs out of the blue. When you deploy a new instance, this boils down to a high initial load with OpenVPN, because of the certificate handshakes.
Correct, that's where the whole thing started: We installed OPNsense on our somewhat older server hardware (8 cores, 16 threads, 128GB RAM). For the most part this worked just fine, but we had some issues during traffic spikes. After our attempts to solve the problem via tuning failed, we switched to the best server hardware available to us: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads), 384 GB RAM, you get the idea. This was meant to be a temporary solution as this hardware seemed way too much for that purpose and was intended for running 20+ virtual machines instead of just one firewall - and we'd need two of those machines for redundancy.

In order to rule out any hardware or driver issues, we decided to get the appliance - which performs much worse than our old server. :-(

Quote
Anyway, if the problem occurs even without specific traffic spikes, it seems to be the pure number of VLANs involved. I would argue that it is quite unusual to have that many.
I agree with you that this is indeed somewhat unusual, but that's what we've got... ;-)

Quote
Probably the routing table hashing mechanism in FreeBSD is not optimized for that (or can be tuned by enlarging the memory for it). As I said, I am by no means a FreeBSD expert, but I saw that you can even set the routing algorithm, see "sysctl net.route.algo.inet.algo".
My knowledge about FreeBSD is even more limited, but this looks like a good starting point for some more research... :)
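
For my own notes (and anyone else who lands here), this is roughly what I intend to try first - purely a sketch based on the FreeBSD fib algorithm sysctls mentioned above, so the names and the available values should be double-checked on the running system:

# show the routing lookup algorithm currently in use for IPv4
sysctl net.route.algo.inet.algo

# list the algorithms this kernel offers (differs per FreeBSD version)
sysctl net.route.algo.inet.algo_list

# switch the IPv4 algorithm at runtime, e.g. to radix4_lockless
sysctl net.route.algo.inet.algo=radix4_lockless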

Thanks
Thomas
#2
Hardware and Performance / Re: High CPU-load
June 04, 2025, 10:13:03 AM
Quote from: meyergru on June 04, 2025, 12:49:10 AM
So basically what happens is that 300 VLANs - which presumably connect to a similar number of client machines - use VPN connections, capped at 4 GBit/s total. When all of those act in parallel, the problem occurs.
The problem already occurs with no traffic at all. With the exception of one 1Gbit/s interface solely used for administration and accessing the GUI, all other interfaces were physically disconnected. (They were - of course - enabled in the configuration.) There were some VPN servers configured and activated, but they weren't being used. To be precise, we have two legacy OpenVPN servers, one "new" OpenVPN instance for testing purposes and one WireGuard instance, also for testing. Apart from that, everything else is simple routing/NAT. The firewall table is at a mere 2% (18875/1000000) of its entry limit.
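
(I'm honestly not sure whether that dashboard figure refers to the pf state table or to the alias tables, so for completeness, both can be read from the shell - just a quick sanity check, nothing more:)

# state table usage and the configured limits
pfctl -si | grep -i entries
pfctl -sm

# per-table (alias) entry counts
pfctl -vvsT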

Quote
Even then, there is no single user process that the load can be attributed to. Thus, I would guess that the VPN calculations are the culprit. Those may be delegated to the FreeBSD kernel. The interrupt load could also point to that (maybe induced by context switches).
While the first sentence is entirely true, there shouldn't be any VPN calculations whatsoever, as VPN wasn't even being used and won't ever be used to a greater extent. Even in production, there are at most 10 OpenVPN connections (road warrior).

Quote
Depending on how the VPN infrastructure is built (site-2-site vs. client-2-site), you probably could use different VPN technologies (e.g. Wireguard) or employ different ciphers, which may lead to better parallelism, if my guess should turn out to be the underlying problem.
I do agree that a significant number of established VPN connections might indeed be an issue, but this is not the case.
#3
Hardware and Performance / Re: High CPU-load
June 03, 2025, 01:44:03 PM
We finally got our DEC4280 appliance and gave it a try. After installing all available updates we imported our original configuration (with some slight changes to match the new device names). Bootup took around 15 minutes - a bit longer than usual, but that's OK. Even with just one network interface connected for administration, the GUI was extremely slow and the whole system was close to being inoperable. A simple change to a firewall rule could take as long as a minute to apply. At this time there was absolutely no traffic being routed, no captive portal clients trying to connect - there was nothing at all!

So we have the best appliance available but the system won't even run our configuration without any network load? I'm aware that our setup is quite big, but is it really that much beyond what OPNsense can possibly handle? After all, it's not the network traffic that causes issues, and we aren't even thinking about things like intrusion detection - the only thing we've got is a lot of interfaces...
#4
As the integrated netflow feature only supports up to 8 interfaces simultaneously, we decided to set up an external server to collect netflow data for further processing. Since our hardware is quite capable (we thought), we activated netflow on all ~200 interfaces (mostly VLANs) at once, which basically crashed the whole system. Of course, the primary lesson of this is to never, ever, and under no circumstances do anything on more than 10 interfaces at once - unless you're begging for trouble and feel a really strong urge to bring in cake the following day, which is how we usually deal with colleagues accidentally breaking something. ;-)

But seriously, how much load is to be expected per interface sending netflow data to an external server? Does it depend on the amount of traffic on that interface, or is that irrelevant? Is activating netflow on literally hundreds of interfaces something that a well-equipped system should be able to handle, or is it way beyond what even a powerful system can do?

At the moment this is not about identifying problems and finding tuning options to solve them - it's about making sure that what we want to do is something that actually can be done.
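
In case it matters for the answer: our plan is to re-enable netflow on a small batch of interfaces and simply watch where the CPU time goes while adding more - nothing fancy, just something along these lines:

# live view of busy kernel and user threads, idle ones hidden
top -SHPz

# overall interrupt and context-switch rates, refreshed every 5 seconds
vmstat 5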

Thanks
Thomas
#5
Quote from: doktornotor on April 10, 2025, 09:31:01 AM
I'd say the proper place is /usr/local/opnsense/service/templates/OPNsense/WebGui/php.ini - followed by

configctl webgui restart

for changes to have effect.

Thanks! That was it! I had already identified that file and modified it, but I did a
service configd restart

as suggested by - among others - ChatGPT, rather than a

configctl webgui restart

The latter did it. :)

Thanks.
#6
Hi,

I'm currently getting the following error message whenever I hit "Apply Changes" in the interface configuration:

PHP Fatal error:  Maximum execution time of 300 seconds exceeded in /usr/local/etc/inc/plugins.inc.d/ipsec.inc on line 144
Occasionally it's in a different file, but I don't think that matters. After some examination I have a good idea of what's happening, and I know what I did to cause it. If I'm not totally wrong, I also know a way that should fix this once and for all - I just don't know how to do it.

I was tasked with reconfiguring some 80 interfaces - a rather simple change of the respective interface IPs. Applying the changes after reconfiguring a single interface takes between 30 and 60 seconds. I don't know why it takes so long, but it doesn't really matter: we have a rather big setup which is working just fine, and this kind of change is not something we do frequently. Rather than babysitting the process every single minute or so, I figured it might be a good idea to do a bulk change, i.e. reconfigure a larger number of interfaces, hit "Apply changes" once and get a much longer stretch of time to do something else. This was a bad idea, because after some time I get the error message mentioned above. About 20 interfaces had been reconfigured as expected - the others hadn't. Hitting "Apply changes" again starts the whole process from the very beginning, which I suppose is "by design".

I can think of three different approaches to this problem - in order of preference:

  • Applying changes to each interface individually on the CLI.
  • Temporarily increasing PHP's max_execution_time.
  • Rebooting the system.

So far I haven't found a way to do (1), and I'm open to suggestions. As for (2), I have tried modifying max_execution_time in /usr/local/etc/php.ini as well as in /usr/local/lib/php.ini, which are the only places where max_execution_time=300 is being set. (I changed the value to 3000, as this is only meant to be a temporary fix.) However, that doesn't seem to change anything. Rebooting the system (3) would cause a downtime that I'd like to avoid, so it will be my last option if everything else fails.
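
(For reference, this is the kind of search I used to locate the files mentioned above - adjust the paths as you see fit:)

# look for every file that sets max_execution_time
grep -rn 'max_execution_time' /usr/local/etc /usr/local/lib 2>/dev/null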

Thanks
Thomas
#7
Sadly, I just found out why there are no more error messages: The whole CP is down... :-(

According to the log files, the service is starting normally, but ports 8000 and 9000 are closed. We checked the limits (ulimit -a) for root and www users - no problem there. Any ideas?
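
For what it's worth, a quick way to double-check whether anything is listening on those ports (nothing OPNsense-specific, just standard FreeBSD tooling):

# list listening sockets and filter for the captive portal ports
sockstat -l | grep -E ':(8000|9000)'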

But there're even more strange things...

ls -lh /var/etc/lighttpd-cp-zone-0.conf
-rw-r--r--  1 root wheel  1.2M Mar 24 08:15 /var/etc/lighttpd-cp-zone-0.conf
Notice the file size. At the beginning of the file, there's one extremely long line consisting of non-printing characters. After that, there's this:

              # ssl enabled, redirect to https
                       
#############################################################################################
###  Captive portal zone 0 lighttpd.conf  BEGIN
###  -- listen on port 8000 for primary (ssl) connections
###  -- forward on port 9000 for plain http redirection
#############################################################################################
#
#### modules to load
server.modules           = ( "mod_expire",
                             "mod_auth",
                             "mod_redirect",
                             "mod_access",
                             "mod_deflate",
                             "mod_status",
                             "mod_rewrite",
                             "mod_proxy",
                             "mod_setenv",
Nothing suspicious after that. Now I don't think this has anything to do with our original problem, but it still looks strange to me...
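
If it helps with the diagnosis, I can dump the beginning of that file to see what the invisible characters actually are, e.g.:

# show the first 256 bytes of the generated config as hex + ASCII
head -c 256 /var/etc/lighttpd-cp-zone-0.conf | hexdump -C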
#8
Well, I think it's safe to assume those limits are there for a reason, ranging anywhere from "Nothing happens; you may just waste some resources that nobody really cares about on any modern system." to "Setting this too high may eventually crash the whole thing." - and since I'm on a production system, I prefer being a bit too careful rather than sorry. ;-)

Anyway, it looks like "Mission accomplished!" to me. I kept doubling those numbers until the messages disappeared. I ended up with

## number of file descriptors (leave off for lighty loaded sites)
server.max-fds         = 131072

## maximum concurrent connections the server will accept (1/2 of server.max-fds)
server.max-connections = 65536

which I suppose gives you an idea of just how busy our CP is. ;-)
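
(In case the absolute numbers are of interest, this is roughly how I keep an eye on whether we get anywhere near those limits - the fstat/pgrep combination is just my own idea of a sensible check, not something from the docs:)

# system-wide open files vs. the kernel limit
sysctl kern.openfiles kern.maxfiles

# descriptors held by the (oldest) lighttpd process
fstat -p $(pgrep -o lighttpd) | wc -l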

Besides, I'm not sure whether it's related to this change or just to it being Friday afternoon, but the load average on the whole system seems to be down significantly - from ~15 to ~3.

Thanks and have a nice weekend!
Thomas

#9
Hi Franco,

looks like that helped... a little bit... maybe... ;-)

There are still a lot of messages, but the rate seems to be somewhat slower - down by roughly a third, I would guess. I do assume that this is because of the applied patch - not because of less traffic on the captive portal. Is it safe to simply increase those numbers until everything works? By that I mean applying a factor of 2, 4, 8 or perhaps 16 at most - not hundreds or thousands. ;-)
#10
Great, I'd be more than happy to give it a try and tell you what happened. ;-)

I found the options in

/var/etc/lighttpd-cp-zone-0.conf

where they're unset. Of course I could simply edit that file, but I'm afraid the change will be overridden as soon as I make any changes to the CP configuration, if not even sooner. So what would be the best place to make this setting persistent?
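
My current guess is that the file is generated from a template somewhere under /usr/local/opnsense/service/templates, so I'd try to locate the source like this (purely a guess on my part, corrections welcome):

# find the template the captive portal lighttpd config is generated from
grep -rln 'max-fds' /usr/local/opnsense/service/templates 2>/dev/null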

Thanks!
Thomas
#11
The process is lighttpd, /usr/obj/usr/ports/www/lighttpd/work/lighttpd-1.4.76/src/mod_openssl.c.{3510|3470}. (I found those two numbers in a single screenshot. There may be others. ;-)
#12
24.7, 24.10 Legacy Series / SSL errors on console...
March 13, 2025, 09:40:14 AM
Hi there,

we're running OPNsense 24.10 and we believe that the following problem comes from our captive portal setup:

We're literally getting hundreds of SSL error messages every minute logged to the console, which basically makes it completely unusable. We can - and in most cases do - log in via SSH and do everything from there, so it's not that big of an issue. But seriously, it is my understanding that only really critical messages should be logged to the console. Unless I'm completely mistaken and those messages really do indicate that something is going badly wrong and we must do something about it, I'm looking for a way to simply get rid of them by sending them to a log file only.

Here are a few examples of what's being logged:


error: 0A000417: SSL routines::sslv3 alert illegal parameter
error: 0A000102: SSL routines::unsupported protocol
error: 0A0000EB: SSL routines::no application protocol
error: 0A000076: SSL routines::no suitable signature algorithm
error: 0A00010B: SSL routines::wrong version number

Those seem to be by far the most frequent error messages, but there are others, too. They come at a rate of ~200 per minute, so it's really quite a bit. The system itself seems to be working just fine - except maybe for a somewhat high load average, which we think is unrelated to this problem.
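
In case the distribution matters, this is how I'd tally which variants are the most frequent (assuming the messages also end up in the system log - the path below is a guess on my part):

# count the individual SSL error variants
grep -oh 'SSL routines::[a-z ]*' /var/log/system/latest.log | sort | uniq -c | sort -rn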

Any suggestions what we could do about it?

Thanks
Thomas
:)
#13
Hardware and Performance / Re: High CPU-load
March 13, 2025, 08:59:12 AM
Hi,

thanks for all the input that you gave. For now, I think we mitigated (rather than solved) the problem simply by throwing more hardware at it. We now have an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads) with 384GB of RAM. I'd assume that this setup is severely overpowered. Although we still have a load average of around 15 during normal operations, the system doesn't go down under pressure anymore, which was our primary goal.

The current solution is only temporary as we're planning to get a DEC4280 Appliance in the near future. Should the problem persist after that, I'll come back... ;-)

Thomas
:)
#14
Hardware and Performance / Re: High CPU-load
January 15, 2025, 08:33:46 AM
Quote
Cute! I have to note those...

Are you logging to a USB flash device? Or am I misreading that? If so, might be worth reducing storage chatter and see what happens.
There's no USB device attached, and we're only logging critical errors, as anything more verbose is guaranteed to severely overload the system. ;-)

Quote
I'm quite outside of my area of expertise here but:
So am I, so welcome to the club. ;-)

Quote
vmstat -i is cumulative since the system is up.
Yes, it looks like some USB controller got busy, but it's not during systat.
In this output, what strikes me is the uneven cpu0:timer compared to the others.
I must admit that I didn't even notice that. Maybe the system defaults to cpu0 and only does round-robin (or whatever) on the other cores if that one is busy?

Quote
The 2nd output is live (refreshed every X secs).
The 2 BXE devices seem pretty busy. Broadcom NICs?
Some level of busy should be expected under load, but that much?
Yes, Broadcom NICs. Intel NICs behave pretty much the same. At that time I was using

iperf3 -c 10.199.0.150 -p 5201 -P 128 -t 120
to push around 8Gbit/s of traffic through those interfaces, so yes, that is quite a bit. However - and that is where we get to the original question - I think this machine should be able to handle this with ease, especially if it's the only thing going on there...

Quote
Some of the optimization work might have been counterproductive...
Acknowledged, though as far as I can tell, none of it seemed to have any noticeable effect at all.

Quote
Also, it might be worth looking at the details of the slots used on the MB: PCI gen, lanes, exclusions...
I will have a look, though I admit that I don't have a clue what exactly to look for. Maybe - though not very likely - I will know it once I see it. :)
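
For my own reference, I'll probably start with something like the following to see what PCIe generation and lane width the NICs actually negotiated (bxe0/bxe1 being our Broadcom ports - the grep window is just a rough guess):

# dump PCI devices including their capabilities; the PCI-Express capability
# line shows the negotiated link speed and width
pciconf -lvc | grep -A12 '^bxe'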
#15
Hardware and Performance / Re: High CPU-load
January 14, 2025, 09:50:34 AM
Quote from: EricPerl on January 14, 2025, 02:43:40 AM
If you have not done so yet, 'vmstat -i' and 'systat -vmstat' seem to be the next step wrt finding the device triggering the interrupts.
Ok, so here we go...

vmstat -i
interrupt                          total       rate
cpu0:timer                     346081709        999
cpu1:timer                       5389274         16
cpu2:timer                       5441949         16
cpu3:timer                       5441140         16
cpu4:timer                       5466498         16
cpu5:timer                       5543676         16
cpu6:timer                       5480325         16
cpu7:timer                       5598821         16
cpu8:timer                       5212744         15
cpu9:timer                       5198477         15
cpu10:timer                      5221979         15
cpu11:timer                      5162333         15
cpu12:timer                      5261906         15
cpu13:timer                      5317179         15
cpu14:timer                      5368505         15
cpu15:timer                      7457476         22
cpu16:timer                      5158599         15
cpu17:timer                      5146936         15
cpu18:timer                      5188516         15
cpu19:timer                      5163081         15
cpu20:timer                      5173798         15
cpu21:timer                      5262110         15
cpu22:timer                      5264156         15
cpu23:timer                      5328887         15
cpu24:timer                      5351694         15
cpu25:timer                      5328853         15
cpu26:timer                      5348923         15
cpu27:timer                      5352703         15
cpu28:timer                      5390198         16
cpu29:timer                      5447410         16
cpu30:timer                      5463452         16
cpu31:timer                      7490578         22
irq112: ahci0                         52          0
irq113: xhci0                    5394722         16
irq115: igb0:rxq0                  89822          0
irq116: igb0:rxq1                 278626          1
irq117: igb0:rxq2                   3747          0
irq118: igb0:rxq3                   1343          0
irq119: igb0:rxq4                   9842          0
irq120: igb0:rxq5                    228          0
irq121: igb0:rxq6                    543          0
irq122: igb0:rxq7                    711          0
irq123: igb0:aq                        2          0
irq331: bxe0:sp                   346413          1
irq332: bxe0:fp00               18789253         54
irq333: bxe0:fp01               18676967         54
irq334: bxe0:fp02               17816764         51
irq335: bxe0:fp03               17740225         51
irq336: bxe1:sp                   347275          1
irq337: bxe1:fp00               21428716         62
irq338: bxe1:fp01               21227816         61
irq339: bxe1:fp02               20153853         58
irq340: bxe1:fp03               20304599         59
Total                          678115404       1957

And this is the systat -vmstat output under load.

    4 users    Load 26.02 17.48  8.97                  Jan 14 09:48:32
   Mem usage:   2%Phy  2%Kmem                           VN PAGER   SWAP PAGER
Mem:      REAL           VIRTUAL                        in   out     in   out
       Tot   Share     Tot    Share     Free   count
Act  2180M  98768K    518G     148M     364G   pages
All  2198M    113M    518G     262M                       ioflt  Interrupts
Proc:                                                 207 cow    272k total
  r   p   d    s   w   Csw  Trp  Sys  Int  Sof  Flt   465 zfod   1126 cpu0:timer
             170      651K   1K   2K 232K  53K   1K       ozfod  1127 cpu1:timer
                                                         %ozfod  1127 cpu2:timer
 0.8%Sys  77.6%Intr  0.1%User  0.0%Nice 21.5%Idle         daefr  1127 cpu3:timer
|    |    |    |    |    |    |    |    |    |    |   241 prcfr  1067 cpu4:timer
+++++++++++++++++++++++++++++++++++++++               855 totfr  1048 cpu5:timer
                                           dtbuf          react  1045 cpu6:timer
Namei     Name-cache   Dir-cache   6280561 maxvn          pdwak  1073 cpu7:timer
   Calls    hits   %    hits   %    441406 numvn       50 pdpgs  1083 cpu8:timer
    2847    2843 100                357588 frevn          intrn  1077 cpu9:timer
                                                    6438M wire   1103 cpu10:time
Disks   da0   cd0 pass0 pass1 pass2 pass3            134M act    1025 cpu11:time
KB/t  40.74  0.00  0.00  0.00  0.00  0.00           2251M inact  1104 cpu12:time
tps      21     0     0     0     0     0               0 laund  1086 cpu13:time
MB/s   0.82  0.00  0.00  0.00  0.00  0.00            364G free   1077 cpu14:time
%busy    59     0     0     0     0     0             57K buf    1075 cpu15:time
                                                                 1110 cpu16:time
                                                                 1081 cpu17:time
                                                                 1080 cpu18:time
                                                                 1092 cpu19:time
                                                                 1075 cpu20:time
                                                                 1062 cpu21:time
                                                                 1037 cpu22:time
                                                                 1085 cpu23:time
                                                                 1101 cpu24:time
                                                                 1072 cpu25:time
                                                                 1072 cpu26:time
                                                                 1074 cpu27:time
                                                                 1070 cpu28:time
                                                                 1115 cpu29:time
                                                                 1077 cpu30:time
                                                                 1085 cpu31:time
                                                                      ahci0 112
                                                                   68 xhci0 113
                                                                      igb0:rxq0
                                                                   28 igb0:rxq1
                                                                      igb0:rxq2
                                                                      igb0:rxq3
                                                                      igb0:rxq4
                                                                      igb0:rxq5
                                                                      igb0:rxq6
                                                                      igb0:rxq7
                                                                      igb0:aq
                                                                    1 bxe0:sp
                                                                28378 bxe0:fp00
                                                                22496 bxe0:fp01
                                                                24798 bxe0:fp02
                                                                35004 bxe0:fp03
                                                                    1 bxe1:sp
                                                                29363 bxe1:fp00
                                                                29687 bxe1:fp01
                                                                29310 bxe1:fp02
                                                                38388 bxe1:fp03