Messages - ThomasE

#1
Hardware and Performance / Re: High CPU-load
June 11, 2025, 09:25:11 AM
Quote from: Kets_One on June 10, 2025, 05:31:40 PM
@thomasE

Great that you finally got to the bottom of this.
Did you return the dec4280 to Deciso or did you find a job it can handle?
Returning the appliance wouldn't have solved our problem, as we'd still have OPNsense running on a single, quite capable server that should be put to better use elsewhere. Under heavy network load we're already experiencing some issues, and as bandwidth increases this will eventually get worse. And I'm not even talking about the consequences of building a CARP cluster for redundancy. ;-)

We're currently working on a network redesign, migrating a significant part of the interfaces to switches and routing them via a transfer network to the OPNsense. That way we can reduce the number of interfaces on the OPNsense from around 300 to about 20. Fortunately, our switches are able to handle this - we've checked that already. ;-) Lots of work ahead, but at least it will simplify setting up CARP...
#2
Hardware and Performance / Re: High CPU-load
June 10, 2025, 03:44:57 PM
Quote from: Patrick M. Hausen on June 04, 2025, 01:39:17 PM
As a customer with an official Deciso appliance I would move this discussion from the community forum to an equally official support case.
We're in constant communication with our hardware vendor (who in turn talks with Deciso) and we finally got some answers:

OPNsense will eventually run into performance issues once the number of interfaces reaches three digits. That's the core of the problem, and it fits perfectly with what we observed. It has nothing to do with NICs or any other hardware/driver issues, the amount of traffic, open captive portal sessions or anything else. Using more powerful hardware will mitigate the problem - though likely only up to a point - but there's no way it can actually solve it. It's a software design issue: OPNsense is not optimized for a high number of interfaces. We're now in need of a network and firewall redesign. Nothing we can't handle, but obviously it would've been great if we had known this right from the beginning.

Maybe someone will take this as an opportunity to add this information to the documentation. Currently, the only hint that a "high number of users or interface assignments may be less practical" is hidden in a footnote of the appliances' product description. It doesn't say what "high number" means - that could be 10, 100 or 1,000 depending on who you ask - and "less practical" doesn't imply that the whole system will collapse. A hint that with decent hardware up to 100 interfaces is likely fine, and that beyond that performance issues are to be expected, would go a long way. :)

#3
Hardware and Performance / Re: High CPU-load
June 04, 2025, 01:37:15 PM
Quote from: meyergru on June 04, 2025, 10:40:23 AM
Initially, you said that a large number of updates causes the problem, not that it occurs out of the blue. When you deploy a new instance, this boils down to a high initial load with OpenVPN, because of the certificate handshakes.
Correct, that's where the whole thing started: We installed OPNsense on our somewhat older server hardware (8 cores, 16 threads, 128GB RAM). For the most part this worked just fine, but we had some issues during traffic spikes. After our attempts to solve the problem via tuning failed, we switched to the best server hardware available to us: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads), 384 GB RAM, you get the idea. This was meant to be a temporary solution as this hardware seemed way too much for that purpose and was intended for running 20+ virtual machines instead of just one firewall - and we'd need two of those machines for redundancy.

In order to rule out any hardware or driver issues, we decided to get the appliance - which performs much worse than our old server. :-(

Quote
Anyway, if the problem occurs even without specific traffic spikes, it seems to be the pure number of VLANs involved. I would argue that it is quite unusual to have that many.
I agree with you that this is indeed somewhat unusual, but that's what we've got... ;-)

Quote
Probably the routing table hashing mechanism in FreeBSD is not optimized for that (or can be tuned by enlarging the memory for it). As I said, I am by no means a FreeBSD expert, but I saw that you can even set the routing algorithm, see "sysctl net.route.algo.inet.algo".
My knowledge about FreeBSD is even more limited, but this looks like a good starting point for some more research... :)
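
In case anyone else wants to poke at this, here's a minimal sketch of those knobs, assuming a reasonably recent FreeBSD base (the algorithm names vary by version, so check the list first):

# show the lookup algorithm currently used for IPv4 routes
sysctl net.route.algo.inet.algo

# list the algorithms available in this kernel
sysctl net.route.algo.inet.algo_list

# switch at runtime - example name only, take one from the list above
sysctl net.route.algo.inet.algo=radix4_lockless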

Thanks
Thomas
#4
Hardware and Performance / Re: High CPU-load
June 04, 2025, 10:13:03 AM
Quote from: meyergru on June 04, 2025, 12:49:10 AM
So basically what happens is that 300 VLANs - which presumably connect to a similar number of client machines - use VPN connections, capped at 4 GBit/s total. When all of those act in parallel, the problem occurs.
The problem already occurs with no traffic at all. With the exception of one 1 Gbit/s interface used solely for administration and accessing the GUI, all other interfaces were physically disconnected. (They were - of course - enabled in the configuration.) There were some VPN servers configured and activated, but they weren't being used. To be precise, we have two legacy OpenVPN servers, one "new" OpenVPN instance for testing purposes and one WireGuard instance, also for testing. Apart from that, everything else is simple routing/NAT. Firewall table entries are at a total of 2% (18875/1000000).
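
(For anyone who wants to read the same counters on their own box, a quick sketch from the shell - assuming root access and stock pfctl:)

# pf memory limits, table-entries among them
pfctl -s memory

# number of addresses per pf table
for t in $(pfctl -s Tables); do
    echo "$t: $(pfctl -t "$t" -T show | wc -l)"
done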

Quote
Even then, there is no single user process that the load can be attributed to. Thus, I would guess that the VPN calculations are the culprit. Those may be delegated to the FreeBSD kernel. The interrupt load could also point to that (maybe induced by context switches).
While the first sentence is entirely true, there shouldn't be any VPN calculations whatsoever, as VPN wasn't even used and won't ever be used to a greater extent. Even in production, there are at most 10 OpenVPN connections (road warriors).

Quote
Depending on how the VPN infrastructure is built (site-2-site vs. client-2-site), you probably could use different VPN technologies (e.g. WireGuard) or employ different ciphers, which may lead to better parallelism, if my guess should turn out to be the underlying problem.
I do agree that a significant number of established VPN connections might indeed be an issue, but this is not the case.
#5
Hardware and Performance / Re: High CPU-load
June 03, 2025, 01:44:03 PM
We finally got our DEC4280 appliance and gave it a try. After installing all available updates, we imported our original configuration (with some slight changes to match the new device names). Bootup took around 15 minutes - a bit longer than usual, but that's OK. Even with just one network interface connected for administration, the GUI was extremely slow; the whole system was close to inoperable. A simple change to a firewall rule could take as long as a minute to apply. At this time there was absolutely no traffic being routed, no captive portal clients trying to connect - there was nothing at all!

So we have the best appliance available but the system won't even run our configuration without any network load? I'm aware that our setup is quite big, but is it really that much beyond what OPNsense can possibly handle? After all, it's not the network traffic that causes issues, and we aren't even thinking about things like intrusion detection - the only thing we've got is a lot of interfaces...
#6
As the integrated netflow feature only supports up to 8 interfaces simultaneously, we decided to set up an external server to collect netflow data for further processing. Since our hardware is quite capable (we thought), we activated netflow on all ~200 interfaces (mostly VLANs) at once, which basically crashed the whole system. The primary lesson, of course: never, ever, under any circumstances do anything on more than 10 interfaces at once - unless you're begging for trouble and feel a really strong urge to bring in cake the following day, which is how we usually deal with colleagues accidentally breaking something. ;-)

But seriously, how much load is to be expected per interface sending netflow data to an external server? Does it depend on the amount of traffic on that interface, or is that irrelevant? Is activating netflow on literally hundreds of interfaces something a well-equipped system should be able to handle, or is it way beyond what any powerful system can do?

At the moment this is not about identifying problems and finding tuning options to solve them - it's about making sure that what we want to do is something that actually can be done.
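
In the meantime we'll probably ramp it up in small batches and watch the cost of each step - a rough sketch of the watching part, assuming only stock FreeBSD tools:

# per-CPU load while a batch of interfaces exports netflow
top -P

# interrupt rates per device, to spot NIC/driver hotspots
vmstat -i

# context switches and system load, sampled every 5 seconds
vmstat -w 5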

Thanks
Thomas
#7
Quote from: doktornotor on April 10, 2025, 09:31:01 AM
I'd say the proper place is /usr/local/opnsense/service/templates/OPNsense/WebGui/php.ini - followed by

configctl webgui restart

for changes to have effect.

Thanks! That was it! I had already identified that file and modified it, but then ran

service configd restart

as suggested by - among others - ChatGPT, rather than

configctl webgui restart

which is what actually did the trick. :)
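
For anyone landing here via search, the working sequence was simply (path exactly as doktornotor quoted above):

# edit the template, not the generated /usr/local/etc/php.ini
vi /usr/local/opnsense/service/templates/OPNsense/WebGui/php.ini

# make the change take effect (per doktornotor's answer)
configctl webgui restart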

Thanks.
#8
Hi,

I'm currently getting the following error message whenever I hit "Apply Changes" in the interface configuration:

PHP Fatal error:  Maximum execution time of 300 seconds exceeded in /usr/local/etc/inc/plugins.inc.d/ipsec.inc on line 144
Occasionally it's in a different file, but I don't think that matters. After some examination I have a good idea of what's happening, and since I know what I did, I'm aware of what caused it. If I'm not totally wrong, I also know a way to fix this once and for all - I just don't know how to do it.

I was tasked with reconfiguring some 80 interfaces - a rather simple change of the respective interface IPs. Applying the changes for one interface takes between 30 and 60 seconds. I don't know why it takes that long, but it doesn't really matter: we have a rather big setup that is working just fine, and this kind of change is not something we do frequently. Rather than doing something every single minute or so, I figured it might be a good idea to do a bulk change - reconfigure a larger number of interfaces, hit "Apply changes" once, and use the much longer wait to do something else. This turned out to be a bad idea, because after some time I got the error message mentioned above. About 20 interfaces had been reconfigured as expected - the others hadn't. Hitting "Apply changes" again starts the whole process from the very beginning, which I suppose is "by design".

I can think of three different approaches to this problem - in order of preference:

  1. Applying changes to each interface individually on the CLI.
  2. Temporarily increasing PHP's max_execution_time.
  3. Rebooting the system.

So far I haven't found a way to do (1) and I'm open to suggestions. As for (2), I have tried modifying max_execution_time in /usr/local/etc/php.ini as well as in /usr/local/lib/php.ini, which are the only places where max_execution_time=300 is set. (I changed the value to 3000, as this is only meant to be a temporary fix.) However, that doesn't seem to change anything. Rebooting the system (3) would cause downtime that I'd like to avoid; that will be my last option if everything else fails.
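
In case anyone wants to double-check where the limit comes from, a quick sketch:

# find every php.ini that pins max_execution_time
grep -n max_execution_time /usr/local/etc/php.ini /usr/local/lib/php.ini

# what the CLI PHP reports - the web GUI's PHP may differ
php -i | grep max_execution_time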

Thanks
Thomas
#9
24.7, 24.10 Series / Re: SSL errors on console...
March 24, 2025, 09:25:25 AM
Sadly, I just found out why there are no more error messages: The whole CP is down... :-(

According to the log files, the service is starting normally, but ports 8000 and 9000 are closed. We checked the limits (ulimit -a) for root and www users - no problem there. Any ideas?
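
(For reference, confirming the closed ports from the shell is straightforward - a sketch using stock FreeBSD tools:)

# IPv4 listening sockets on the CP ports
sockstat -4l | grep -E ':(8000|9000)'

# is lighttpd running at all for the CP zone?
ps auxww | grep '[l]ighttpd'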

But there're even more strange things...

ls -lh /var/etc/lighttpd-cp-zone-0.conf
-rw-r--r--  1 root wheel  1.2M Mar 24 08:15 /var/etc/lighttpd-cp-zone-0.conf
Notice the file size. At the beginning of the file there's one extremely long line consisting of invisible (non-printing) characters. After that, there's this:

              # ssl enabled, redirect to https
                       
#############################################################################################
###  Captive portal zone 0 lighttpd.conf  BEGIN
###  -- listen on port 8000 for primary (ssl) connections
###  -- forward on port 9000 for plain http redirection
#############################################################################################
#
#### modules to load
server.modules           = ( "mod_expire",
                             "mod_auth",
                             "mod_redirect",
                             "mod_access",
                             "mod_deflate",
                             "mod_status",
                             "mod_rewrite",
                             "mod_proxy",
                             "mod_setenv",
Nothing suspicious after that. Now, I don't think this has anything to do with our original problem, but it still looks strange to me...
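
(If anyone wants to see what those invisible characters actually are, a quick sketch:)

# raw bytes of the first line
head -n 1 /var/etc/lighttpd-cp-zone-0.conf | hexdump -C | head -n 20

# line count vs. byte count of the whole file
wc -lc /var/etc/lighttpd-cp-zone-0.conf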
#10
24.7, 24.10 Series / Re: SSL errors on console...
March 21, 2025, 02:46:39 PM
Well, I think it's safe to assume those limits are there for a reason ranging from "Nothing. You may just waste some resources that nobody really cares about on any modern system." to "Setting this too high may eventually crash the whole thing." - and since I'm on a production system, I prefer being a bit too careful rather than sorry. ;-)

Anyway, it looks like "Mission accomplished!" to me. I kept doubling those numbers until the messages disappeared. I ended up with

## number of file descriptors (leave off for lighty loaded sites)
server.max-fds         = 131072

## maximum concurrent connections the server will accept (1/2 of server.max-fds)
server.max-connections = 65536

which I suppose gives you an idea of just how busy our CP is. ;-)

Besides, I'm not sure whether it's related or just because it's Friday afternoon, but the load average on the whole system seems to be down significantly - from ~15 to ~3.

Thanks and have a nice weekend!
Thomas

#11
24.7, 24.10 Series / Re: SSL errors on console...
March 20, 2025, 03:59:17 PM
Hi Franco,

looks like that helped... a little bit... maybe... ;-)

There are still a lot of messages, but the rate seems to be somewhat lower - down by roughly a third, I would guess. I do assume that this is because of the applied patch, not because of less traffic on the captive portal. Is it safe to simply increase those numbers until everything works? By that I mean applying a factor of 2, 4, 8 or perhaps 16 at most - not hundreds or thousands. ;-)
#12
24.7, 24.10 Series / Re: SSL errors on console...
March 19, 2025, 08:14:48 AM
Great, I'd be more than happy to give it a try and tell you what happened. ;-)

I found the options in

/var/etc/lighttpd-cp-zone-0.conf

where they're unset. Of course I could simply edit that file, but I'm afraid the change will be overwritten as soon as I make any changes to the CP configuration, if not sooner. So, what would be the best place to make this setting persistent?

Thanks!
Thomas
#13
24.7, 24.10 Series / Re: SSL errors on console...
March 18, 2025, 04:20:20 PM
The process is lighttpd, /usr/obj/usr/ports/www/lighttpd/work/lighttpd-1.4.76/src/mod_openssl.c.{3510|3470}. (I found those two numbers in a single screenshot. There may be others. ;-)
#14
24.7, 24.10 Series / SSL errors on console...
March 13, 2025, 09:40:14 AM
Hi there,

we're running OPNsense 24.10 and we believe that the following problem comes from our captive portal setup:

We're literally getting hundreds of SSL error messages per minute logged to the console, which basically makes it unusable. We can - and in most cases do - log in via SSH and do everything from there, so it's not that big of an issue. But seriously, it is my understanding that only really critical messages should be logged to the console. Unless I'm completely mistaken and those messages really do indicate that something is going badly wrong and we must act on it, I'm looking for a way to simply get rid of them by sending them to a log file only.

Here are a few examples of what's being logged:

error: 0A000417: SSL routines::sslv3 alert illegal parameter
error: 0A000102: SSL routines::unsupported protocol
error: 0A0000EB: SSL routines::no application protocol
error: 0A000076: SSL routines::no suitable signature algorithm
error: 0A00010B: SSL routines::wrong version number

Those seem to be by far the most frequent error messages, but there are others, too. They come at a rate of ~200 per minute, so it really is quite a lot. The system itself seems to be working just fine - except maybe for a somewhat high load average that we think is unrelated to this problem.

Any suggestions what we could do about it?

Thanks
Thomas
:)
#15
Hardware and Performance / Re: High CPU-load
March 13, 2025, 08:59:12 AM
Hi,

thanks for all the input you gave. For now, I think we mitigated (rather than solved) the problem simply by throwing more hardware at it. We now have an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (32 cores, 64 threads) with 384 GB of RAM. I'd assume this setup is severely overpowered. Although we still have a load average of around 15 during normal operations, the system no longer goes down under pressure, which was our primary goal.

The current solution is only temporary, as we're planning to get a DEC4280 appliance in the near future. Should the problem persist after that, I'll come back... ;-)

Thomas
:)