Hi there,
We're a small business running OPNSense 22.7.6.
We've been experiencing a really slow webserver for a few weeks now, and don't know what might lead to this.
I'm looking to debug this problem.
What do I mean by "slow"?
- After logging in, it takes up to 2 minutes to show the dashboard, then another 10-20 seconds to show utilization
- Both the traffic graph and system logs are not working on the dashboard (System log reports "An empty response from the server")
- Basically every click takes ages to do anything
- Firmware: Packages reports no packages being found on the system.
- and more..
I updated to 22.7.6 yesterday, since I hoped that would help. Afterwards I ran the health check, which succeeded. Now I'm not even able to do that anymore, since the firmware status page takes ages to load.
It always runs smoothly after rebooting, but only for a few clicks, then it goes back to being slow.
CPU and memory usage are as low as 2% and 20%.
The only installed plugins are:
os-dmidecode
os-iperf
os-node_exporter
Please tell me how to proceed, to find the culprit.
Thanks,
Ian
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?
Quote from: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.
I guess you're talking about logging of FW rules?
That sounds promising, as we're logging a lot. I will look into that ASAP.
Can you please point me to the release notes for that, I must've missed those.
Thanks!
Quote from: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?
I'll double check, but I'm quite sure it's just one IP.
Quote from: ianf on November 08, 2022, 03:42:19 PM
Quote from: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.
I guess you're talking about logging of FW rules?
That sounds promising, as we're logging a lot. I will look into that ASAP.
Can you please point me to the release notes for that, I must've missed those.
Thanks!
It was easy to miss because there were no action expected to be required from the user but there were reports in the immediacy of the release.
https://forum.opnsense.org/index.php?topic=26536.0
https://forum.opnsense.org/index.php?topic=28054.0
Some seem to have been an interaction between the change and mfs but nonetheless I suggest to investigate if it can be that amount of logs to read what could be causing the problem.
I have now disabled logging for all rules (but default ones), but the issue still persists. A reboot shouldn't be necessary, am I right?
Also logging to memory was disabled already.
As of now I can't even read my system logs, because they just aren't loading.
Connecting via SSH and reading from there works great, only a bit slow right when I switch to the root user.
Quote from: ianf on November 08, 2022, 03:43:24 PM
Quote from: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?
I'll double check, but I'm quite sure it's just one IP.
Running the DNS lookup via GUI didn't work (I never got a result). Through the shell I did confirm resolving it's FQDN @localhost returned one IP only.
For the viewing logs item, now that you have stopped the abundance of logging, you could still have a large amount on the filesystem that needs reading by the front end. What I would do to continue eliminating this as the problem, I'd ssh, list the files and verify their size and space consumed, to have an idea, before proceeding with moving them to another location and truncate the latest.log file if necessary. Followed by logging out of UI and logging back in. Be sure to note the owner and group.
[penguin@OPNsense ~]$ sudo ls -alh /var/log/filter/
total 194906
drwx------ 2 root wheel 34B Nov 8 20:01 .
drwxr-xr-x 24 root wheel 41B Nov 8 03:01 ..
-rw------- 1 root wheel 18M Oct 9 23:59 filter_20221009.log
-rw------- 1 root wheel 23M Oct 10 23:59 filter_20221010.log
-rw------- 1 root wheel 19M Oct 11 23:59 filter_20221011.log
-rw------- 1 root wheel 27M Oct 12 23:59 filter_20221012.log
-rw------- 1 root wheel 27M Oct 13 23:59 filter_20221013.log
-rw------- 1 root wheel 32M Oct 14 23:59 filter_20221014.log
-rw------- 1 root wheel 23M Oct 15 23:59 filter_20221015.log
-rw------- 1 root wheel 16M Oct 16 23:59 filter_20221016.log
-rw------- 1 root wheel 15M Oct 17 23:59 filter_20221017.log
-rw------- 1 root wheel 14M Oct 18 23:59 filter_20221018.log
-rw------- 1 root wheel 15M Oct 19 23:59 filter_20221019.log
-rw------- 1 root wheel 19M Oct 20 23:59 filter_20221020.log
-rw------- 1 root wheel 22M Oct 21 23:59 filter_20221021.log
-rw------- 1 root wheel 19M Oct 22 23:59 filter_20221022.log
-rw------- 1 root wheel 16M Oct 23 23:59 filter_20221023.log
-rw------- 1 root wheel 20M Oct 24 23:59 filter_20221024.log
-rw------- 1 root wheel 20M Oct 25 23:59 filter_20221025.log
-rw------- 1 root wheel 24M Oct 26 23:59 filter_20221026.log
-rw------- 1 root wheel 22M Oct 27 23:59 filter_20221027.log
-rw------- 1 root wheel 20M Oct 28 23:59 filter_20221028.log
-rw------- 1 root wheel 17M Oct 29 23:59 filter_20221029.log
-rw------- 1 root wheel 17M Oct 30 23:59 filter_20221030.log
-rw------- 1 root wheel 21M Oct 31 23:59 filter_20221031.log
-rw------- 1 root wheel 17M Nov 1 23:59 filter_20221101.log
-rw------- 1 root wheel 16M Nov 2 23:59 filter_20221102.log
-rw------- 1 root wheel 19M Nov 3 23:59 filter_20221103.log
-rw------- 1 root wheel 20M Nov 4 23:59 filter_20221104.log
-rw------- 1 root wheel 20M Nov 5 23:59 filter_20221105.log
-rw------- 1 root wheel 17M Nov 6 23:59 filter_20221106.log
-rw------- 1 root wheel 21M Nov 7 23:59 filter_20221107.log
-rw------- 1 root wheel 17M Nov 8 20:32 filter_20221108.log
lrwxr-x--- 1 root wheel 35B Nov 8 20:01 latest.log -> /var/log/filter/filter_20221108.log
[penguin@OPNsense ~]$ sudo df -ah /var/log/filter/
Filesystem Size Used Avail Capacity Mounted on
zroot/var/log 93G 252M 93G 0% /var/log
[penguin@OPNsense ~]$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
zroot 6.86G 92.9G 88K /zroot
zroot/ROOT 6.59G 92.9G 88K none
zroot/ROOT/21.7.8_backup 8K 92.9G 1.84G /
zroot/ROOT/22.1.10_backup 8K 92.9G 2.03G /
zroot/ROOT/22.1.2 8K 92.9G 1.84G /
zroot/ROOT/22.7 6.59G 92.9G 1.98G /
zroot/tmp 1.14M 92.9G 1.14M /tmp
zroot/usr 352K 92.9G 88K /usr
zroot/usr/home 88K 92.9G 88K /usr/home
zroot/usr/ports 88K 92.9G 88K /usr/ports
zroot/usr/src 88K 92.9G 88K /usr/src
zroot/var 252M 92.9G 88K /var
zroot/var/audit 88K 92.9G 88K /var/audit
zroot/var/crash 88K 92.9G 88K /var/crash
zroot/var/log 252M 92.9G 252M /var/log
zroot/var/mail 112K 92.9G 112K /var/mail
zroot/var/tmp 112K 92.9G 112K /var/tmp
Above to show you my current usage for comparison.
What you can see is that I'm using the defaults in System > Settings > Logging. That is Preserve logs days I have it empty, giving me the last 31 days (enable full help at top right). All the options there I have disabled (unticked). I just noticed, there is an option there to reset log files. Make sure you are comfortable doing so.
This is just a thought. I could be entirely wrong and the size of logs play no part in the rendering of the UI pages, but that's what I would do to investigate.
I'm currently looking at around 186GB of logs. I have a feeling this might be it :). Currently working on getting them off the drive, thanks already.
I'll update once I'm done!
I deleted most of the logs, sized down the latest.log as well.
After logging out and back in, issues persisted. I deleted all logs via GUI. Issues still persisted. After a reboot it seems to now be running better, I clicked around a bit and it seems to stay responsive.
I'll update again, if anything changes.
Thanks already for all your help!
Sadly I have to come back to this.
After deactivating the logs it has gotten better, until just now, we experienced another web GUI crash. It's no longer responsive whatsoever, the page doesn't even load. I ran /usr/local/etc/rc.restart_webgui in hopes that would help, as all other services are running just fine. This took about 3 minutes to complete, but I'm still left with a non-responding OPNSense web server.
Update: I was finally able to access the webgui after 5+ minutes, but still everything takes ages to do, nothing loads, errors everywhere..
Any help is appreciated.
Recognized the slow Gui as well. DNS Lookup to FW Hostname is working ok.
Deleted all Logs, slowness still persists.
Have still these errors in System: Log Files: Web GUI
2022-12-05T11:06:05 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-05T11:06:05 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 80455
2022-12-04T14:15:51 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-04T14:15:51 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 34252
In my Case when I connect via IP Address to the Box the GUI is reacting fast..., connecting via FW Hostname is still slow...
BR
For me this isn't the case. Even via IP the webgui is basically unusable.
Update:
I found a workaround:
1. restart configd through webgui
2. disable monit
3. restart the monit service via ssh:
# configctl monit restart
4. enable monit
How I found it:
I looked at zombie processes, as I thought those might give me an idea of what could be the problem.
# ps aux | grep 'Z'
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11085 0.0 0.0 0 0 - Z 12:51 0:00.11 <defunct>
# ps -o ppid= -p 11085
80828
# ps aux -p 80828
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 80828 0.0 0.1 22572 8976 - I Sat00 0:09.21 /usr/local/bin/monit -c /usr/local/etc/monitrc
Now I knew it had to do with monit. I disabled it first, but that didn't help. I tried restarting through configctl:
# configctl monit restart
unable to connect to configd socket (@/var/run/configd.socket)
Therefore I knew configctl was not working properly, and I restarted configd from the webgui.
Now I was able to run the restart command from above, which led to a much more responsive webserver already. Now I just re-enabled monit, and it's running fine.
I hope it stays that way, otherwise I know at least have a workaround.
Update:
I just updated to opnsense-22.7.9_3, and the logs supported my suspicion of configd being broken:
From the update logs:
...
Starting configd.
Bus error (core dumped)
/usr/local/etc/rc.d/configd: WARNING: failed to start configd
...
Does anyone know how to fix this?
You can't like this but I'm beginning to think there might be hardware problems behind apparently unrelated problems "Bus error (core dumped)"
I would personally be inclined to save the config you can. Reinstall and restore config.
To see if the problems go. I imagine they won't but at least you'll be able to switch from chasing software problems.
I guess that might be the case. We've reinstalled multiple times by now, and as it's not stayed better, I guess hardware has the highest probability. Do you or anyone else have any insights on what the `Bus error` could stand for?
Log in via SSH and type
dmesg
That possibly shows something.
dmesg shows a bunch of errors:
pid 88001 (php-cgi), jid 0, uid 0: exited on signal 10 (core dumped)
pid 42076 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
...
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (53 occurrences)
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (62 occurrences)
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (62 occurrences)
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (135 occurrences)
...
pid 12580 (php), jid 0, uid 0: exited on signal 11 (core dumped)
pid 50157 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 21 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 41295 (python3.9), jid 0, uid 0: exited on signal 10 (core dumped)
pid 61620 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 53937 (php), jid 0, uid 0: exited on signal 11 (core dumped)
pid 10294 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 85021 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 26932 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
...
those are the ones I'm seing the most, apart from arp changes.
Hints at a memory or CPU problem. Overheating can also cause these.
Core temps sit between 34 and 39°C, so I guess RAM or CPU. I'll look into replacing our unit then.
Quote from: crissi on December 05, 2022, 04:14:08 PM
Recognized the slow Gui as well. DNS Lookup to FW Hostname is working ok.
Deleted all Logs, slowness still persists.
Have still these errors in System: Log Files: Web GUI
2022-12-05T11:06:05 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-05T11:06:05 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 80455
2022-12-04T14:15:51 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-04T14:15:51 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 34252
In my Case when I connect via IP Address to the Box the GUI is reacting fast..., connecting via FW Hostname is still slow...
BR
I have the exact same problem