Need Help - Debugging slow webgui

Started by ianf, November 08, 2022, 01:09:48 PM

Previous topic - Next topic
Hi there,

We're a small business running OPNSense 22.7.6.

We've been experiencing a really slow webserver for a few weeks now, and don't know what might lead to this.

I'm looking to debug this problem.

What do I mean by "slow"?
- After logging in, it takes up to 2 minutes to show the dashboard, then another 10-20 seconds to show utilization
- Both the traffic graph and system logs are not working on the dashboard (System log reports "An empty response from the server")
- Basically every click takes ages to do anything
- Firmware: Packages reports no packages being found on the system.
- and more..

I updated to 22.7.6 yesterday, since I hoped that would help. Afterwards I ran the health check, which succeeded. Now I'm not even able to do that anymore, since the firmware status page takes ages to load.

It always runs smoothly after rebooting, but only for a few clicks, then it goes back to being slow.

CPU and memory usage are as low as 2% and 20%.

The only installed plugins are:

os-dmidecode
os-iperf
os-node_exporter

Please tell me how to proceed, to find the culprit.

Thanks,
Ian

There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.

Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?

Quote from: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.

I guess you're talking about logging of FW rules?

That sounds promising, as we're logging a lot. I will look into that ASAP.
Can you please point me to the release notes for that, I must've missed those.

Thanks!

Quote from: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?

I'll double check, but I'm quite sure it's just one IP.

Quote from: ianf on November 08, 2022, 03:42:19 PM
Quote from: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.

I guess you're talking about logging of FW rules?

That sounds promising, as we're logging a lot. I will look into that ASAP.
Can you please point me to the release notes for that, I must've missed those.

Thanks!
It was easy to miss because there were no action expected to be required from the user but there were reports in the immediacy of the release.
https://forum.opnsense.org/index.php?topic=26536.0
https://forum.opnsense.org/index.php?topic=28054.0
Some seem to have been an interaction between the change and mfs but nonetheless I suggest to investigate if it can be that amount of logs to read what could be causing the problem.

November 08, 2022, 05:59:07 PM #6 Last Edit: November 08, 2022, 06:21:04 PM by ianf
I have now disabled logging for all rules (but default ones), but the issue still persists. A reboot shouldn't be necessary, am I right?

Also logging to memory was disabled already.

As of now I can't even read my system logs, because they just aren't loading.

Connecting via SSH and reading from there works great, only a bit slow right when I switch to the root user.

Quote from: ianf on November 08, 2022, 03:43:24 PM
Quote from: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?

I'll double check, but I'm quite sure it's just one IP.

Running the DNS lookup via GUI didn't work (I never got a result). Through the shell I did confirm resolving it's FQDN @localhost returned one IP only.

For the viewing logs item, now that you have stopped the abundance of logging, you could still have a large amount on the filesystem that needs reading by the front end. What I would do to continue eliminating this as the problem, I'd ssh, list the files and verify their size and space consumed, to have an idea, before proceeding with moving them to another location and truncate the latest.log file if necessary. Followed by logging out of UI and logging back in. Be sure to note the owner and group.

[penguin@OPNsense ~]$ sudo ls -alh /var/log/filter/
total 194906
drwx------   2 root  wheel    34B Nov  8 20:01 .
drwxr-xr-x  24 root  wheel    41B Nov  8 03:01 ..
-rw-------   1 root  wheel    18M Oct  9 23:59 filter_20221009.log
-rw-------   1 root  wheel    23M Oct 10 23:59 filter_20221010.log
-rw-------   1 root  wheel    19M Oct 11 23:59 filter_20221011.log
-rw-------   1 root  wheel    27M Oct 12 23:59 filter_20221012.log
-rw-------   1 root  wheel    27M Oct 13 23:59 filter_20221013.log
-rw-------   1 root  wheel    32M Oct 14 23:59 filter_20221014.log
-rw-------   1 root  wheel    23M Oct 15 23:59 filter_20221015.log
-rw-------   1 root  wheel    16M Oct 16 23:59 filter_20221016.log
-rw-------   1 root  wheel    15M Oct 17 23:59 filter_20221017.log
-rw-------   1 root  wheel    14M Oct 18 23:59 filter_20221018.log
-rw-------   1 root  wheel    15M Oct 19 23:59 filter_20221019.log
-rw-------   1 root  wheel    19M Oct 20 23:59 filter_20221020.log
-rw-------   1 root  wheel    22M Oct 21 23:59 filter_20221021.log
-rw-------   1 root  wheel    19M Oct 22 23:59 filter_20221022.log
-rw-------   1 root  wheel    16M Oct 23 23:59 filter_20221023.log
-rw-------   1 root  wheel    20M Oct 24 23:59 filter_20221024.log
-rw-------   1 root  wheel    20M Oct 25 23:59 filter_20221025.log
-rw-------   1 root  wheel    24M Oct 26 23:59 filter_20221026.log
-rw-------   1 root  wheel    22M Oct 27 23:59 filter_20221027.log
-rw-------   1 root  wheel    20M Oct 28 23:59 filter_20221028.log
-rw-------   1 root  wheel    17M Oct 29 23:59 filter_20221029.log
-rw-------   1 root  wheel    17M Oct 30 23:59 filter_20221030.log
-rw-------   1 root  wheel    21M Oct 31 23:59 filter_20221031.log
-rw-------   1 root  wheel    17M Nov  1 23:59 filter_20221101.log
-rw-------   1 root  wheel    16M Nov  2 23:59 filter_20221102.log
-rw-------   1 root  wheel    19M Nov  3 23:59 filter_20221103.log
-rw-------   1 root  wheel    20M Nov  4 23:59 filter_20221104.log
-rw-------   1 root  wheel    20M Nov  5 23:59 filter_20221105.log
-rw-------   1 root  wheel    17M Nov  6 23:59 filter_20221106.log
-rw-------   1 root  wheel    21M Nov  7 23:59 filter_20221107.log
-rw-------   1 root  wheel    17M Nov  8 20:32 filter_20221108.log
lrwxr-x---   1 root  wheel    35B Nov  8 20:01 latest.log -> /var/log/filter/filter_20221108.log

[penguin@OPNsense ~]$ sudo df -ah /var/log/filter/
Filesystem       Size    Used   Avail Capacity  Mounted on
zroot/var/log     93G    252M     93G     0%    /var/log

[penguin@OPNsense ~]$ zfs list
NAME                        USED  AVAIL     REFER  MOUNTPOINT
zroot                      6.86G  92.9G       88K  /zroot
zroot/ROOT                 6.59G  92.9G       88K  none
zroot/ROOT/21.7.8_backup      8K  92.9G     1.84G  /
zroot/ROOT/22.1.10_backup     8K  92.9G     2.03G  /
zroot/ROOT/22.1.2             8K  92.9G     1.84G  /
zroot/ROOT/22.7            6.59G  92.9G     1.98G  /
zroot/tmp                  1.14M  92.9G     1.14M  /tmp
zroot/usr                   352K  92.9G       88K  /usr
zroot/usr/home               88K  92.9G       88K  /usr/home
zroot/usr/ports              88K  92.9G       88K  /usr/ports
zroot/usr/src                88K  92.9G       88K  /usr/src
zroot/var                   252M  92.9G       88K  /var
zroot/var/audit              88K  92.9G       88K  /var/audit
zroot/var/crash              88K  92.9G       88K  /var/crash
zroot/var/log               252M  92.9G      252M  /var/log
zroot/var/mail              112K  92.9G      112K  /var/mail
zroot/var/tmp               112K  92.9G      112K  /var/tmp

Above to show you my current usage for comparison.
What you can see is that I'm using the defaults in System > Settings > Logging. That is Preserve logs days I have it empty, giving me the last 31 days (enable full help at top right). All the options there I have disabled (unticked). I just noticed, there is an option there to reset log files. Make sure you are comfortable doing so.
This is just a thought. I could be entirely wrong and the size of logs play no part in the rendering of the UI pages, but that's what I would do to investigate.

I'm currently looking at around 186GB of logs. I have a feeling this might be it :). Currently working on getting them off the drive, thanks already.
I'll update once I'm done!

I deleted most of the logs, sized down the latest.log as well.
After logging out and back in, issues persisted. I deleted all logs via GUI. Issues still persisted. After a reboot it seems to now be running better, I clicked around a bit and it seems to stay responsive.
I'll update again, if anything changes.

Thanks already for all your help!

December 05, 2022, 02:28:07 PM #11 Last Edit: December 05, 2022, 02:37:43 PM by ianf
Sadly I have to come back to this.

After deactivating the logs it has gotten better, until just now, we experienced another web GUI crash. It's no longer responsive whatsoever, the page doesn't even load. I ran /usr/local/etc/rc.restart_webgui in hopes that would help, as all other services are running just fine. This took about 3 minutes to complete, but I'm still left with a non-responding OPNSense web server.

Update: I was finally able to access the webgui after 5+ minutes, but still everything takes ages to do, nothing loads, errors everywhere..

Any help is appreciated.

Recognized the slow Gui as well. DNS Lookup to FW Hostname is working ok.
Deleted all Logs, slowness still persists.

Have still these errors in System: Log Files: Web GUI

2022-12-05T11:06:05 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-05T11:06:05 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 80455
2022-12-04T14:15:51 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-04T14:15:51 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 34252


In my Case when I connect via IP Address to the Box the GUI is reacting fast..., connecting via FW Hostname is still slow...

BR
Cheers,
Crissi

For me this isn't the case. Even via IP the webgui is basically unusable.

Update:
I found a workaround:
1. restart configd through webgui
2. disable monit
3. restart the monit service via ssh:
# configctl monit restart
4. enable monit

How I found it:
I looked at zombie processes, as I thought those might give me an idea of what could be the problem.

# ps aux | grep 'Z'
USER       PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED        TIME COMMAND
root     11085   0.0  0.0      0     0  -  Z    12:51       0:00.11 <defunct>

# ps -o ppid= -p 11085
80828

# ps aux -p 80828
USER   PID %CPU %MEM   VSZ  RSS TT  STAT STARTED    TIME COMMAND
root 80828  0.0  0.1 22572 8976  -  I    Sat00   0:09.21 /usr/local/bin/monit -c /usr/local/etc/monitrc


Now I knew it had to do with monit. I disabled it first, but that didn't help. I tried restarting through configctl:

# configctl monit restart
unable to connect to configd socket (@/var/run/configd.socket)

Therefore I knew configctl was not working properly, and I restarted configd from the webgui.

Now I was able to run the restart command from above, which led to a much more responsive webserver already. Now I just re-enabled monit, and it's running fine.

I hope it stays that way, otherwise I know at least have a workaround.