OPNsense Forum

Archive => 22.7 Legacy Series => Topic started by: ianf on November 08, 2022, 01:09:48 PM

Title: Need Help - Debugging slow webgui
Post by: ianf on November 08, 2022, 01:09:48 PM
Hi there,

We're a small business running OPNSense 22.7.6.

We've been experiencing a really slow webserver for a few weeks now, and don't know what might lead to this.

I'm looking to debug this problem.

What do I mean by "slow"?
- After logging in, it takes up to 2 minutes to show the dashboard, then another 10-20 seconds to show utilization
- Both the traffic graph and system logs are not working on the dashboard (System log reports "An empty response from the server")
- Basically every click takes ages to do anything
- Firmware: Packages reports no packages being found on the system.
- and more..

I updated to 22.7.6 yesterday, since I hoped that would help. Afterwards I ran the health check, which succeeded. Now I'm not even able to do that anymore, since the firmware status page takes ages to load.

It always runs smoothly after rebooting, but only for a few clicks, then it goes back to being slow.

CPU and memory usage are as low as 2% and 20%.

The only installed plugins are:

os-dmidecode
os-iperf
os-node_exporter

Please tell me how to proceed, to find the culprit.

Thanks,
Ian
Title: Re: Need Help - Debugging slow webgui
Post by: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.
Title: Re: Need Help - Debugging slow webgui
Post by: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on November 08, 2022, 03:42:19 PM
Quote from: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.

I guess you're talking about logging of FW rules?

That sounds promising, as we're logging a lot. I will look into that ASAP.
Can you please point me to the release notes for that, I must've missed those.

Thanks!
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on November 08, 2022, 03:43:24 PM
Quote from: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?

I'll double check, but I'm quite sure it's just one IP.
Title: Re: Need Help - Debugging slow webgui
Post by: cookiemonster on November 08, 2022, 04:18:40 PM
Quote from: ianf on November 08, 2022, 03:42:19 PM
Quote from: cookiemonster on November 08, 2022, 01:20:51 PM
There were reports when the logging mechanism changed a couple of releases back I think. I think it was a removal or addition of circular logging. In short, we were told by release notes to revise the logging and reduce as necessary. If not done, then there was a lot of logging going to disk, then the UI would show these type of symptoms when trying to retrieve from large amounts of data.
I'm not saying that is the problem in this case, but it won't hurt to check that.

I guess you're talking about logging of FW rules?

That sounds promising, as we're logging a lot. I will look into that ASAP.
Can you please point me to the release notes for that, I must've missed those.

Thanks!
It was easy to miss because there were no action expected to be required from the user but there were reports in the immediacy of the release.
https://forum.opnsense.org/index.php?topic=26536.0
https://forum.opnsense.org/index.php?topic=28054.0
Some seem to have been an interaction between the change and mfs but nonetheless I suggest to investigate if it can be that amount of logs to read what could be causing the problem.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on November 08, 2022, 05:59:07 PM
I have now disabled logging for all rules (but default ones), but the issue still persists. A reboot shouldn't be necessary, am I right?

Also logging to memory was disabled already.

As of now I can't even read my system logs, because they just aren't loading.

Connecting via SSH and reading from there works great, only a bit slow right when I switch to the root user.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on November 08, 2022, 06:20:39 PM
Quote from: ianf on November 08, 2022, 03:43:24 PM
Quote from: weeßicknich on November 08, 2022, 02:00:36 PM
Could you do a DNS lookup of the hostname you are using to access the GUI and see if it returns a single address or multiple values, some of which might not be reachable beacuse of firewall rules?

I'll double check, but I'm quite sure it's just one IP.

Running the DNS lookup via GUI didn't work (I never got a result). Through the shell I did confirm resolving it's FQDN @localhost returned one IP only.
Title: Re: Need Help - Debugging slow webgui
Post by: cookiemonster on November 08, 2022, 09:55:50 PM
For the viewing logs item, now that you have stopped the abundance of logging, you could still have a large amount on the filesystem that needs reading by the front end. What I would do to continue eliminating this as the problem, I'd ssh, list the files and verify their size and space consumed, to have an idea, before proceeding with moving them to another location and truncate the latest.log file if necessary. Followed by logging out of UI and logging back in. Be sure to note the owner and group.

[penguin@OPNsense ~]$ sudo ls -alh /var/log/filter/
total 194906
drwx------   2 root  wheel    34B Nov  8 20:01 .
drwxr-xr-x  24 root  wheel    41B Nov  8 03:01 ..
-rw-------   1 root  wheel    18M Oct  9 23:59 filter_20221009.log
-rw-------   1 root  wheel    23M Oct 10 23:59 filter_20221010.log
-rw-------   1 root  wheel    19M Oct 11 23:59 filter_20221011.log
-rw-------   1 root  wheel    27M Oct 12 23:59 filter_20221012.log
-rw-------   1 root  wheel    27M Oct 13 23:59 filter_20221013.log
-rw-------   1 root  wheel    32M Oct 14 23:59 filter_20221014.log
-rw-------   1 root  wheel    23M Oct 15 23:59 filter_20221015.log
-rw-------   1 root  wheel    16M Oct 16 23:59 filter_20221016.log
-rw-------   1 root  wheel    15M Oct 17 23:59 filter_20221017.log
-rw-------   1 root  wheel    14M Oct 18 23:59 filter_20221018.log
-rw-------   1 root  wheel    15M Oct 19 23:59 filter_20221019.log
-rw-------   1 root  wheel    19M Oct 20 23:59 filter_20221020.log
-rw-------   1 root  wheel    22M Oct 21 23:59 filter_20221021.log
-rw-------   1 root  wheel    19M Oct 22 23:59 filter_20221022.log
-rw-------   1 root  wheel    16M Oct 23 23:59 filter_20221023.log
-rw-------   1 root  wheel    20M Oct 24 23:59 filter_20221024.log
-rw-------   1 root  wheel    20M Oct 25 23:59 filter_20221025.log
-rw-------   1 root  wheel    24M Oct 26 23:59 filter_20221026.log
-rw-------   1 root  wheel    22M Oct 27 23:59 filter_20221027.log
-rw-------   1 root  wheel    20M Oct 28 23:59 filter_20221028.log
-rw-------   1 root  wheel    17M Oct 29 23:59 filter_20221029.log
-rw-------   1 root  wheel    17M Oct 30 23:59 filter_20221030.log
-rw-------   1 root  wheel    21M Oct 31 23:59 filter_20221031.log
-rw-------   1 root  wheel    17M Nov  1 23:59 filter_20221101.log
-rw-------   1 root  wheel    16M Nov  2 23:59 filter_20221102.log
-rw-------   1 root  wheel    19M Nov  3 23:59 filter_20221103.log
-rw-------   1 root  wheel    20M Nov  4 23:59 filter_20221104.log
-rw-------   1 root  wheel    20M Nov  5 23:59 filter_20221105.log
-rw-------   1 root  wheel    17M Nov  6 23:59 filter_20221106.log
-rw-------   1 root  wheel    21M Nov  7 23:59 filter_20221107.log
-rw-------   1 root  wheel    17M Nov  8 20:32 filter_20221108.log
lrwxr-x---   1 root  wheel    35B Nov  8 20:01 latest.log -> /var/log/filter/filter_20221108.log

[penguin@OPNsense ~]$ sudo df -ah /var/log/filter/
Filesystem       Size    Used   Avail Capacity  Mounted on
zroot/var/log     93G    252M     93G     0%    /var/log

[penguin@OPNsense ~]$ zfs list
NAME                        USED  AVAIL     REFER  MOUNTPOINT
zroot                      6.86G  92.9G       88K  /zroot
zroot/ROOT                 6.59G  92.9G       88K  none
zroot/ROOT/21.7.8_backup      8K  92.9G     1.84G  /
zroot/ROOT/22.1.10_backup     8K  92.9G     2.03G  /
zroot/ROOT/22.1.2             8K  92.9G     1.84G  /
zroot/ROOT/22.7            6.59G  92.9G     1.98G  /
zroot/tmp                  1.14M  92.9G     1.14M  /tmp
zroot/usr                   352K  92.9G       88K  /usr
zroot/usr/home               88K  92.9G       88K  /usr/home
zroot/usr/ports              88K  92.9G       88K  /usr/ports
zroot/usr/src                88K  92.9G       88K  /usr/src
zroot/var                   252M  92.9G       88K  /var
zroot/var/audit              88K  92.9G       88K  /var/audit
zroot/var/crash              88K  92.9G       88K  /var/crash
zroot/var/log               252M  92.9G      252M  /var/log
zroot/var/mail              112K  92.9G      112K  /var/mail
zroot/var/tmp               112K  92.9G      112K  /var/tmp

Above to show you my current usage for comparison.
What you can see is that I'm using the defaults in System > Settings > Logging. That is Preserve logs days I have it empty, giving me the last 31 days (enable full help at top right). All the options there I have disabled (unticked). I just noticed, there is an option there to reset log files. Make sure you are comfortable doing so.
This is just a thought. I could be entirely wrong and the size of logs play no part in the rendering of the UI pages, but that's what I would do to investigate.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on November 08, 2022, 11:43:40 PM
I'm currently looking at around 186GB of logs. I have a feeling this might be it :). Currently working on getting them off the drive, thanks already.
I'll update once I'm done!
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on November 09, 2022, 12:15:12 AM
I deleted most of the logs, sized down the latest.log as well.
After logging out and back in, issues persisted. I deleted all logs via GUI. Issues still persisted. After a reboot it seems to now be running better, I clicked around a bit and it seems to stay responsive.
I'll update again, if anything changes.

Thanks already for all your help!
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 05, 2022, 02:28:07 PM
Sadly I have to come back to this.

After deactivating the logs it has gotten better, until just now, we experienced another web GUI crash. It's no longer responsive whatsoever, the page doesn't even load. I ran /usr/local/etc/rc.restart_webgui in hopes that would help, as all other services are running just fine. This took about 3 minutes to complete, but I'm still left with a non-responding OPNSense web server.

Update: I was finally able to access the webgui after 5+ minutes, but still everything takes ages to do, nothing loads, errors everywhere..

Any help is appreciated.
Title: Re: Need Help - Debugging slow webgui
Post by: crissi on December 05, 2022, 04:14:08 PM
Recognized the slow Gui as well. DNS Lookup to FW Hostname is working ok.
Deleted all Logs, slowness still persists.

Have still these errors in System: Log Files: Web GUI

2022-12-05T11:06:05 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-05T11:06:05 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 80455
2022-12-04T14:15:51 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-04T14:15:51 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 34252


In my Case when I connect via IP Address to the Box the GUI is reacting fast..., connecting via FW Hostname is still slow...

BR
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 05, 2022, 04:36:47 PM
For me this isn't the case. Even via IP the webgui is basically unusable.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 07, 2022, 01:11:49 PM
Update:
I found a workaround:
1. restart configd through webgui
2. disable monit
3. restart the monit service via ssh:
# configctl monit restart
4. enable monit

How I found it:
I looked at zombie processes, as I thought those might give me an idea of what could be the problem.

# ps aux | grep 'Z'
USER       PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED        TIME COMMAND
root     11085   0.0  0.0      0     0  -  Z    12:51       0:00.11 <defunct>

# ps -o ppid= -p 11085
80828

# ps aux -p 80828
USER   PID %CPU %MEM   VSZ  RSS TT  STAT STARTED    TIME COMMAND
root 80828  0.0  0.1 22572 8976  -  I    Sat00   0:09.21 /usr/local/bin/monit -c /usr/local/etc/monitrc


Now I knew it had to do with monit. I disabled it first, but that didn't help. I tried restarting through configctl:

# configctl monit restart
unable to connect to configd socket (@/var/run/configd.socket)

Therefore I knew configctl was not working properly, and I restarted configd from the webgui.

Now I was able to run the restart command from above, which led to a much more responsive webserver already. Now I just re-enabled monit, and it's running fine.

I hope it stays that way, otherwise I know at least have a workaround.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 08, 2022, 12:57:13 PM
Update:

I just updated to opnsense-22.7.9_3, and the logs supported my suspicion of configd being broken:

From the update logs:

...
Starting configd.
Bus error (core dumped)
/usr/local/etc/rc.d/configd: WARNING: failed to start configd
...


Does anyone know how to fix this?
Title: Re: Need Help - Debugging slow webgui
Post by: cookiemonster on December 08, 2022, 03:54:10 PM
You can't like this but I'm beginning to think there might be hardware problems behind apparently unrelated problems "Bus error (core dumped)"
I would personally be inclined to save the config you can. Reinstall and restore config.
To see if the problems go. I imagine they won't but at least you'll be able to switch from chasing software problems.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 09, 2022, 10:24:16 AM
I guess that might be the case. We've reinstalled multiple times by now, and as it's not stayed better, I guess hardware has the highest probability. Do you or anyone else have any insights on what the `Bus error` could stand for?
Title: Re: Need Help - Debugging slow webgui
Post by: Patrick M. Hausen on December 09, 2022, 10:32:21 AM
Log in via SSH and type

dmesg

That possibly shows something.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 09, 2022, 10:54:26 AM
dmesg shows a bunch of errors:

pid 88001 (php-cgi), jid 0, uid 0: exited on signal 10 (core dumped)
pid 42076 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
...
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (53 occurrences)
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (62 occurrences)
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (62 occurrences)
sonewconn: pcb 0xfffff800210a0900 (local:/var/run/configd.socket): Listen queue overflow: 46 already in queue awaiting acceptance (135 occurrences)
...
pid 12580 (php), jid 0, uid 0: exited on signal 11 (core dumped)
pid 50157 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 21 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 41295 (python3.9), jid 0, uid 0: exited on signal 10 (core dumped)
pid 61620 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 53937 (php), jid 0, uid 0: exited on signal 11 (core dumped)
pid 10294 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 85021 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 26932 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
...


those are the ones I'm seing the most, apart from arp changes.
Title: Re: Need Help - Debugging slow webgui
Post by: Patrick M. Hausen on December 09, 2022, 10:58:19 AM
Hints at a memory or CPU problem. Overheating can also cause these.
Title: Re: Need Help - Debugging slow webgui
Post by: ianf on December 09, 2022, 11:09:38 AM
Core temps sit between 34 and 39°C, so I guess RAM or CPU. I'll look into replacing our unit then.
Title: Re: Need Help - Debugging slow webgui
Post by: SOUK on January 21, 2023, 12:48:35 PM
Quote from: crissi on December 05, 2022, 04:14:08 PM
Recognized the slow Gui as well. DNS Lookup to FW Hostname is working ok.
Deleted all Logs, slowness still persists.

Have still these errors in System: Log Files: Web GUI

2022-12-05T11:06:05 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-05T11:06:05 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 80455
2022-12-04T14:15:51 Error lighttpd (server.c.1588) server started (lighttpd/1.4.67)
2022-12-04T14:15:51 Error lighttpd (server.c.2097) server stopped by UID = 0 PID = 34252


In my Case when I connect via IP Address to the Box the GUI is reacting fast..., connecting via FW Hostname is still slow...

BR

I have the exact same problem