WebGUI inaccessible for 3-4 minutes --> when modifying "Interfaces"

Started by AES777GCM, November 13, 2024, 07:20:37 PM

Okay - so in your case - the browser has an impact and everything else worked fine for you. That's weird but okay.

As I mentioned in the thread, both systems I tried (with 24.7.x) behave the same: after applying any change in the INTERFACES section you can see that the save is done, but instead of coming back after doing its thing, the WebGUI server hangs for 3-4 minutes.

While the lighttpd task is "locked" like this, you can't even reset it. I then have two choices: wait 3-4 minutes, after which the task is somehow killed automatically and comes back, or kill the stalled lighttpd manually.

The browser could not have been a factor for me this time.

"locked" means your lighttpd is in the process of already being reconfigured. No idea why it's taking so long. You could add

log_msg("my message here");

commands to https://github.com/opnsense/core/blob/master/src/etc/inc/plugins.inc.d/webgui.inc#L137-L174

My guess is lighttpd doesn't stop or start and hangs the PHP script which keeps the lock over the lifetime of the PHP script. I will not speculate over where it may hang. You'll find it if you want to and can report back.


Cheers,
Franco

Hi Franco,

I have no clue how to insert this line of code because I don't see any /src directory when using the shell.
For now I help myself by killing lighttpd and reloading all services.
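
A note on the /src question: the src/ tree of the core repository is, as far as I can tell, installed under /usr/local on a running system, so the linked file would be /usr/local/etc/inc/plugins.inc.d/webgui.inc. Treat that mapping as an assumption and confirm with ls before editing anything. A minimal sh sketch of the translation (the function name is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: translate a path from the opnsense/core repository
# into the location where it is ASSUMED to be installed (src/ -> /usr/local/).
repo_to_installed() {
    case "$1" in
        src/*) printf '/usr/local/%s\n' "${1#src/}" ;;
        *)     printf '%s\n' "$1" ;;   # not under src/: leave unchanged
    esac
}

# Path taken from the GitHub link above
repo_to_installed "src/etc/inc/plugins.inc.d/webgui.inc"
```

If the file exists at the printed path, the log_msg() lines would go into that file. Keep a copy of the original first.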

For others who may have the same issue and need a step-by-step manual:

1. Open an SSH session
2. Enter "su" as the first command and authenticate with the root password
3. Choose menu item 8) Shell
  0) Logout                              7) Ping host
  1) Assign interfaces                   8) Shell
  2) Set interface IP address            9) pfTop
  3) Reset the root password            10) Firewall log
  4) Reset to factory defaults          11) Reload all services
  5) Power off system                   12) Update from console
  6) Reboot system                      13) Restore a backup

Enter an option: 8

root@obelix:/home/udo #


Now find out how many lighttpd tasks your system has:
root@obelix:/home/udo # ps aux | grep lighttpd
root    17036   0.0  0.1  22784  10068  -  S    18:05     0:00.35 /usr/local/sbin/lighttpd -f /usr/local/etc/lighttpd_w
root    31424   0.0  0.0  14448   4084  -  S    18:04     0:00.02 /usr/local/sbin/lighttpd -f /var/etc/lighttpd-acme-ch
root    48669   0.0  0.0  12716   2396  0  S+   18:18     0:00.00 grep lighttpd
root@obelix:/home/udo #


In this example I have 2 running lighttpd processes. When the system stalls, it is NOT the acme service, so look at the PID of the other one --> 17036

So when the system stalls after "Applying", just kill this "locked" task and everything is fine again.
root@obelix:/home/udo # kill 17036
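
Picking the PID by eye can also be scripted. This is only a sketch: it assumes the ACME instance can be recognised by "lighttpd-acme" in its config path, as in the listing above, and the function name is made up. It is written for sh, so if your login shell is csh-like, save it as a script and run it with sh.

```shell
#!/bin/sh
# Read `ps aux` style lines on stdin and print the PID (2nd column) of every
# lighttpd process that is not the ACME challenge instance.
# Assumption: the ACME instance has "lighttpd-acme" in its command line.
webgui_lighttpd_pids() {
    awk '/\/sbin\/lighttpd -f/ && !/lighttpd-acme/ { print $2 }'
}

# Usage on the firewall (as root); "ww" keeps ps from truncating the command:
#   ps auxww | webgui_lighttpd_pids
```

Look at the printed PID before passing it to kill; if nothing is printed, there is nothing to kill.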


The variant for lazy guys is more of a HAMMER method, but just as effective.
I know from ps aux that I have 2 lighttpd tasks. So what if we kill all lighttpd tasks and reload all services again?

root@obelix:/home/udo # ps aux | grep lighttpd
root    95866   2.8  0.1  20220   9564  -  S    18:24     0:00.02 /usr/local/sbin/lighttpd -f /usr/local/etc/lighttpd_w
root    26470   1.3  0.0  14448   4088  -  S    18:24     0:00.00 /usr/local/sbin/lighttpd -f /var/etc/lighttpd-acme-ch
root    12748   0.0  0.0  12716   2392  0  S+   18:24     0:00.00 grep lighttpd
root@obelix:/home/udo # pkill lighttpd -f
root@obelix:/home/udo # pkill lighttpd -f
root@obelix:/home/udo # configctl service reload all
OK
root@obelix:/home/udo #


Sorry, but I am not a professional coder; getting things to work "somehow" is my intention.

Nevertheless, thanks to all the folks working directly and indirectly on the codebase and keeping IT a little safer for all of us.

cheers,
Udo

I've been experiencing weird issues, particularly when setting up interfaces. I thought my whole setup was being bricked as I was not waiting long enough. I wonder if this bug is what I'm experiencing. I made a separate thread already but just thought I'd post to say I may also be experiencing this. Next time it happens I will give it time to see if it fixes itself. I will have a continuous ping, then do a change to interfaces, then lose everything and the ping also fails. Can't access web interface from any interfaces.

At one point I also did a reload service within the local console, and during that process I got one ping in then it failed again. So that did something, sort of.

I had one issue where one interface was handing out the wrong DHCP range, but then it fixed itself overnight. So it seems to me there may be some issues where things don't happen instantly when you make changes and there's something happening in the background.

I don't want to hijack this thread though and I already made my own, but starting to wonder if it's related and just thought I'd mention it.
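
For anyone who wants to report how long the outage really lasts, the continuous-ping idea can be turned into a number. The sketch below is only one way to do it: it assumes you log one line per probe in the form "epoch-seconds up|down", and the function name and log format are made up for illustration.

```shell
#!/bin/sh
# Read lines of the form "<epoch-seconds> up|down" on stdin and print the
# longest continuous "down" period in seconds (0 if there was none).
longest_outage() {
    awk '
        $2 == "down" && !in_out { in_out = 1; start = $1 }
        $2 == "up"   &&  in_out { in_out = 0; if ($1 - start > max) max = $1 - start }
        { last = $1 }
        END {
            # still down at the end of the log: count up to the last sample
            if (in_out && last - start > max) max = last - start
            print max + 0
        }
    '
}

# Example probe loop (192.0.2.1 is a placeholder address, stop with Ctrl-C):
#   while :; do
#       if ping -c 1 192.0.2.1 >/dev/null 2>&1; then s=up; else s=down; fi
#       echo "$(date +%s) $s"; sleep 1
#   done > probe.log
#   longest_outage < probe.log
```

Start the loop, apply the interface change, and post the result together with the OPNsense version.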

This thread reminds me of race conditions in the PHP and shell scripts that I pointed out in https://github.com/opnsense/core/issues/6351 (make sure to show comments which might be hidden by github due to the long conversation thread)

I am a professional programmer (lighttpd developer) and made suggestions on how to eliminate the race conditions I identified, and some -- but not all -- changes were implemented. It might be useful if someone else with scripting and process management experience would review that code again and provide feedback to Franco, as my posts in https://github.com/opnsense/core/issues/6351 were not enough.

Glenn, with all due respect, your preachy behaviour is getting a bit old on subjects like the lighttpd start race or OCSP.

I did address your concerns in the meantime and more regarding it but now it's gotten worse and your attitude needs a shift now that a proper lock is in place. We're still running into lockups... not that I am overly surprised to hear, but I also cannot reproduce. Can you?


Cheers,
Franco

The locking code. To be frank I cannot see anything other than lighttpd locking up as witnessed by the user having to kill it. :)

https://github.com/opnsense/core/commit/4182f1993

Thanks to Glenn and Franco for talking about the problem. Only by accepting that a problem exists can there be hope of fixing it.

@Franco: If you want to reproduce it, I'd be happy to send you my Fujitsu (you can keep it afterwards) so you can see I'm not telling a fairy tale. The fact that I get this issue on 2 pieces of hardware (and am not alone in the world) tells one simple truth: there must be a problem somewhere in the code.

If Glenn is able to help - pls let him help.


People affected can try this patch:

# opnsense-patch https://github.com/opnsense/core/commit/988dbae92

But this is just speculation from not being able to reproduce it here. The client environment seems to affect the current behaviour and I don't understand the trigger condition.


Cheers,
Franco

Seems to help in my case. Haven't tested it extensively though.

In my case the patch is a big step forward.
Just to tell everybody what this patch actually is:

In /usr/local/www/interfaces.php ...

--> line 586 should be commented out
/*plugins_configure('early');*/
--> line 587 should be inserted
configd_run('webgui restart 3', true);
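
To check whether a given interfaces.php already carries this change, a quick grep is enough. This is a minimal sketch: the search string is copied from the post above and has not been checked against the actual commit, the default path is an assumption, and the function name is made up.

```shell
#!/bin/sh
# Succeed (exit 0) if the file contains the webgui restart call quoted in
# the post above. The default path is an assumption; pass another file as $1.
has_webgui_restart() {
    grep -qF "configd_run('webgui restart 3', true);" "${1:-/usr/local/www/interfaces.php}"
}

# Usage:
#   has_webgui_restart && echo "patched" || echo "not patched"
```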

After a reboot I had no more issues after changing anything in the "INTERFACES" menu section and pressing "APPLY": the WebGUI came back as expected.

Just once, "Widgets failed to load" came up in the dashboard, but after rebooting and trying a few more odd things in the GUI everything looked okay.

I recommend including this patched version in future versions of OPNsense.

I want to say a big "thank you" to Franco for his engagement. I will also check whether my "ESET INTERNET SECURITY" might have an impact due to its HTTP/HTTPS traffic inspection misbehaving.

I'll test over the next few days and keep you informed.

Cheers,
Udo



Quote from: AES777GCM on December 10, 2024, 07:11:27 PM
In /usr/local/www/interfaces.php ...

--> line 586 should be commented out
/*plugins_configure('early');*/
--> line 587 should be inserted
configd_run('webgui restart 3', true);

Just to be sure this isn't missed... running the command line as below will actually patch the file, no need for commenting out or typo-related mishaps :)

# opnsense-patch https://github.com/opnsense/core/commit/988dbae92

Let's see if this holds up... in any case this is 24.7.11 material so we don't need to wait too long for it.


Cheers,
Franco

Yesterday I kicked "uBlock Origin" out of my Firefox extensions after reading about an issue another user had with uBlock Origin Lite.
Today I updated to the latest version, 24.7.11, and everything is working fine now, at least as far as I can tell after some minutes of extensive testing.

Thanks to Franco and team: you did a great job, and for me you delivered an early Christmas surprise.

@all: Please try the new version and post your experience below.

Greetings,
Udo

Hi Udo,

Thanks for the feedback. Happy to hear it's better now.


Cheers,
Franco