Possible Bug, PHP-CGI Crash

Started by MasterXBKC, January 10, 2018, 12:51:47 AM

So at first I thought it was my code, or else a change that came with PHP 7.1, but now I'm not so sure.

I've begun seeing a lot of 503 errors where the web admin becomes unavailable and remains so until you use option 11 in the console menu to restart the services.

I've also found a way to reproduce it.

With my PFMonitor check-in agent installed on the device, it runs fine if I run it from the SSH shell. However, if any other process is using php or php-cgi at the same time as I run the script, or if the runs follow each other too quickly, it seems to crash the php-cgi background processes that the web admin uses.

To reproduce the issue, all I have to do is run my PHP script in rapid succession from SSH using either:
php pfmonitor.checkinopn.php
or
php-cgi pfmonitor.checkinopn.php

Press Up+Enter a few times and the web interface dies, and the php-cgi background processes all disappear from ps aux.
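A rough way to script the Up+Enter reproduction, as a sketch only: it runs a command several times back to back and then checks whether any php-cgi process is left. The script name and the php-cgi binary come from the post; the helper names and the use of pgrep -x are my own assumptions, not part of the PFMonitor agent.

```python
import shutil
import subprocess
from typing import List, Sequence


def run_back_to_back(argv: Sequence[str], times: int) -> List[int]:
    """Run argv `times` times in a row with no pause, like pressing
    Up+Enter repeatedly in the shell, and return each exit code."""
    return [subprocess.run(list(argv)).returncode for _ in range(times)]


def process_alive(name: str) -> bool:
    """Return True if a process with exactly this name exists (via pgrep -x)."""
    if shutil.which("pgrep") is None:
        raise RuntimeError("pgrep not found on PATH")
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0
```

For example, run_back_to_back(["php-cgi", "pfmonitor.checkinopn.php"], 7) followed by process_alive("php-cgi") would show whether the background php-cgi processes survived the burst.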

At this point all my script does is read some files and post their contents to an external URL using PHP cURL; I had commented out all the other functions.

Running it once works fine. Run it, then immediately again a few times, or while OPNsense itself or the web interface is also doing something, and bang, it crashes the php-cgi processes.

Like I said, I thought it was my code at first, but now I don't think so.
Member of FBIs Infragard Program
Certified Information Systems Security Officer
Certified Vulnerability Assessor
PFMonitor Remote Management, Backup, & Live Monitoring for PFSense and OPNSense
OPNSense Units: R720XD XL, R720XD XL, R720XD, R720XD, R710, DL360G7, QNAP

Running too many scripts at once may starve the lighttpd child process pool. I've never heard of this in a user report, though, but uncontrollably spawning lots of processes can do a lot of things to a system, and I hope we can avoid doing so.

Then again, none of this should matter for a CLI script that should not meddle with CGI at all?

Grasping at straws here, forgive me. :)


Cheers,
Franco
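To make the pool-starvation idea concrete, here is a toy model, not lighttpd's actual implementation: a fixed number of backend slots, where each overlapping job holds a slot until it finishes, and a job that arrives while every slot is taken cannot be served. The pool size and the class and function names are hypothetical.

```python
import threading
from typing import List


class WorkerPool:
    """Toy model of a fixed pool of backend children (for example php-cgi
    processes behind a web server). Each in-flight job holds one slot."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)

    def try_start(self) -> bool:
        # Non-blocking: False means every child is already busy.
        return self._slots.acquire(blocking=False)

    def finish(self) -> None:
        # Hand the slot back once the job is done.
        self._slots.release()


def overlapping_jobs(pool_size: int, jobs: int) -> List[bool]:
    """Start `jobs` jobs that all overlap (none finishes in between) and
    report, per job, whether it found a free slot."""
    pool = WorkerPool(pool_size)
    return [pool.try_start() for _ in range(jobs)]
```

With four slots and six overlapping jobs, overlapping_jobs(4, 6) returns four True values followed by two False values; once finish() is called, a slot opens up again.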

I've seen it too. A restart of services brings it back, but as yet I can't reproduce it at will. Everything else continues to work; just the GUI has toddled off.
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

January 10, 2018, 06:45:29 PM #3 Last Edit: January 10, 2018, 06:49:06 PM by MasterXBKC
I can recreate it quite easily.

If I run my script with php, it only happens rarely when I'm running it manually at the shell, but if I run it 6-7 times rapidly from the shell using php-cgi, it happens almost every time.

The problem is that my script runs once every 60 seconds just to poll data, and even that seems to hose it after 5 to 30 minutes. I think this started only with the jump to 17.7.11; it could have been in the prior build too, but I skipped that one on most devices.


I did find a possible reason for this, but I'm not sure: I found old documentation saying that if lighty ends up waiting for a response from PHP and doesn't get one back quickly enough, it can just hang. It expects either a response saying PHP is busy or the finished result, and in the absence of either, it goes out to lunch indefinitely.
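The hang described above comes down to waiting on a backend with no upper bound. As a generic illustration only (this is not how lighttpd talks to php-cgi), the sketch below waits for a child process for at most a fixed time and gives up cleanly instead of blocking forever. The function name and the timeout values are assumptions.

```python
import subprocess
from typing import Optional, Sequence


def call_backend(argv: Sequence[str], timeout_s: float) -> Optional[bytes]:
    """Wait at most timeout_s seconds for the child to finish.

    Returns its stdout, or None if it did not answer in time. On timeout,
    subprocess.run kills and reaps the child before re-raising, so nothing
    is left hanging.
    """
    try:
        done = subprocess.run(list(argv), stdout=subprocess.PIPE, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

The caller can then decide what to do with a None (retry, report "busy", log it) rather than waiting on a response that never comes.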

I believe pfSense encountered this issue as well with their nginx/php-fpm setup. I'm not sure how they mitigated it, or if they did at all, but it doesn't seem to happen anymore.

Hmm...

Well, having come from the dark side: there were a lot of issues with a 502 error that used to occur when pfBlockerNG was installed, and one of my testers could trigger it without pfBlocker too. He came up with a fix... let me see if I can find it; it might be related.


OK, found it, but Franco would need to take a look and see if it applies to OPNsense. I'm finding my way around, but it's a big system. :)

@franco

Take a look at this and see if it can be applied. Chris, one of my testers, did this for pf; it fixed a lot of issues and certainly fixed his.

https://github.com/marjohn56/pfsense/blob/2c131b10b25db593331048d4f2b28fbf9bf5662e/src/etc/rc.php_ini_setup

@Franco

If you want access to a bunch of machines suffering from the issue, just let me know... ;-)

BTW: I'm the one who triggered the whole topic, as I've just started using pfMon with some of my machines.

Greetz
Mircsicz

This might or might not be related, too: when I recently installed a new OPNsense from ISO on VMware, I saw that during bootup it said:
Less than 512mb of ram detected, not enabling opcache,

But this machine had 4 or 8 GB of RAM....
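For reference, a threshold check like the one behind that boot message could look roughly like this. It is a hypothetical sketch, not OPNsense's actual startup code: it reads physical memory through os.sysconf (whose SC_PHYS_PAGES name may not exist on every platform) and keeps every quantity in bytes. Keeping units explicit guards against one possible class of mistake, comparing a megabyte figure against a byte threshold; whether that has anything to do with the misdetection on this VM is unknown.

```python
import os

MIB = 1024 * 1024
OPCACHE_MIN_BYTES = 512 * MIB  # the 512 MB threshold quoted in the boot message


def physical_memory_bytes() -> int:
    """Total physical memory in bytes: sysconf page count * page size."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def should_enable_opcache(mem_bytes: int, min_bytes: int = OPCACHE_MIN_BYTES) -> bool:
    """Both arguments are in bytes; passing a megabyte count here is a bug."""
    return mem_bytes >= min_bytes
```

A 4 GB machine passes should_enable_opcache(4 * 1024 * MIB), while should_enable_opcache(4096), i.e. "4096" meant as megabytes but read as bytes, would wrongly report too little memory.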

Less is more. I think if we wrap that in configd to serialise it shouldn't happen...
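The suggestion here is to let configd run the job so that invocations are serialised. Configd itself is not shown; as a stand-in that illustrates the same idea, the sketch below takes an exclusive flock so a second run waits until the first has finished instead of overlapping with it. The lock path and do_checkin() are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

# Hypothetical lock location for the check-in job.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "pfmonitor-checkin.lock")


@contextmanager
def serialized(lock_path: str = LOCK_PATH) -> Iterator[None]:
    """Hold an exclusive flock on lock_path while the with-block runs.

    A second caller (another process, or another open of the same file)
    blocks in flock() until the first releases it, so runs happen one
    after another instead of overlapping.
    """
    with open(lock_path, "a") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def do_checkin() -> None:
    """Hypothetical stand-in for reading the files and posting them."""
    print("checking in")
```

A cron-driven entry point would then wrap the work as: with serialized(): do_checkin().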

Quote from: franco on January 10, 2018, 11:20:40 PM
Less is more. I think if we wrap that in configd to serialise it shouldn't happen...

Is this something to expect in the next release?


Quote from: franco on January 15, 2018, 03:52:19 PM
It's something we need to change in your plugin.

https://docs.opnsense.org/development/backend/configd.html


Cheers,
Franco

This was occurring even when just running it from the shell, but if you think that will fix it....

What's the purpose of running it repeatedly, other than triggering the bug? It simply needs a funnel so it doesn't waste time and system resources. A service can't be started repeatedly if it isn't built to interleave the work chunks it computes. :)


Cheers,
Franco
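The "funnel" can also be approximated with a non-blocking lock: if a previous run is still busy, the new one is dropped instead of piling up. This is a sketch under assumptions (hypothetical lock path and function name), not the change that was actually made to the plugin.

```python
import fcntl
import os
import tempfile
from typing import Callable

# Hypothetical lock location; any path all runs agree on will do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "pfmonitor-checkin.lock")


def run_if_idle(job: Callable[[], None], lock_path: str = LOCK_PATH) -> bool:
    """Run job() unless another run currently holds the lock.

    Returns True if job ran, False if it was skipped because a previous
    run is still busy. Skipped runs are simply dropped, not queued.
    """
    with open(lock_path, "a") as fh:
        try:
            fcntl.flock(fh.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        try:
            job()
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
    return True
```

The design choice versus a blocking lock: a cron job that fires every minute would otherwise queue up behind a slow run, whereas here a late tick just skips its turn.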

It is meant to be a cron job: it periodically sends the data in and retrieves any user commands to run.
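Since the agent is meant to run every 60 seconds, one alternative to firing possibly overlapping cron jobs is a single long-lived loop that only starts the next check-in after the previous one has returned, skipping any ticks it missed. This is a hypothetical sketch, not what PFMonitor or configd does; the 60-second interval is taken from the thread and the function names are made up for illustration.

```python
import math
import time
from typing import Callable

INTERVAL_S = 60.0  # check-in cadence mentioned in the thread


def next_tick(start: float, now: float, interval: float) -> float:
    """Earliest start + k*interval (k >= 1) strictly after `now`.

    If a run overran one or more ticks, those ticks are skipped rather
    than replayed back to back.
    """
    k = max(math.floor((now - start) / interval) + 1, 1)
    return start + k * interval


def run_periodically(job: Callable[[], None], iterations: int,
                     interval: float = INTERVAL_S) -> None:
    """Call job() `iterations` times from a single loop. The next call only
    starts after the previous one has returned, so runs never overlap."""
    start = time.monotonic()
    for _ in range(iterations):
        job()
        delay = next_tick(start, time.monotonic(), interval) - time.monotonic()
        time.sleep(max(0.0, delay))
```

For example, if a check-in started at t=0 takes 130 seconds, next_tick(0, 130, 60) gives 180, so the ticks at 60 and 120 are dropped instead of running back to back.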

Hi Guys,

What's the status of this issue? I'd like to move on with it... ;-)