I've been having a long-standing issue with monit that recurs with some regularity across several routers. The issue initially started when I upgraded to 24.1 and has persisted across all of the minor version updates I've installed. It is still occurring on 24.7.12_4.
The issue I'm experiencing is that monit hangs intermittently, roughly once a week, leaving the router in a state where it won't fully reboot unless I connect to the router and manually kill monit (using pkill -9 monit).
Right before monit hangs, it typically throws a message to the log that looks like this:
"'cputemp' program timed out after 10 m. Killing program with pid 23"
Notably, it's not always the 'cputemp' program. Monit hangs the exact same way every time, but it chokes on a different script each time, regardless of how simple or complex the script is.
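To check whether the timeouts really are spread across scripts, the log messages can be tallied per program. This is a minimal sketch that assumes every timeout line has the same shape as the one quoted above, with the program name as the first single-quoted token. The function name count_timeouts is hypothetical, and the log is read from stdin because the log location differs between setups.

```shell
#!/bin/sh
# Sketch: tally monit "program timed out" messages per program name.
# Assumes lines shaped like the one quoted in this thread:
#   'cputemp' program timed out after 10 m. Killing program with pid 23

# Reads log lines on stdin and prints "<count> <program>" for each
# program that appears in a timeout message, sorted by program name.
count_timeouts() {
    awk -F"'" '/program timed out after/ { n[$2]++ }
               END { for (p in n) print n[p], p }' | sort -k2
}
```

Feed it the monit log on stdin, for example by redirecting your log file into count_timeouts. An even spread across unrelated scripts would support the idea that the scripts themselves are not the cause.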
You can verify that monit is hung by going to the console and running ps axuHd - the output shows monit in a "wait" state while it waits for the program to die, even though the program is no longer running.
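For anyone who wants to script that check rather than read the ps output by eye, here is a rough sketch. It assumes the "wait" state described above corresponds to the wait channel that ps reports through its wchan keyword, which I have not confirmed for this hang. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: list monit processes that are sleeping on the "wait" channel.
# Assumption: the hung state shows up as wchan "wait" in ps output.

# Reads "pid wchan command" lines on stdin and prints the PID of each
# monit entry whose wait channel is exactly "wait".
find_waiting_monit() {
    awk '$2 == "wait" && $3 == "monit" { print $1 }'
}

# Feeds live ps output into the filter. Separate -o options with an
# empty header ("=") suppress the header line.
check_monit_wait() {
    ps -ax -o pid= -o wchan= -o comm= | find_waiting_monit
}
```

A single hit is not proof of a hang, since monit may legitimately be waiting on a running check for a short time. A PID that stays in this list long after the program timeout has passed is the more telling sign.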
Once monit gets into this state, any attempt to shut down the router from the UI (or even from ssh) stops some of the services on the router but never completes the shutdown, leaving the router in an inconsistent state that requires manual intervention. It even prevents the cron reboot from running properly, so I can't rely on an automatic reboot as a workaround, because the reboot never completes.
I've tried a number of things: I've disabled, replaced, and rewritten scripts. I've added debug statements to log when my scripts run, and I can confirm that the scripts start and stop successfully even when monit says they timed out. This really seems to be a monit issue, and since it persists across several routers (each with a different set of scripts), I'm kind of feeling like this is a bigger issue than the scripts I'm using.
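The start/stop logging described above can be done with a small wrapper, so the same evidence is collected for every script without editing each one. This is only a sketch of the idea, not the exact debug code used here. The LOGFILE default and the run_logged name are hypothetical.

```shell
#!/bin/sh
# Sketch: wrap a check script so each run logs a start line and an end
# line with its exit status. LOGFILE is a hypothetical default path.

LOGFILE="${LOGFILE:-/tmp/monit-wrapper.log}"

# Usage: run_logged <label> <command> [args...]
# Runs the command, appends start/end records to $LOGFILE, and returns
# the command's own exit status so monit still sees the real result.
run_logged() {
    name="$1"; shift
    printf '%s %s start pid=%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$name" "$$" >> "$LOGFILE"
    "$@"
    status=$?
    printf '%s %s end status=%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$name" "$status" >> "$LOGFILE"
    return "$status"
}
```

If the wrapper log shows an "end" line with a timestamp well before monit reports the timeout, the script finished and monit missed its exit, which is the behavior described in this thread.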
I really believe the issue may be monit itself, but if the module were broken I'd expect many more posts about it, so I'm wondering whether there are any gotchas I've missed in configuring the module, or anything that other folks had to do to make theirs more stable.
Additionally, if there's any troubleshooting or additional information that I could use to narrow this down, I'd be more than happy to take any suggestions.
I am having the same issue and can't really pinpoint the source script. Each time it is a different one (PHP, Bash, C shell), after a different length of time since reboot, and with no errors other than the lockup.
I even put a healthcheck call on each script to know when it stops, and still cannot understand why.
Any good ideas for troubleshooting?
How do you monitor and recover from this if the problem is with the tool that is supposed to do that?
thanks
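On the question of recovering when the monitor itself is what hangs, one option is a small watchdog that runs outside monit, for example from cron. The sketch below rests on assumptions that nobody in this thread has verified: that a hung monit stops answering "monit status" within PROBE_TIMEOUT seconds, and that timeout(1) is available on the router. If the probe still answers while monit is hung, swap in a different PROBE_CMD. The kill step is the same pkill -9 monit used manually above. RESTART_CMD is a placeholder, and all variable and function names are hypothetical.

```shell
#!/bin/sh
# Sketch: watchdog that probes monit and force-kills it when the probe
# fails or does not return in time. Intended to run outside monit.
# Assumptions (unverified): a hung monit fails or stalls on PROBE_CMD,
# and timeout(1) exists on the system.

PROBE_TIMEOUT="${PROBE_TIMEOUT:-30}"
PROBE_CMD="${PROBE_CMD:-monit status}"
KILL_CMD="${KILL_CMD:-pkill -9 monit}"
RESTART_CMD="${RESTART_CMD:-true}"   # placeholder: your monit start command

# Runs one probe. Prints "ok" if the probe succeeds in time; otherwise
# prints "unresponsive", runs the kill command, then the restart command.
watchdog_once() {
    # timeout(1) returns non-zero if the probe fails or exceeds the limit.
    # The command variables are left unquoted on purpose so that they
    # split into a command name and its arguments.
    if timeout "$PROBE_TIMEOUT" $PROBE_CMD >/dev/null 2>&1; then
        echo "ok"
        return 0
    fi
    echo "unresponsive"
    $KILL_CMD
    $RESTART_CMD
}
```

Note that the probe also fails when monit is simply not running, so this treats "stopped" and "hung" the same way. Because the watchdog does not depend on monit, it keeps working while monit is stuck, and killing monit early should also keep a later shutdown or cron reboot from stalling on it.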