OPNsense 24.7.b_127-amd64 - runaway process trips monit alert @ 3:06am every day

Started by saltyzip, June 22, 2024, 12:47:23 PM

OPNsense 24.7.b_127-amd64
FreeBSD 13.2-RELEASE-p11
OpenSSL 3.0.14
Intel(R) Celeron(R) J4105 CPU @ 1.50GHz (4 cores, 4 threads)

Hi. Since upgrading to the latest development version on 20th June, what appears to be a runaway process trips Monit into alerting at exactly 3:06am every day, and it eventually consumes all the resources on my router. I can't access the GUI or get in via SSH (both just sit there and time out), so I have to do a hard reset.

I have Monit running, which is what alerted me to this issue (some example alerts are included below). My inbox fills up with these alerts by the time I wake up and can take some action.

The internet appears to keep working in the background while these alerts are being sent, even though I'm locked out of the GUI. Yesterday I rebooted the router around 8am before work; the alerts stopped and all was good again. This morning I had a lie-in and the internet went off completely at 10am, so it seems the runaway process does eventually take out the router if left unchecked.

I've got nothing in cron set to run at 3:06am; the only job I have is an ACME certificate renewal that runs on the 1st of every month.
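For anyone wanting to double-check the cron side beyond the GUI, here is a minimal Python sketch that scans crontab-style lines for jobs whose minute and hour fields fall in a given window (for example 03:00 to 03:06). The function names are made up for illustration, and it makes a simplifying assumption: it only looks at the minute and hour fields and ignores day-of-month, month and weekday, so a monthly job scheduled in that window would still be listed.

```python
def field_matches(spec, value):
    """True if one cron minute/hour field ('*', '*/5', '1-6', '0,30') matches value."""
    for part in spec.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            # Wildcard: minutes and hours both start at 0, so a plain modulo works.
            if value % step == 0:
                return True
        elif "-" in rng:
            lo, hi = map(int, rng.split("-"))
            if lo <= value <= hi and (value - lo) % step == 0:
                return True
        elif int(rng) == value:
            return True
    return False


def jobs_in_window(lines, hour, minute_from, minute_to):
    """Return crontab lines that could fire at `hour` between the two minutes (inclusive).

    Assumption: only the minute and hour fields are checked; the date fields are ignored.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 6:
            continue  # environment assignment, '@daily' shorthand, or malformed line
        try:
            minute_ok = any(field_matches(fields[0], m)
                            for m in range(minute_from, minute_to + 1))
            if minute_ok and field_matches(fields[1], hour):
                hits.append(line)
        except ValueError:
            continue  # not a numeric schedule, skip it
    return hits
```

It could be fed the lines of a crontab file, e.g. `jobs_in_window(open("/etc/crontab").read().splitlines(), 3, 0, 6)`; which files actually hold the scheduled jobs on a given install is something to verify rather than assume.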

I've run the Health and Connectivity audits and no issues are shown. When I run the Security audit it reports "openvpn-2.6.10 is vulnerable".

[Alert from 21st]
Resource limit matched Service OPNsense.localdomain

        Date:        Fri, 21 Jun 2024 03:06:37
        Action:      alert
        Host:        OPNsense.localdomain
        Description: loadavg (1min) of 2.8 matches resource limit [loadavg (1min) > 2.0]

Your faithful employee,
Monit

Resource limit matched Service OPNsense.localdomain

        Date:        Fri, 21 Jun 2024 03:08:39
        Action:      alert
        Host:        OPNsense.localdomain
        Description: cpu usage of 99.5% matches resource limit [cpu usage > 75.0%]

Your faithful employee,
Monit

[Alert from 22nd]
Resource limit matched Service OPNsense.localdomain

        Date:        Sat, 22 Jun 2024 03:06:38
        Action:      alert
        Host:        OPNsense.localdomain
        Description: loadavg (1min) of 3.3 matches resource limit [loadavg (1min) > 2.0]

Your faithful employee,
Monit

I've just been looking at the Reporting health graphs and found some supporting evidence. I've attached the graphs for memory, CPU temperature, processor and states, and it looks like all capturing of stats simply stops at 3:02; the memory table below is an example. Up until that point everything looks to be ticking along nicely, and then it hits an iceberg and sinks fast. I can't attach the mbuf graph, but it is similar to all the others: a flatlined graph, with current mbuf usage showing 10k, which is nothing.

659   Sat Jun 22 2024 03:01:00 GMT+0100 (British Summer Time)   0.61510877435   5.7981159604   82.780266382   0   7.9750526226
660   Sat Jun 22 2024 03:02:00 GMT+0100 (British Summer Time)   0   0   0   0   0
661   Sat Jun 22 2024 03:03:00 GMT+0100 (British Summer Time)   0   0   0   0   0
662   Sat Jun 22 2024 03:04:00 GMT+0100 (British Summer Time)   0   0   0   0   0
663   Sat Jun 22 2024 03:05:00 GMT+0100 (British Summer Time)   0   0   0   0   0
664   Sat Jun 22 2024 03:06:00 GMT+0100 (British Summer Time)   0   0   0   0   0
665   Sat Jun 22 2024 03:07:00 GMT+0100 (British Summer Time)   0   0   0   0   0
.....
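To pin down the exact minute the stats stop without eyeballing every export, here is a minimal Python sketch that walks rows in the format above and returns the timestamp of the first row whose values are all zero. The function name is hypothetical, and it assumes the columns are separated by tabs or by runs of two or more spaces, as in the pasted table.

```python
import re


def first_flatline(rows):
    """Return the timestamp of the first row whose values are all zero, or None."""
    for line in rows:
        # Columns: row index, timestamp, then the numeric samples.
        fields = re.split(r"\s{2,}|\t", line.strip())
        if len(fields) < 3:
            continue  # blank line or the "....." continuation marker
        try:
            values = [float(v) for v in fields[2:]]
        except ValueError:
            continue  # header or otherwise non-numeric row
        if all(v == 0.0 for v in values):
            return fields[1]
    return None
```

Running it over the memory rows pasted above gives the 03:02:00 sample, which is a few minutes before the first Monit alert at 03:06.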

Any thoughts on how best to diagnose this one? I haven't got logging switched on, to save my SSD, so I can't offer any further breadcrumbs at this time.

Thanks
S.