I have noticed over the last few weeks, since 24.7 and now 25.1.1, that around 3:18 to 3:38 am every morning the Inactive memory rises by around 20%.
Free memory is at 78% after a reboot, then the next morning it drops to 52%, then the next day 40%, then 30%, at which point OOM kicks in, killing off the Caddy plugin or Unbound or both.
I have tried disabling IDS and removing most of my blocklists from Crowdsec but the behaviour is the same. With Caddy and IDS off I get a few more days before OOM starts killing things.
I have no cronjobs around this time and have been through every log file available from the GUI but cannot figure out why or what is going on at that time.
The dashboard memory widget happily says I am using 1.2/7GB RAM (I am using a 1GB MFS).
Can anyone point me in the right direction to track down why this could be happening please?
Are you sure about the "no cronjobs" statement? There are four sources for cron jobs in OPNsense:
1. /etc/cron.d/*
2. /etc/crontab
3. /var/cron/tabs/nobody
4. /var/cron/tabs/root
Only the last of which you can see in the web UI.
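For example, you can dump all four sources in one go from a root shell (just a quick sketch; errors for missing files are suppressed):

cat /etc/cron.d/* /etc/crontab /var/cron/tabs/nobody /var/cron/tabs/root 2>/dev/null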
Thanks meyergru, I checked them (attached) and nothing jumps out at me.
Just to add, I tried restarting services one by one from the dashboard and this did claw back around 10% of free mem, but only temporarily.
A reboot of the firewall and free mem went back up to near 80%.
Inactive memory still on the rise.
What could be using it up and other than rebooting the firewall, how do I get it back ?
Dashboard still says 1.2/7GB in use.
I searched for others with similar memory issues - I am not with Zen Internet and not using IPv6 so that's not it.
Is it normal to have 43 php-cgi processes running?
I have 43 running as well. But I counted using "ps auxwww | fgrep php-cgi | wc", whereas you probably only see the first page of "top". If there are more processes, then there could be stalled ones that hog memory. The inactive memory is the difference between the "SIZE" and "RES" columns. So it is either many processes building up and never stopping (i.e. hung tasks) or some process(es) that eat up memory over time.
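If you want to find offenders without paging through top, something like this should work (untested sketch; top's SIZE/RES roughly correspond to ps' vsz/rss, both in KiB) - it prints the ten processes with the largest gap:

ps -axo pid,vsz,rss,comm | awk 'NR>1 { print $2-$3, $1, $4 }' | sort -rn | head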
Yes 43 for me - counted them with ps -faxd | grep php-cgi
The only thing left I can think of doing is disabling or uninstalling CrowdSec, Caddy and Unbound DNS.
I also have a Proofpoint Emerging Threats alias blocklist that updates every 12 hours, which I could stop.
I need OpenVPN running to access the firewall from work so can't lose that too!
It would not leave much left for OPNsense firewall to do and I'd lose most of the functionality.
I installed the firewall over three years ago on an HP T730 with 8GB and have added an Intel i350 2-port NIC, and it has been fantastic until the memory issues lately.
As I suggested, you should first try to isolate if there are hung tasks (# of processes is rising) or if there are specific processes that build up in size.
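One way to do that (just a sketch) is to log the process count and the biggest SIZE-RES gaps from a second SSH session, say every ten minutes, and compare the snapshots around the 3:18-3:38 am window:

while true; do date; ps ax | wc -l; ps -axo vsz,rss,comm | awk 'NR>1 { print $1-$2, $3 }' | sort -rn | head -5; echo; sleep 600; done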
I noticed this as well today, as I've been monitoring memory usage a bit, trying to figure out why OPNsense runs out of mem every two weeks.
I also notice there are 43 php-cgi processes. And the problem occurs around this time:
root@OPNsense:~ # grep 3.\*configctl /var/cron/tabs/root
1 3 1 * * (/usr/local/sbin/configctl -d filter schedule bogons) > /dev/null
I wonder what it does?
See mem graph:
(https://media.mementomori.social/media_attachments/files/114/066/495/030/044/619/original/6bac01b8c2522608.jpg)
My guess is it just reads a lot of files, thus leaving them in memory buffers for quick access until the memory is needed for something else. Hence the jump. But why >40 php-cgi processes, is that normal?
Normally, before the box dies, something starts leaking mem and the system goes down within half an hour.
1. AFAICT, there is a config file for lighttpd that starts 20 CGI workers, which seems normal, but could be less:
#### fastcgi module
## read fastcgi.txt for more info
fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket" => "/tmp/php-fastcgi.socket",
      "max-procs" => 2,
      "bin-environment" => (
        "PHP_FCGI_CHILDREN" => "20",
        "PHP_FCGI_MAX_REQUESTS" => "100"
      ),
      "bin-path" => "/usr/local/bin/php-cgi"
    )
  )
)
Each of these workers will restart after having serviced 100 requests.
Also, I found that there seem to be ~20 of these workers that were started when the firewall was last rebooted. Maybe max-procs = 2 starts two master processes.
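A quick way to check that guess (again, just a sketch) is to list the php-cgi PIDs together with their parent PIDs; the masters spawned by lighttpd should each show up as the parent of 20 children:

ps -axo pid,ppid,vsz,rss,comm | awk '$5 == "php-cgi"' | sort -k2,2n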
Whatever, these CGI workers are most likely not the culprit, as they take up only a few KBytes each.
2. The call to "/usr/local/sbin/configctl -d filter schedule bogons" is obviously to fetch the bogon list and update the firewall alias for that. When I called that directly, I saw no apparent jump in memory usage.
FWIW, I see no such behavior here, but YMMV depending on what plugins / tools you use. So it is up to you to look for processes with a big difference in SIZE and RES numbers (or for many similar processes that make up the large numbers).
I think I might be getting somewhere.
I tried the bogons update mentioned earlier, which made no difference to inactive memory, so I decided to run the 'periodic daily' cron task from SSH.
I watched as, from a freshly booted system, the inactive memory climbed from 80M to 1200M within a few seconds and stayed there.
The prompt took a good 2 minutes to come back, then said 'eval: mail: not found'.
I cannot see anywhere in the GUI to configure mail. I think older releases had it in System/Settings/Notifications but that is not present.
I'm assuming the mail error is the cause of the inactive memory issue here?
Can anyone point me in the right direction please ?
There are lots of jobs that are done within periodic daily, namely any script that is in /etc/periodic/daily/. There is a job for ZFS scrubbing, for example. This may eat up space on a freshly booted system, but not on the second run - or does it in your case? Also, that is ARC cache, not inactive memory.
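You can see what the daily run would do and which parts are switched on by listing the scripts and the corresponding knobs (a sketch; this assumes OPNsense follows stock FreeBSD here, with defaults in /etc/defaults/periodic.conf and overrides in /etc/periodic.conf):

ls /etc/periodic/daily/
grep _enable /etc/defaults/periodic.conf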
The "mail" error is due to the fact that "periodic" output would be mailed to root if the "mail" executable was installed.
Is this periodic daily actually enabled though? If yes, it is a good candidate to drill into. Then can you post the contents, so we can try to see where the mail evaluation is made in the code?
@meyergru I remember looking at these jobs in the past, chasing a different ghost. I do not know if these are actually enabled. For instance 800.scrub-zfs: to my knowledge there is no automatic zfs pool scrub out of the box, it needs a cron job to be created. I could be well off the mark, but I remember doing this reasoning and moving on.
Indeed my ghost was found somewhere else.
Yes, "periodic daily" is enabled in /etc/crontab. It is being run at 3:01am, however, it can only be the culprit for such things if memory use jumps once per day, but there are no processes that stay around afterwards.
As I said, @goobs should look for processes whose memory footprint rises.
I have now added periodic.conf overrides so the logs are saved to /var/log/daily.log etc. instead of being mailed to "root", as there is no mail installed.
That got rid of the mail error and on inspection of the daily.log I can see no issues.
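For reference, the overrides are roughly along these lines in /etc/periodic.conf (standard periodic.conf output variables):

daily_output="/var/log/daily.log"
weekly_output="/var/log/weekly.log"
monthly_output="/var/log/monthly.log"
daily_status_security_output="/var/log/dailysecurity.log"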
I do not use ZFS and these tasks are not enabled in the daily run, other than the default 'list zfs pools'.
Again from a fresh boot I run periodic daily and Inactive memory rises from 100M to 1280M.
From another SSH shell I can see pkg appear in the top processes while the inactive memory rises. Then pkg goes away, I briefly see 'xz', and then it is over.
I am thinking to turn off all parts of the periodic daily then enable one at a time to see which element is causing the issue.
Update:
After changing anything set to "YES" to "NO" in periodic.conf, running 'periodic daily', then setting each "NO" back to "YES" one by one and repeating the test, I was able to track the memory rise down to the security section.
I don't know why the daily security check causes Inactive Mem to go from 100M to 1250M but it does.
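To narrow it down further, I could try the security section's own knobs next. With the stock FreeBSD variable names (assuming OPNsense uses the same defaults) the whole block, or just the pkg audit step, can be switched off - the audit downloads and unpacks the vulnerability database, which would also explain the pkg and xz processes I saw earlier:

daily_status_security_enable="NO"
# or, more selectively, just the pkg vulnerability audit:
security_status_pkgaudit_enable="NO"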
There were some issues reported in the security log as attached, but searching forums.freebsd.org suggested this compat_var was deprecated, so I don't know why it is still present in the system.
https://github.com/freebsd/pkg/commit/6a077c32f445bfb10bab5536910b6b7329ce43d3
Something is messed up with my system that the security check has an issue with, and for some reason it keeps raising Inactive Mem.
Am I at the stage where I wipe and start again, or can any guru shed light on the issue please?
Update 2:
I have built a test instance of OPNsense 25.1 in Hyper-V and repeated the above periodic.conf and tests.
The output of dailysecurity.log is the same, mentioning 'security_daily_compat_var: not found', so I know my system is not missing something that a rebuild would put back.
Also the inactive mem rises by 1000M after running periodic daily so that behaviour is the same. Again, no zfs and this is a vanilla install so no plug-ins etc.
Subsequent runs of periodic daily, periodic weekly and periodic monthly do not add that much to inactive mem, so perhaps this is just how it is and I need to allow for the expansion when choosing plug-ins, to avoid the OOM reaper killing off Unbound, Caddy etc.?
<3>pid 52659 (caddy), jid 0, uid 0, was killed: failed to reclaim memory
That didn't last long. Still had 4GB unused memory too.
Time for a beer or ten.