memory leak in 22.1?

Started by bongo, April 22, 2022, 07:25:04 AM

some time ago i reported a "swap issue" shown on the screen of my OPNsense pc when it stopped working, a few days after updating to 22.1.

in the meantime, i tracked the memory usage of OPNsense.
after a reboot, memory usage is about 40% (of 8GB installed). with each day, memory usage rises by about 10%.
so after a week, it seems to reach 100%.
in this situation, i am no longer able to reach the internet from any machine in the lan. i'm also no longer able to reach the OPNsense machine.
on the screen attached to OPNsense, i couldn't find any useful info. so i tried to log in locally, but after typing username/password, the OPNsense machine played dead.
i was also not able to reboot in a controlled way. the only way out was a power cycle.

this problem started when upgrading to 22.1. i never had such issues before.

i'm running OPNsense on pc hardware with 8GB memory and an SSD.

is there any known issue resulting in this memory leak?
how shall i proceed?
the only thing that works for me so far is to reboot every few days.

you can collect the output of
# swapinfo -h
# ps axmfv

or similar tools like top, htop etc.
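
To follow this over several days, the two commands above could be wrapped in a small script and run once a day. This is only a sketch: the function name `snapshot` and the log path are assumptions, not something OPNsense ships.

```shell
# Sketch: append one timestamped memory snapshot to a log file.
# snapshot() is a hypothetical helper; swapinfo and ps are the commands
# suggested above.
snapshot() {
    log=$1
    {
        echo "=== $(date) ==="
        # If a command is missing or fails, its error message ends up in
        # the log and the function carries on.
        swapinfo -h || echo "swapinfo failed"
        ps axmfv || echo "ps failed"
    } >> "$log" 2>&1
    return 0
}
```

For example, `snapshot /root/memlog.txt` (the path is an assumption) appends one entry per call. If /var is a RAM disk, keeping the log outside /var lets it survive a reboot.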

how to do so?
i think this is not done from the web gui?
is this done locally on the machine? is there a way to get a shell there (so far, i did everything from the web gui)?
what will i do with the info i get from these commands?

It's funny really. This week a user reported the cause of this behaviour: it stems from the removal of circular logging combined with the use of /var MFS, which slowly but steadily fills up your RAM because that's what the system is configured to do.

Moral of the story: the /var MFS option was never a good idea. We are currently discussing options.


Cheers,
Franco

You can enable ssh and get a shell, or you can attach a monitor and keyboard directly. In case of a crash the latter option might be even better ...
Via the WebGUI you can try System: Diagnostics: Activity and sort descending on the RES or SIZE column.
On the console, the ps or top command will show you the memory usage of each process too.
There you have to watch for a process with constantly increasing memory usage.
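
One way to spot such a process is to save the per-process resident size twice and compare the two lists. The following is a rough sketch; the helper name `rss_growth` and the snapshot file names are made up for illustration.

```shell
# Sketch: list processes whose resident memory (RSS, in KiB) grew between
# two snapshots. Each snapshot file holds lines of "pid rss command",
# e.g. saved with:  ps -ax -o pid= -o rss= -o comm= > snap1.txt
# rss_growth is a hypothetical helper name.
rss_growth() {
    awk '
        NR == FNR { old[$1] = $2; next }      # first file: remember RSS per PID
        ($1 in old) && ($2 + 0 > old[$1] + 0) {
            printf "%s %s +%d KiB\n", $1, $3, $2 - old[$1]
        }
    ' "$1" "$2"
}
```

Called as `rss_growth snap1.txt snap2.txt`, it prints one line per process that grew. Processes that started or exited between the two snapshots are ignored.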

thanx for the hints.
i tried to follow memory usage of the most hungry processes by taking a picture of the System: Diagnostics: Activity page once a day.
please take a look at the 2 pics attached.

the 1st one represents a total memory usage of 52%, while the 2nd one, taken today, i.e. 2 days later, uses 88% of memory.
so, based on the total memory usage, i know that OPNsense will fail to operate in about 1 day.
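
The "about 1 day" estimate follows from a linear extrapolation of the two readings. A small sketch of that arithmetic, assuming growth stays constant (the function name `days_until_full` is made up for illustration):

```shell
# Sketch: linear extrapolation of memory usage.
# Arguments: earlier reading (%), later reading (%), days between them.
# Prints the estimated days left until 100%, or "n/a" if usage is not rising.
days_until_full() {
    LC_ALL=C awk -v a="$1" -v b="$2" -v d="$3" 'BEGIN {
        if (d <= 0 || b <= a) { print "n/a"; exit }
        rate = (b - a) / d                 # percentage points per day
        printf "%.1f\n", (100 - b) / rate
    }'
}

days_until_full 52 88 2    # prints 0.7
```

With 52% and 88% taken 2 days apart, usage grows by 18 percentage points per day, so the remaining 12 points last roughly two thirds of a day.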

i'm not sure how to interpret the two values size and res. are those total memory usage (including swap) and physical memory usage?
i expected to see a rise of memory usage per process on both values for at least 1 process, but it does not really look like that. while the size value is only very slightly higher after 2 days, the res value even decreased...but total physical memory usage rose from 52% to 88% in the same time.
i might be wrong, but this looks to me like additional memory usage is not related to a process, i.e. a memory leak.
what do you think?
how to go on?

in the current situation, i always need to reboot the firewall after about 5 days to keep it up and running. this means that i cannot leave the system untouched for a week or more ;-(


But are you using /var MFS? See above.


Cheers,
Franco

Oh, sorry. Franco in fact answered the question! If you use RAM disks (aka MFS), it is the disk that is slowly eating your memory, not a process (which was my assumption).
AFAIK a bug report is already there: https://github.com/opnsense/core/issues/5727
The usage of RAM disks can be configured at System: Settings: Miscellaneous : Disk / Memory Settings

Do you have RAM disks enabled? I am not an expert on these settings, though, so please be careful and consult more documentation before changing this (https://docs.opnsense.org/manual/settingsmenu.html#miscellaneous etc...)

i've already removed all ticks under
System: Settings: Miscellaneous : Disk / Memory Settings
this morning to check out if this helps.

according to my understanding, data is now written to the SSD instead of the RAMDISK. right?
so will this now fill up my SSD instead...until full, without having a chance then to recover with just a reboot?
i'm a bit unsure if it was a good idea to remove those ticks.

This depends on the amount of traffic and logging on your device ...
I do not use RAM disks on any of my setups and have disk usage on /var/log between 500M and 18G ...
It is a good idea to have monitoring enabled for all production devices, so you get alerts before your system starves due to out-of-memory or disk-full conditions ...
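
As a minimal illustration of such monitoring, a periodic check could compare the fill level of a filesystem against a threshold. This is a sketch only: the function names and the threshold are assumptions, and a real setup would more likely rely on a proper monitoring tool.

```shell
# Sketch: warn when a filesystem is fuller than a given percentage.
# usage_pct and check_usage are hypothetical helper names.
usage_pct() {
    # df -P prints one POSIX-format line per filesystem; column 5 is "Capacity".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_usage() {
    path=$1
    limit=$2
    pct=$(usage_pct "$path")
    if [ "$pct" -gt "$limit" ]; then
        echo "WARNING: $path is ${pct}% full (limit ${limit}%)"
        return 1
    fi
    return 0
}
```

For example, `check_usage /var/log 80` would print a warning and return non-zero once the filesystem holding /var/log is more than 80% full.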

Quote:
"i've already removed all ticks under
System: Settings: Miscellaneous : Disk / Memory Settings
this morning to check out if this helps."
Just to make sure: the changes will only take effect after a reboot!

Yes, we are weighing options for https://github.com/opnsense/core/issues/5727 at the moment. My favourite approach would be to change /var MFS so that only /var/log uses MFS, with an upper cap of 50% of RAM, maybe adjustable.

For things outside /var/log we've since moved on and are not as picky about reducing write cycles anymore (configuration files go into /usr/local/etc instead of /var/etc, for example).

How does that sound?


Cheers,
Franco

i'm not a specialist, but i think having a size limit for log files would for sure be useful...and if the limit is reached, old logs are deleted.

i also think that data that is written very often and does not need to be persistent should use the ramdisk instead of the ssd, to avoid killing the ssd over time by writing too often and wearing it excessively.

Well, first, that's not what I said. Secondly, sure, but as I said, avoiding write cycles on anything other than logs seems a lot less important nowadays when measured against our frequent package updates.


Cheers,
Franco

yes, but limiting /var/log on MFS to 50% of RAM is in the end the same as limiting the max size of the log folder, just in a more dynamic way. isn't it?

and as i understand your statement, the only thing that is written frequently is the logs. so these should still be written to the ramdisk while all other data should go to the ssd. right?