Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.

Started by arkanoid, May 15, 2022, 12:25:59 PM

You nailed it! After stopping zabbix-agentd, the wired memory consumption stopped growing and the nvlist chart went flat. No memory was released though; it's still at 611MB even after the process has been killed.

It's still a bug though... Hopefully they'll fix it kernel side somewhere in the next FreeBSD release. That was fun to troubleshoot though, not something we casually dig into every day 😊

It sure is not easy to spot this.

Not only does top not show it linked to zabbix-agentd, but nvlist is also hard to inspect.

Do you know how I can open an issue on this? Is this a FreeBSD thing or a Zabbix thing?

Hmmm, I'm not sure where I would open up the bug to be honest. This definitely looks like a FreeBSD bug. I mean, Zabbix only "uses" something provided by the OS/Kernel that "should" work properly from what we know. Unless it is Zabbix itself that leaks this in some way in their code but it seems to be kernel-related.

If I were to open up a bug, I would try to reproduce it in a simple fashion, like a command-line script or something easily reproducible, similar to what they did here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255971. Ideally I would try not to use Zabbix, to make sure it's not Zabbix itself that somehow leaks this nvlist thing (I wouldn't know how)...

And then open up a bug with FreeBSD and/or OPNSense. What they fixed in the above bug may not have made its way yet into the FreeBSD version that our OPNSense release is running. I'm not sure how to determine whether the version we have contains this bugfix, or even whether it is really relevant to the bug you found here, but it certainly looks very similar to yours. Since they discovered this bad behaviour in 13.0-Stable and that's what we're running in the latest OPNSense, maybe we don't have this bugfix yet.

That being said, I would most likely at least open up a bug with OPNSense to start with considering this ends up crashing the whole thing and you're not doing anything "funky" and you're using the released version of both the zabbix agent and OPNSense. So you're probably not alone out there with this issue... I would probably put a link to this forum's post in the ticket and fully explain what triggers this bug (running zabbix agent) and what happens (nvlist leak leading to OOM and crash) and probably also put the link to the FreeBSD bug as well asking if this may be related in some way.

If you know which metric in the zabbix monitoring triggers this leak it would be helpful, this way they could easily replicate/troubleshoot and fix it. On your side, you could also try to exclude this metric for now from the agent and monitor everything else instead of nothing at all.

Hopefully, this is already fixed by the previous bug and we just need an updated version of FreeBSD or at least a patched one.

Interestingly, I just tried this (Similar to what is in the bug ticket of FreeBSD):

pfctl -sa

And wrapping it in a small script:

echo "before: " ; vmstat -m | grep "InUse\|nvlist" ; pfctl -sa > /dev/null ; echo "after:" ; vmstat -m | grep "InUse\|nvlist"

And that seems to leak. I'm not pushing it to see how far it will go, but it does not seem to release the InUse and MemUse so far.

echo "before: " ; vmstat -m | grep "InUse\|nvlist" ; pfctl -sa > /dev/null ; echo "after:" ; vmstat -m | grep "InUse\|nvlist"
before:
         Type InUse MemUse Requests  Size(s)
       nvlist 22030  2979K 66258359  16,32,64,128,256,2048,4096,8192
after:
         Type InUse MemUse Requests  Size(s)
       nvlist 22036  2980K 66975304  16,32,64,128,256,2048,4096,8192
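
For anyone who wants to repeat the before/after check above with more iterations, here is a minimal sketch of the same idea as a reusable script. The function names (nvlist_inuse, nvlist_leak_check) are made up for illustration and are not part of any tool. It assumes the `vmstat -m` column layout shown above, with InUse as the second column of the nvlist row.

```shell
#!/bin/sh
# Hypothetical helper names; a sketch, not an official tool.

# Read `vmstat -m` output on stdin and print the InUse column
# of the "nvlist" row (prints 0 if there is no such row).
nvlist_inuse() {
    awk '$1 == "nvlist" { print $2; found = 1 } END { if (!found) print 0 }'
}

# Run a command COUNT times and report how much nvlist InUse changed.
# Usage: nvlist_leak_check COUNT command [args...]
nvlist_leak_check() {
    count=$1
    shift
    before=$(vmstat -m | nvlist_inuse)
    i=0
    while [ "$i" -lt "$count" ]; do
        # Discard the command's output; only the allocation count matters.
        "$@" > /dev/null 2>&1
        i=$((i + 1))
    done
    after=$(vmstat -m | nvlist_inuse)
    echo "nvlist InUse: before=$before after=$after delta=$((after - before))"
}
```

After sourcing the file as root, something like `nvlist_leak_check 100 pfctl -sa` would run the command 100 times and print the change. A delta that keeps growing with the iteration count would point to a leak; other kernel activity can also move the counter a little, so small differences on their own are not conclusive.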


This command also seems to increase InUse, which never goes back down, on FreeBSD 13.1 (testing the opnsense 22.7 kernel right now), so unfortunately I'm not sure the leak triggered by the zabbix agent monitoring will be fixed in a more recent kernel or OS release.

You should probably open up a bug with opnsense in case it's something they can or need to fix on their side for zabbix and/or something in regards to nvlist, if you need this working.

It's rather strange... the fixes are already in 22.1 which means there are more leaks in the ioctl code.


Cheers,
Franco

The leak seems to be created only by the "pfctl -s info" diagnostics, which are gathered by a number of things in the core system. I'm trying to see if this is an easy patch.


Cheers,
Franco

https://github.com/opnsense/src/commit/4d3b9e4a34

# echo "before: " ; vmstat -m | grep "InUse\|nvlist" ; pfctl -sa > /dev/null ; echo "after:" ; vmstat -m | grep "InUse\|nvlist"
before:
         Type InUse MemUse Requests  Size(s)
       nvlist     0     0K   200866  16,32,64,128,256,2048,4096,8192
after:
         Type InUse MemUse Requests  Size(s)
       nvlist     0     0K   320016  16,32,64,128,256,2048,4096,8192


Good enough for now? :)

For anyone willing to try:

# opnsense-update -kzr 22.1.8-nvlist2
# opnsense-shell reboot


Cheers,
Franco

After more digging, https://github.com/opnsense/src/commit/79e0c974ae85 seems to have been on the FreeBSD main branch only since April, which is a bit unfortunate. I thought it would fix all cases, but there is another leak in the code, and the fix for it was merged into main via https://reviews.freebsd.org/D35385 yesterday.

We'll be offering both patches in 22.1.9 most likely to avoid further issues with this particular memory leak as the core system regularly polls pfctl as well. Not as fast as Zabbix but still problematic.


Cheers,
Franco

Thanks a lot for all the patches and digging, Franco!

I can't test the patches myself since I'm on the test release for FreeBSD 13.1 (22.7...). Hopefully arkanoid can test it, especially with Zabbix, since it seemed to leak a lot, maybe from other places than the ones we found...

I'm very glad you could nail it down; memory leaks can be very tricky to figure out sometimes. It was another fun bug hunt :)