OPNsense Forum

Archive => 22.1 Legacy Series => Topic started by: arkanoid on May 15, 2022, 12:25:59 PM

Title: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:25:59 PM
As the title says, I have to reboot my firewall every ~7 days: the OOM Killer kicks in and kills all processes, leaving only the kernel and nothing else (ssh and the web interface are killed too)

OPNsense 22.1.2_1-amd64
FreeBSD 13.0-STABLE
OpenSSL 1.1.1m 14 Dec 2021

Please find attached the results of these commands, executed at an uptime of 30h

df -h

vmstat -z | tail +3 | awk -F '[:,] *' 'BEGIN { total=0; cache=0; used=0 } {u = $2 * $4; c = $2 * $5; t = u + c; cache += c; used += u; total += t; name=$1; gsub(" ", "_", name); print t, name, u, c} END { print total, "TOTAL", used, cache } ' | sort -n | perl -a -p -e 'while (($j, $_) = each(@F)) { 1 while s/^(-?\d+)(\d{3})/$1,$2/; print $_, " "} print "\n"' | column -t

vmstat -o | sort -nr | head -n 3000

vmstat -m | sort -rk3


more attachments in the next posts
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:35:50 PM
vmstat -o | sort -nr | head -n 3000
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:36:15 PM
vmstat -z | tail +3 | awk -F '[:,] *' 'BEGIN { total=0; cache=0; used=0 } {u = $2 * $4; c = $2 * $5; t = u + c; cache += c; used += u; total += t; name=$1; gsub(" ", "_", name); print t, name, u, c} END { print total, "TOTAL", used, cache } ' | sort -n | perl -a -p -e 'while (($j, $_) = each(@F)) { 1 while s/^(-?\d+)(\d{3})/$1,$2/; print $_, " "} print "\n"' | column -t
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:36:40 PM
df -h
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:37:53 PM
Some screenshots over time of

top -o size


first one is just after boot
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:40:02 PM
more top
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:40:20 PM
more top
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 15, 2022, 12:40:38 PM
more top
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on May 16, 2022, 08:34:23 AM
So let's start with something directly relevant... you have /var MFS enabled, yes?


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 16, 2022, 11:48:38 AM
I already posted the result of `df -h`.

As a second confirmation, please find attached a screenshot of the webgui.

Here's the output of `mount`:

/dev/gpt/rootfs on / (ufs, local, noatime, soft-updates)
devfs on /dev (devfs)
devfs on /var/dhcpd/dev (devfs)
devfs on /var/unbound/dev (devfs)
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 16, 2022, 11:50:51 AM
here's the memory load after 53h uptime

last pid: 63460;  load averages:  0.28,  0.39,  0.44  up 2+04:54:04    11:49:44
53 processes:  1 running, 52 sleeping
CPU:  1.1% user,  0.0% nice, 16.3% system,  0.0% interrupt, 82.6% idle
Mem: 39M Active, 989M Inact, 458M Wired, 249M Buf, 2465M Free
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on May 16, 2022, 12:16:17 PM
Sorry, I did not look at the attached files, I only searched with the browser's find function.

At least it's not /var/log overflowing main memory if /var MFS is off.

What's your configuration? Which heavy tasks do you have enabled? OpenVPN, IPsec, Intrusion Detection (with or without IPS?), Unbound blocklists, Web Proxy running, additional plugins...


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 16, 2022, 01:19:03 PM
The firewall is used just as a WireGuard VPN concentrator, so:

OpenVPN: No

IPsec: No

Intrusion Detection (with or without IPS?): No and no

Unbound blocklists: No

Web Proxy running: No

additional plugins...

os-dyndns (installed)   1.27_3   179KiB   OPNsense   Dynamic DNS Support   
os-iperf (installed)   1.0_1   24.6KiB   OPNsense   Connection speed tester   
os-vmware (installed)   1.5_1   610B   OPNsense   VMware tools   
os-wireguard (installed)   1.10   47.1KiB   OPNsense   WireGuard VPN service   
os-zabbix-agent (installed)   1.11   50.1KiB   OPNsense   Zabbix monitoring agent

only os-wireguard is truly essential; in order from most to least important:
wireguard
zabbix-agent
dyndns
iperf3
vmware


top after ~54h
last pid: 35841;  load averages:  0.31,  0.47,  0.41                                                                                                                                       up 2+06:21:38  13:17:18
54 processes:  1 running, 53 sleeping
CPU:  0.0% user,  0.0% nice, 15.8% system,  0.0% interrupt, 84.2% idle
Mem: 80M Active, 998M Inact, 466M Wired, 256M Buf, 2406M Free
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on May 16, 2022, 02:17:29 PM
wireguard-kmod installed?


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 16, 2022, 02:20:19 PM
Yes, wireguard-kmod 0.0.20211105 is installed. But I find it hard to believe it is the cause of the problem.

I've been facing this problem since early 2021.

The latest change I made to the system was switching from the userland wireguard-go implementation to the kernel driver, at the beginning of March.

While this greatly improved CPU usage, it didn't solve the problem: I had OOM kills before and I still have them.

Please find attached a zabbix graph before and after switching the wireguard implementation.

G.C.


steady rise continues
last pid: 42872;  load averages:  0.59,  0.46,  0.45 up 2+07:23:49  14:19:29
54 processes:  1 running, 53 sleeping
CPU:  0.2% user,  0.0% nice, 10.1% system,  0.0% interrupt, 89.7% idle
Mem: 49M Active, 992M Inact, 476M Wired, 263M Buf, 2432M Free
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 16, 2022, 02:24:33 PM
output of

pkg info
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 16, 2022, 10:51:48 PM
Edit
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 17, 2022, 01:16:25 PM
top -o size after 3 days 6 hours
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: Patrick M. Hausen on May 17, 2022, 02:18:38 PM
Could you please write the text directly into your postings instead of attachments and screenshots? That would be way more convenient for everybody trying to read what you write.

There are code tags for that:

This is pasted code
or command output.


Thanks,
Patrick
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 17, 2022, 02:47:55 PM
Roger that. I was just trying to collect data manually even while on the go, so I used screenshots. I'll stick to text data from now on.

Now I've wrapped up a shell script to collect

vmstat -m

over time; hopefully I'll get some good data for further analysis. Please redirect me if there's a better way to debug this.
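A minimal sketch of such a collector (the log path, the per-minute cron interval, and the epoch-timestamp prefix are my own choices, not anything settled in this thread):

```shell
#!/bin/sh
# Sketch: append a timestamped `vmstat -m` snapshot to a log file, so the
# per-type MemUse growth can be reconstructed later. Meant to run from cron.
LOG=${LOG:-/tmp/vmstat_m.log}

snapshot() {
  # prefix every data line (skipping the header) with an epoch timestamp;
  # VMSTAT can be overridden for testing on a machine without vmstat -m
  ${VMSTAT:-vmstat -m} | awk -v ts="$(date +%s)" 'NR > 1 { print ts, $0 }'
}

snapshot >> "$LOG"
```

Run it from a crontab entry like `* * * * * /root/collect_vmstat.sh` (path hypothetical) and the history accumulates one snapshot per minute.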

In the meantime, here's latest top -o size

last pid: 36860;  load averages:  0.44,  0.42,  0.46  up 3+07:50:54  14:46:34
53 processes:  1 running, 52 sleeping
CPU:  0.2% user,  0.0% nice, 13.1% system,  0.0% interrupt, 86.7% idle
Mem: 51M Active, 1041M Inact, 547M Wired, 331M Buf, 2314M Free
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: Patrick M. Hausen on May 17, 2022, 03:04:12 PM
Just a guess: is your OPNsense UI accessible from the Internet? Possibly someone is probing the UI for bugs causing php-cgi processes with high memory consumption to pile up?
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 17, 2022, 07:22:28 PM
Nope, the only port reachable from the Internet is the VPN one, and tcpdump confirms there's no unexpected traffic.

I'd expect peaks in memory, but all I see is a continuous slow leak:
last pid: 70854;  load averages:  0.29,  0.46,  0.48    up 3+12:26:21  19:22:01
57 processes:  3 running, 54 sleeping
CPU: 13.4% user,  0.0% nice, 33.3% system,  0.0% interrupt, 53.3% idle
Mem: 65M Active, 1066M Inact, 558M Wired, 340M Buf, 2265M Free
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 17, 2022, 11:40:54 PM
Being curious about your issue... Can you post the output of

vmstat -z | tr ':' ',' | sort -nk4 -t','

We see that your wired memory is increasing slowly over time; does it stabilize at some point, or does it really end up consuming both free and inactive? Mine is stable around 800M; it varies between 8 and 12% of total memory (8G in my case, so 800M is around 10%). Do you know how big the numbers were right before an OOM (including free and inactive)?

We also see your free being converted into inactive memory. This does not seem abnormal at first glance... This inactive memory can be reused if need be.

You should not be running into an OOM with those numbers, unless there is something else eating up the memory, or allocating something very rapidly right before the OOM, that we do not see here.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 18, 2022, 12:11:13 AM
Also, there is indeed a shift of memory from free to inactive that was less present in pre-22 versions, but I never ran into an OOM because of it; then again, I have 8GB of RAM, not 4GB, so I'm likely less prone to OOM.

What does your Reporting/Health look like (if you have it)?

Attached is mine (System/Memory from Reporting/Health) for the last 77 days (inverse turned on, resolution high); each peak is usually a reboot or an upgrade/reboot.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 18, 2022, 12:33:39 AM
Quote from: RedVortex on May 17, 2022, 11:40:54 PM
Being curious about your issue... Can you post the output of

vmstat -z | tr ':' ',' | sort -nk4 -t','

Thanks for stopping on this issue. Here's the output: https://termbin.com/0uzj

Quote from: RedVortex
We see that your wired memory is increasing slowly over time; does it stabilize at some point, or does it really end up consuming both free and inactive? Mine is stable around 800M; it varies between 8 and 12% of total memory (8G in my case, so 800M is around 10%). Do you know how big the numbers were right before an OOM (including free and inactive)?

It doesn't stabilize over time; it keeps growing until the OOM killer kills all processes, leaving only the kernel alive. Before switching to the WireGuard kmod it was killing wireguard-go too, and with it all VPN connections; now it leaves the VPN alive but no other services available.

I'm not sure about the numbers right before the OOM. My external monitor (Zabbix) records "available memory", defined as free+cached+buffers, and the last OOM happened when that value was 1.5GB, so it doesn't really explain anything.

Quote from: RedVortex
We also see your free being converted into inactive memory. This does not seem abnormal at first glance... This inactive memory can be reused if need be.

This can be explained by my manually starting and killing an iperf3 server in a tmux terminal. It consumes a lot of memory and releases it afterwards. This is not the cause of the OOM, as I am aware of this behaviour and the problem happens even at night, when no iperf3 is running and the admin is sleeping  :'(
EDIT: can't really be sure of this; I just retried and saw no massive spike in memory usage.

Quote from: RedVortex
You should not be running into an OOM with those numbers, unless there is something else eating up the memory, or allocating something very rapidly right before the OOM, that we do not see here.

There's still the possibility that the problem is caused not by a steady rise but by a massive spike in memory usage, but so far the steady rise of wired memory seems to be the only player on the field.


I've been collecting the output of `vmstat -m` by the minute over the last hours. Please find attached the resulting plot of the "MemUse" column, with the constant features filtered out.

I can share the python code that generates this if required.
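For reference, the filtering step boils down to something like this in shell (assuming each logged `vmstat -m` line carries an epoch-timestamp prefix; that log format is my own assumption):

```shell
# Sketch: pull a (timestamp, MemUse-in-K) series for one vmstat -m type
# out of a log of timestamped snapshots. Assumed log line format:
#   <epoch> <Type> <InUse> <MemUse> <Requests> <Size(s)>
extract_memuse() {
  # $1 = type name (e.g. nvlist), $2 = log file;
  # MemUse is reported like "2979K", so strip the K for plotting
  awk -v t="$1" '$2 == t { sub(/K$/, "", $4); print $1, $4 }' "$2"
}
```

For example, `extract_memuse nvlist /tmp/vmstat_m.log > nvlist.dat` gives a two-column series ready for gnuplot or a spreadsheet.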

Just a question: I've been told to monitor vmstat -m, but you're suggesting to monitor vmstat -z instead. Which one should I use? Thanks
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 18, 2022, 12:42:39 AM
Quote from: RedVortex on May 18, 2022, 12:11:13 AM
Also, there is indeed a shift of memory from free to inactive that was less present in pre-22 versions, but I never ran into an OOM because of it; then again, I have 8GB of RAM, not 4GB, so I'm likely less prone to OOM.

What does your Reporting/Health look like (if you have it)?

Attached is mine (System/Memory from Reporting/Health) for the last 77 days (inverse turned on, resolution high); each peak is usually a reboot or an upgrade/reboot.

I've only recently re-enabled the monitoring service, as I had disabled non-essential services before diving into the problem in detail, but I can share the memory report for the last 60h.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 18, 2022, 12:16:05 PM
It seems that among all the vmstat -m types, nvlist is the only one showing almost linear growth.

Does it mean anything to you?
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 19, 2022, 03:19:15 PM
I feel like there's not enough traction for problems like kernel leaks and out-of-memory  :-\
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 19, 2022, 03:55:57 PM
I'm wondering about something in regards to your nvlist.

Could you stop the Zabbix monitoring and agent, make sure it doesn't run anymore (even locally), and see if it still leaks?

I wonder if something is grabbing a list of states or something similar, and there might be a leak in there somewhere; nvlist could be used for something like that.

Something along those lines: https://bugs.freebsd.org/bugzilla//show_bug.cgi?id=255971
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 20, 2022, 12:02:45 PM
Before shutting down zabbix-agentd, I'll attach an updated chart showing the linear growth of nvlist up to now.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 20, 2022, 09:46:46 PM
You nailed it! After stopping zabbix-agentd, wired memory consumption stopped growing and the nvlist chart went flat. No memory was released though; it's still at 611MB even after the process was killed.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 24, 2022, 02:12:05 AM
It's still a bug though... Hopefully they'll fix it kernel-side in an upcoming FreeBSD release. That was fun to troubleshoot though, not something we casually dig into every day 😊
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: arkanoid on May 24, 2022, 02:31:04 AM
Sure, it is not easy at all to spot this.

Not only does top not show it as linked to zabbix-agentd, but nvlist is hard to inspect.

Do you know how I can open an issue on this? Is this a FreeBSD thing, or a Zabbix thing?
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 24, 2022, 09:04:45 PM
Hmmm, I'm not sure where I would open the bug, to be honest. This definitely looks like a FreeBSD bug: Zabbix only "uses" something provided by the OS/kernel that "should" work properly from what we know. Unless it is Zabbix itself that leaks this in some way in its code, but it seems to be kernel-related.

If I were to open a bug, I would try to reproduce it in a simple fashion, like a command-line script or something easily reproducible, similar to what they did here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255971. Ideally without using Zabbix, to make sure it's not Zabbix itself that leaks this nvlist memory in some way (I wouldn't know how)...

And then open a bug with FreeBSD and/or OPNsense. What they fixed in the above bug may not have made its way yet into the FreeBSD version that OPNsense runs on. I'm not sure how to determine whether the version we have contains that bugfix, or even whether it is really relevant to the bug you found here, but it certainly looks very similar to yours. Since they discovered this bad behaviour on 13.0-STABLE, and that's what the latest OPNsense runs, maybe we don't have the bugfix yet; I'm not sure how to determine this 100%.

That being said, I would most likely at least open a bug with OPNsense to start with, considering this ends up crashing the whole thing, you're not doing anything "funky", and you're using the released versions of both the Zabbix agent and OPNsense; so you're probably not alone out there with this issue. I would put a link to this forum post in the ticket, fully explain what triggers the bug (running the Zabbix agent) and what happens (an nvlist leak leading to OOM and crash), and probably also link the FreeBSD bug, asking if it may be related in some way.

If you know which metric in the Zabbix monitoring triggers this leak, that would be helpful; it would let them easily replicate, troubleshoot and fix it. On your side, you could also try to exclude that metric from the agent for now and keep monitoring everything else, instead of nothing at all.

Hopefully, this is already fixed by the previous bug and we just need an updated version of FreeBSD or at least a patched one.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 24, 2022, 09:27:24 PM
Interestingly, I just tried this (similar to what is in the FreeBSD bug ticket):

pfctl -sa

And wrapping it in a small script:

echo "before: " ; vmstat -m | grep "InUse\|nvlist" ; pfctl -sa > /dev/null ; echo "after:" ; vmstat -m | grep "InUse\|nvlist"

And that seems to leak. I'm not pushing it to see how far it will go, but so far it does not seem to release the InUse and MemUse.

echo "before: " ; vmstat -m | grep "InUse\|nvlist" ; pfctl -sa > /dev/null ; echo "after:" ; vmstat -m | grep "InUse\|nvlist"
before:
         Type InUse MemUse Requests  Size(s)
       nvlist 22030  2979K 66258359  16,32,64,128,256,2048,4096,8192
after:
         Type InUse MemUse Requests  Size(s)
       nvlist 22036  2980K 66975304  16,32,64,128,256,2048,4096,8192
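To get a rough per-call rate rather than a single before/after pair, the one-liner can be looped; a sketch (the call count and the awk extraction are my own, to be run as root on FreeBSD with pf enabled):

```shell
# Sketch: estimate the nvlist leak per pfctl invocation by sampling
# vmstat -m around a burst of calls. Numbers are illustrative only.
nvlist_inuse() {
  # InUse is the second column of the nvlist row in `vmstat -m`;
  # VMSTAT can be overridden for testing on a non-FreeBSD machine
  ${VMSTAT:-vmstat -m} | awk '$1 == "nvlist" { print $2 }'
}

if command -v pfctl >/dev/null 2>&1; then
  n=100
  before=$(nvlist_inuse)
  i=0
  while [ "$i" -lt "$n" ]; do
    pfctl -s info > /dev/null 2>&1
    i=$((i + 1))
  done
  after=$(nvlist_inuse)
  echo "nvlist InUse grew by $((after - before)) over $n calls"
fi
```

If InUse grows by a fixed amount per call and never shrinks, that would match the steady wired-memory rise seen in this thread.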

Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on May 29, 2022, 04:48:58 AM
This command also seems to increase InUse, which never goes down, on FreeBSD 13.1 (I'm testing the OPNsense 22.7 kernel right now), so I'm not sure the leak triggered by the Zabbix agent monitoring will be fixed in a more recent kernel or OS release, unfortunately.

You should probably open up a bug with OPNsense in case it's something they can or need to fix on their side for Zabbix, and/or something in regards to nvlist, if you need this working.
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on June 02, 2022, 11:28:31 AM
It's rather strange... the fixes are already in 22.1, which means there are more leaks in the ioctl code.


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on June 02, 2022, 01:29:52 PM
The leak seems to be created by the "pfctl -s info" diagnostics only, which are gathered by a number of things in the core system. I'm trying to see if this is an easy patch.


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on June 02, 2022, 01:59:08 PM
https://github.com/opnsense/src/commit/4d3b9e4a34

# echo "before: " ; vmstat -m | grep "InUse\|nvlist" ; pfctl -sa > /dev/null ; echo "after:" ; vmstat -m | grep "InUse\|nvlist"
before:
         Type InUse MemUse Requests  Size(s)
       nvlist     0     0K   200866  16,32,64,128,256,2048,4096,8192
after:
         Type InUse MemUse Requests  Size(s)
       nvlist     0     0K   320016  16,32,64,128,256,2048,4096,8192


Good enough for now? :)

For anyone willing to try:

# opnsense-update -kzr 22.1.8-nvlist2
# opnsense-shell reboot


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: franco on June 03, 2022, 11:49:54 AM
After more digging: https://github.com/opnsense/src/commit/79e0c974ae85 has been on the FreeBSD main branch only since April, which is a bit unfortunate. I thought it would fix all cases, but there is another leak in the code, which was merged into main via https://reviews.freebsd.org/D35385 yesterday.

We'll most likely offer both patches in 22.1.9 to avoid further issues with this particular memory leak, as the core system regularly polls pfctl as well. Not as often as Zabbix, but still problematic.


Cheers,
Franco
Title: Re: Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.
Post by: RedVortex on June 04, 2022, 06:01:41 AM
Thanks a lot for all the patches and digging Franco !

I can't test the patches myself since I'm on the test release for FreeBSD 13.1 (22.7...); hopefully arkanoid can test them, especially with Zabbix, since it seemed to leak a lot, maybe from more places than the ones we found...

I'm very glad you could nail it down; memory leaks can be very tricky to figure out. It was another fun bug hunt  :)