Wired memory ramps up until OOM Killer kicks in every 7 days. Reboot. Repeat.

Started by arkanoid, May 15, 2022, 12:25:59 PM




Could you please write the text directly into your postings instead of attachments and screenshots? That would be way more convenient for everybody trying to read what you write.

There are code tags for that:

This is pasted code
or command output.


Thanks,
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

roger that, I was just trying to manually collect data even while on the go, so I used screenshots. I'll try to stick with text data.

Now I've wrapped up a shell script to collect

vmstat -m

over time; hopefully I'll get some good data for further analysis. Please redirect me if there's a better way to debug this.
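For reference, the collection logic is roughly the following, shown here as a Python sketch rather than the actual shell script (the output path, the one-minute interval and the timestamp format are just illustrative choices on my part):

#!/usr/bin/env python3
"""Periodically snapshot `vmstat -m` so the per-type MemUse can be analysed later."""
import subprocess
import time
from datetime import datetime, timezone

OUTFILE = "/var/log/vmstat-m.log"  # illustrative path
INTERVAL = 60                      # seconds between samples

while True:
    snapshot = subprocess.run(["vmstat", "-m"], capture_output=True, text=True).stdout
    stamp = datetime.now(timezone.utc).isoformat()
    with open(OUTFILE, "a") as f:
        # One timestamped block per sample, with a marker line so it is easy to split later.
        f.write(f"### {stamp}\n{snapshot}\n")
    time.sleep(INTERVAL)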

In the meantime, here's the latest top -o size output:

last pid: 36860;  load averages:  0.44,  0.42,  0.46  up 3+07:50:54  14:46:34
53 processes:  1 running, 52 sleeping
CPU:  0.2% user,  0.0% nice, 13.1% system,  0.0% interrupt, 86.7% idle
Mem: 51M Active, 1041M Inact, 547M Wired, 331M Buf, 2314M Free

Just a guess: is your OPNsense UI accessible from the Internet? Possibly someone is probing the UI for bugs causing php-cgi processes with high memory consumption to pile up?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Nope, the only port exposed to the Internet is the VPN one, and tcpdump confirms that there's no unexpected traffic.

I'd expect peaks in memory, but all I see is a continuous slow leak
last pid: 70854;  load averages:  0.29,  0.46,  0.48    up 3+12:26:21  19:22:01
57 processes:  3 running, 54 sleeping
CPU: 13.4% user,  0.0% nice, 33.3% system,  0.0% interrupt, 53.3% idle
Mem: 65M Active, 1066M Inact, 558M Wired, 340M Buf, 2265M Free

Being curious about your issue... Can you post the output of

vmstat -z | tr ':' ',' | sort -nk4 -t','

We see that your wired memory is increasing slowly over time; does it stabilize at some point, or does it really end up consuming both free and inactive memory? Mine is stable around 800M. It varies between 8 and 12% of the total memory (8G in my case, so 800M is around 10%). Do you know how big the numbers were right before an OOM (including free and inactive)?

We also see your free memory being converted into inactive memory. This does not seem abnormal at first glance; inactive memory can be reused if need be.

You should not be running into an OOM with those numbers, unless something else is eating up the memory, or something is allocating very rapidly right before the OOM that we do not see here.

Also, there is indeed a shift in memory from free to inactive that was less present in pre-22 versions, but I never ran into an OOM because of it. Then again, I have 8GB of RAM, not 4GB, so I'm likely less prone to OOM.

What does your Reporting/Health look like (if you have it)?

Attached is mine (System/Memory from Reporting/Health) for the last 77 days (inverse turned on, resolution high), each peak is usually a reboot or an upgrade/reboot.

Quote from: RedVortex on May 17, 2022, 11:40:54 PM
Being curious about your issue... Can you post the output of

vmstat -z | tr ':' ',' | sort -nk4 -t','

Thanks for looking into this issue. Here's the output: https://termbin.com/0uzj

Quote from: RedVortex
We see that your wired memory is increasing slowly over time; does it stabilize at some point, or does it really end up consuming both free and inactive memory? Mine is stable around 800M. It varies between 8 and 12% of the total memory (8G in my case, so 800M is around 10%). Do you know how big the numbers were right before an OOM (including free and inactive)?

It doesn't stabilize; it keeps growing until the OOM killer kills all processes, leaving only the kernel alive. Before I switched to the wireguard kmod it would kill wireguard-go too, taking down all VPN connections; now it leaves the VPN alive but no other services are available.

I'm not sure about the numbers right before the OOM. My external monitor (Zabbix) records "available memory", defined as free+cached+buffers, and the last OOM happened when this value was 1.5GB, so that doesn't really explain anything.

Quote from: RedVortex
We also see your free memory being converted into inactive memory. This does not seem abnormal at first glance; inactive memory can be reused if need be.

This can be explained by my manually starting and killing an iperf3 server in a tmux terminal. It consumes a lot of memory and releases it afterwards. This is not the cause of the OOM: I am aware of this behaviour, and the problem happens even at night when no iperf3 is running and the admin is sleeping  :'(
EDIT: I can't really be sure of this; I just retried and saw no massive spike in memory usage.

Quote from: RedVortex
You should not be running into an OOM with those numbers, unless something else is eating up the memory, or something is allocating very rapidly right before the OOM that we do not see here.

There's still a possibility that the problem is caused not by the steady rise but by a massive spike in memory usage, but so far the steady rise of wired memory seems to be the only player in the field.


I've been collecting the output of `vmstat -m` every minute over the last few hours. Please find attached the resulting plot of the "MemUse" column, with the constant features filtered out.

I can share the Python code that generates this if required.
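For the record, this is more or less what that code does, as a simplified sketch (it assumes the samples were written as timestamped blocks like the collector above, that the stock FreeBSD vmstat -m column layout applies, and "constant" is simply taken to mean max equals min):

#!/usr/bin/env python3
"""Parse the timestamped `vmstat -m` samples and plot MemUse for the types that actually change."""
import re
from collections import defaultdict

import matplotlib.pyplot as plt

LOGFILE = "/var/log/vmstat-m.log"  # same illustrative path as the collector
# Type name, InUse, then MemUse in KiB; Requests and Size(s) are ignored.
LINE_RE = re.compile(r"^\s*(?P<name>.+?)\s+\d+\s+(?P<memuse>\d+)K\s")

samples = defaultdict(list)  # type name -> one MemUse value (KiB) per sample
timestamps = []

with open(LOGFILE) as f:
    blocks = f.read().split("### ")[1:]  # the chunk before the first marker is empty

for block in blocks:
    lines = block.splitlines()
    timestamps.append(lines[0])  # ISO timestamp written by the collector
    for line in lines[1:]:
        m = LINE_RE.match(line)
        if m:
            samples[m.group("name")].append(int(m.group("memuse")))

# Keep only the types whose MemUse actually varies, i.e. filter out the constant features.
for name, values in samples.items():
    if len(values) == len(timestamps) and max(values) != min(values):
        plt.plot(range(len(values)), values, label=name)

plt.xlabel("sample #")
plt.ylabel("MemUse (KiB)")
plt.legend(fontsize="small")
plt.show()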

Just a question: I've been told to monitor vmstat -m, but you're suggesting vmstat -z instead. Which one should I use? Thanks

Quote from: RedVortex on May 18, 2022, 12:11:13 AM
Also, there is indeed a shift in memory from free to inactive that was less present in pre-22 versions, but I never ran into an OOM because of it. Then again, I have 8GB of RAM, not 4GB, so I'm likely less prone to OOM.

What does your Reporting/Health look like (if you have it)?

Attached is mine (System/Memory from Reporting/Health) for the last 77 days (inverse turned on, resolution high), each peak is usually a reboot or an upgrade/reboot.

I've only recently re-enabled the monitoring service, as I had tried to disable non-essential services before diving into the problem in detail, but I can share the memory report for the last 60h.

It seems that, among all the vmstat -m variables, nvlist is the only one showing almost linear growth.

Does it mean anything to you?

I feel like there's not enough traction for problems like kernel leaks and out-of-memory  :-\

I'm wondering about something regarding your nvlist.

Could you stop the Zabbix monitoring and agent and make sure it doesn't run anymore, even locally? Then see if it still leaks.

I wonder if something is grabbing a list of states (or something similar) and there might be a leak in there somewhere; nvlist could be used for something like that.

Something along those lines: https://bugs.freebsd.org/bugzilla//show_bug.cgi?id=255971
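If you want to test that theory without Zabbix in the picture, a quick sketch like the one below might help (my own idea, nothing official; it assumes pfctl -ss is available and that vmstat -m shows an "nvlist" row, which your plot suggests it does). It repeatedly dumps the pf state table and reports whether the nvlist counter moved:

#!/usr/bin/env python3
"""Dump the pf state table many times and compare the nvlist MemUse before and after."""
import re
import subprocess

def nvlist_kib():
    """Return the MemUse (KiB) of the 'nvlist' malloc type from `vmstat -m`, or None if absent."""
    out = subprocess.run(["vmstat", "-m"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        m = re.match(r"^\s*nvlist\s+\d+\s+(\d+)K\s", line)
        if m:
            return int(m.group(1))
    return None

before = nvlist_kib()
for _ in range(1000):
    # Each call dumps the whole state table; if nothing leaks, the nvlist counter should stay flat.
    subprocess.run(["pfctl", "-ss"], capture_output=True, text=True)
after = nvlist_kib()

print(f"nvlist MemUse: {before}K -> {after}K")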

Before shutting down zabbix-agentd, I'll attach an updated chart displaying the linear growth of nvlist up to now.