Simple install appears to have a memory leak since 22.1

Started by clvrmnky, March 02, 2022, 07:21:34 PM

Previous topic - Next topic
See https://forum.opnsense.org/index.php?topic=26867.15 but I did not want to bump that one as there is chatter there related to other issues. But someone observed unbound behaving badly in some cases over there as well.

Since upgrading to 22.1 my simple SOHO OPNsense box has been eating memory slowly and constantly until it starts to swap and requires a reboot. The symptoms are always the same:


  • Starts at some smallish 20%18% (based on the widget in the Lobby)
  • Memory use grows constantly over time with no interaction on my part
  • Eventually memory use spikes to high 90% and starts swapping
  • After about a week I give up and reboot

Memory use is currently at 57% with an uptime of  "4 days, 10:11". Previously it never went over 25%.

Here is the output from "top -o res -a 25":

last pid: 79521;  load averages:  0.42,  0.39,  0.23  up 4+10:04:01    13:03:26
48 processes:  1 running, 47 sleeping
CPU:  0.7% user,  0.0% nice,  0.4% system,  0.2% interrupt, 98.7% idle
Mem: 45M Active, 855M Inact, 755M Wired, 276M Buf, 680M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
  451 root          1  52    0   103M    61M accept   0   3:07   0.00% /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.8)
25622 unbound       4  20    0    91M    45M kqread   3   5:31   0.00% /usr/local/sbin/unbound -c /var/unbound/unbound.conf
53754 root          1  20    0    47M    34M accept   2   0:02   1.07% /usr/local/bin/php-cgi
54074 root          1  52    0    47M    34M accept   0   0:03   0.20% /usr/local/bin/php-cgi
50340 root          1  52    0    43M    34M accept   0   0:02   0.00% /usr/local/bin/php-cgi
11311 root          1  52    0    43M    34M accept   0   0:03   0.00% /usr/local/bin/php-cgi
83438 root          1  52    0    43M    33M accept   3   0:02   0.20% /usr/local/bin/php-cgi
47627 root          1  52    0    43M    33M accept   0   0:00   0.00% /usr/local/bin/php-cgi
73866 root          1  20    0    43M    29M wait     2   0:00   0.00% /usr/local/bin/php-cgi
73559 root          1  20    0    43M    29M wait     1   0:00   0.00% /usr/local/bin/php-cgi
88041 www           1  52    0    43M    21M accept   3   0:00   0.00% php-fpm: pool www (php-fpm)
88012 www           1  52    0    43M    21M accept   3   0:00   0.00% php-fpm: pool www (php-fpm)
87894 root          1  52    0    43M    21M accept   3   0:00   0.00% php-fpm: pool webgui (php-fpm)
88002 root          1  52    0    43M    21M accept   1   0:00   0.00% php-fpm: pool webgui (php-fpm)
87876 root          1  20    0    43M    21M kqread   3   0:11   0.00% php-fpm: master process (/usr/local/etc/php-fpm.conf) (php-fpm)
  447 root          1  52    0    30M    19M wait     3   0:02   0.00% /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.8)
26205 root          1  20    0    25M    14M select   0   1:15   0.00% /usr/local/bin/python3 /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain clvrmnky.org (python3.8)
13474 dhcpd         1  20    0    25M    13M select   3   0:02   0.00% /usr/local/sbin/dhcpd -user dhcpd -group dhcpd -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid re0
23790 root          1  20    0    23M    11M select   1   0:44   0.00% /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.8)
23095 root          1  20    0    23M    11M select   1   0:49   0.00% /usr/local/bin/python3 /usr/local/opnsense/service/configd_ctl.py -e -t 0.5 system event config_changed (python3.8)
23014 root          4  20    0    42M    11M kqread   3  12:16   0.00% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
91128 root          2  20    0    18M  8496K nanslp   1   0:09   0.00% /usr/local/bin/monit -c /usr/local/etc/monitrc
45708 help          1  25    0    18M  7828K select   3   0:00   0.00% sudo su
73312 root          1  20    0    17M  7656K kqread   1   0:18   0.00% /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf
22982 root          1  52    0    19M  7368K wait     2   0:00   0.00% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid


There are 4 unbound threads, and they just keep growing. I use both blocklists (one small one) and DNS-over-TLS, but this is not new since 22.1. Most other options are default. The CPU time for unbound doesn't look right, either, but I don't recall what that might have been in the past.

All those processes with large RES values will continue to grow, as well. It isn't swapping yet, but it will be by the weekend.

However, even with all that, if I restart unbound memory use doesn't change. In fact, it'll often get worse. I don't know who is holding onto all this memory, but this feels like a classic memory leak somewhere.

The "health" widget shows free memory dropping at a constant rate, but it is unclear where it is going. Perhaps to "inactive".

Prior to 22.1 this exact same hardware and configuration has uptime measured in months. I only took it down to accept scheduled updates requiring a reboot. I literally "set it and forget it" and now I have to visit the admin page few days to know when to take the time to bounce it.

But is it rising beyond 57%? It's been said before that memory use is not indicative of actual memory consumption since used memory is never freed in the system so never reported as such anymore. And leaks are growing. After all we are looking at a large operating system bump from 12.1 to 13.0-STABLE.


Cheers,
Franco

Yes. Well beyond 90% and swapping, eventually, as I said in my report. This isn't a case where there was a static bump. It starts at ~18% as always and then continually grows until it has to be manually rebooted.

It is now at 61% since I posted my original report. 4-5% per day or so is typical. It never levels off.

There is a leak in 22.1. I don't know where, but it's there.

As per https://forum.opnsense.org/index.php?topic=27387.msg132879#msg132879 it is the logger that is eating up all the memory.

7 days of uptime requires a restart, or resetting all the logs.

New behaviour since 22.1

Try shutting off the blocklists....I had the same issue on a different instance with unbound active....never did understand why they would cause it.....but once I shut the lists off....the memory leaking stopped.

Yes, I did have to adjust blocklists early in my investigation. Specifically, one blocklist in particular was causing unbound to stall on startup and eat up lots of free memory. However, once unbound was stable it did not grow anymore. In fact some processes, including the 4 unbound threads, had a smaller memory footprint after 7 days.

It would still be swapping, though. Turns out it is the logger, which was growing unbounded (no pun intended) until the entire system was impacted. I've turned off the logger for now.

I have no solutions, but as a data point I can offer that my install is not-so-simple and I'm not observing a memory leak in 22.1.2_1 at this point. 

My unbound is set to recurse all queries and use DNSSEC, it was loaded with the Steven Black blocklist.  I'm also running Sensei, OpenVPN and Wireguard, multiple physical and virtual interfaces, mdns repeater, ACME, Adguard (only recently), HAProxy, UPNP w/ ACLs (for game console only), dyndns (just switched to ddclient) and more.

I have found since going to 22.x RAM utilization is +10% compared to 21.x but it seems to be holding steady at ~3.3G out of the 8G available according to the dashboard.  As a matter of fact, "wired" RAM is slowly falling as uptime increases.

EDIT: system is an i5-3k Dell Optiplex 8GB RAM w/ ZFS root on single SATA SSD running an i340-T4 NIC.