OPNsense Forum

Archive => 22.1 Legacy Series => Topic started by: mmaridev on February 10, 2022, 12:20:01 pm

Title: Memory getting saturated
Post by: mmaridev on February 10, 2022, 12:20:01 pm
Hi,

upgraded almost 2 weeks ago from 21 and now having a problem with the RAM. Its usage gets higher and higher up to when important processes like unbond get killed and the network stops working.
Had nothing like this before the upgrade. Any clue?

Thanks in advance,
Title: Re: Memory getting saturated
Post by: clvrmnky on February 16, 2022, 03:01:12 pm
I've noticed a lot of memory being used and not being freed. So far it isn't climbing, but my uptime has only been a few days with 22.1.

What are the top size and res in System: Diagnostics: Activity for you? I see many unbound and netdata threads chewing up lots of memory. Top X threads:

20667   unbound   20   0   1066M   882M   kqread   2   0:00   0.03%   /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
   20667   unbound   20   0   1066M   882M   kqread   0   1:39   0.01%   /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
   20667   unbound   20   0   1066M   882M   kqread   3   0:00   0.00%   /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
   20667   unbound   20   0   1066M   882M   kqread   2   0:00   0.00%   /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
   3127   netdata   40   19   130M   57M   nanslp   2   0:07   0.23%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{PLUGIN[freebsd]}
   3127   netdata   40   19   130M   57M   piperd   3   0:01   0.03%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{PLUGINSD[apps]}
   3127   netdata   40   19   130M   57M   nanslp   2   0:01   0.03%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{PLUGIN[idlejitt}
   3127   netdata   40   19   130M   57M   piperd   1   0:01   0.02%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{PLUGINSD[python}
   3127   netdata   40   19   130M   57M   piperd   0   0:01   0.02%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{PLUGINSD[charts}
   3127   netdata   40   19   130M   57M   nanslp   1   0:00   0.01%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{GLOBAL_STATS}
   3127   netdata   40   19   130M   57M   nanslp   3   0:00   0.01%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{STATSD}
   3127   netdata   40   19   130M   57M   kqread   1   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{DBENGINE}
   3127   netdata   40   19   130M   57M   select   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{WEB_SERVER[stat}
   3127   netdata   40   19   130M   57M   select   1   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{WEB_SERVER[stat}
   3127   netdata   40   19   130M   57M   select   1   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{WEB_SERVER[stat}
   3127   netdata   40   19   130M   57M   select   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{STATSD_COLLECTO}
   3127   netdata   40   19   130M   57M   select   0   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{WEB_SERVER[stat}
   3127   netdata   52   19   130M   57M   pause   0   0:01   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{netdata}
   3127   netdata   40   19   130M   57M   nanslp   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{HEALTH}
   3127   netdata   40   19   130M   57M   nanslp   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{PLUGINSD}
   3127   netdata   40   19   130M   57M   nanslp   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{SERVICE}
   3127   netdata   40   19   130M   57M   uwait   3   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{netdata}
   3127   netdata   40   19   130M   57M   kqread   0   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{DAEMON_SPAWN}
   3127   netdata   40   19   130M   57M   uwait   0   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{netdata}
   3127   netdata   40   19   130M   57M   uwait   0   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{netdata}
   3127   netdata   40   19   130M   57M   uwait   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{netdata}
   3127   netdata   52   19   130M   57M   kqread   2   0:00   0.00%   /usr/local/sbin/netdata -u netdata -P /var/db/netdata/netdata.pid{DAEMON_COMMAND}
   19798   root   52   0   62M   42M   accept   1   0:02   0.00%   /usr/local/bin/php-cgi
   451   root   52   0   72M   39M   accept   0   0:01   0.00%   /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.8){python3.8}
   451   root   52   0   72M   39M   wait   3   0:00   0.00%   /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.8){python3.8}
   43400   netdata   40   19   63M   33M   select   0   0:06   0.19%   /usr/local/bin/python3.8 /usr/local/libexec/netdata/plugins.d/python.d.plugin 1{python3.8}
   43400   netdata   40   19   63M   33M   select   3   0:05   0.02%   /usr/local/bin/python3.8 /usr/local/libexec/netdata/plugins.d/python.d.plugin 1{python3.8}
   1941   root   20   0   48M   32M   select   1   0:52   0.01%   /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.8)
   77770   root   20   0   53M   31M   select   0   0:03   0.00%   /usr/local/bin/php-cgi
   43499   root   21   0   49M   29M   accept   1   0:02   0.00%   /usr/local/bin/php-cgi
   16758   root   52   0   47M   27M   accept   1   0:02   0.00%   /usr/local/bin/php-cgi
   44900   root   20   0   47M   26M   accept   1   0:02   0.00%   /usr/local/bin/php-cgi
   43818   root   52   0   47M   25M   accept   3   0:02   0.00%   /usr/local/bin/php-cgi
Title: Re: Memory getting saturated
Post by: j_s on February 16, 2022, 09:58:28 pm
I'm having similar issues.

Let me start with this though... I don't typically recommend enabling netdata unless it is necessary.  At my prior employer we had so many problems with netdata memory leaks that we decided to disable it except when doing diagnostic testing.  I don't know if that will solve your problem, but it definitely won't solve mine.

For me, in top unbound wants to use 9GB of RAM.  This is a business with about 40 bitcoin miners on it.  They all point to the same place, and I have no explanation for how/why unbound would need that much RAM.  I haven't done anything crazy with unbound to explain this behavior.

I got an alert that my RAM is now 96% full (system has 8GB of RAM) and is up to date.

Anyone aware of any memory leak or other issue with Unbound?  I'm tempted to restart the service just to hopefully get unbound under control.  But ideally I like to troubleshoot the problem and fix it permanently rather than a temporary fix like restarting a service.  Notice in the screenshots that its even started using swapspace.  I've ordered more RAM for this box so it can be upped to 16GB, but I got this strong suspicion that unbound will eventually fill that up too.
Title: Re: Memory getting saturated
Post by: clvrmnky on February 17, 2022, 01:24:56 am
There isn't any real reason I enabled Netdata. It's not like I need that sort of visibility into things. I might have been using it to inspect IPv6 traffic a bit deeper.

I ended up disabling it right after posting my reply, and didn't see any change in memory use -- maybe a slight drop.

Hours later it's back up to 65-70%92% All unbound, python3, PHP and syslog-ng.

Something different is going on with 22.1.
Title: Re: Memory getting saturated
Post by: j_s on February 18, 2022, 03:36:46 am
So after I posted that, an update came out to 22.1.1_1, which I updated yesterday evening.  Naturally, the reboot reset the memory usage.  I'm at 2.2GB of RAM used now, and I'm going to continue to monitor it.  Right now unbound is at 1.3GB of RAM.  I'll provide more updates as time goes on.
Title: Re: Memory getting saturated
Post by: clvrmnky on February 18, 2022, 04:53:48 am
Are you using blocklists in unbound? It looks like some blocklists are blowing out memory on each unbound process/thread. Even starting/stopping unbound is painful and error-prone with some blocklists.

And NetFlow , if you have it enabled, also seems to be a memory hog now. Hard to disable it, though. I had to "Clear All" all the settings and reboot, which took a long time because between NetFlow, unbound, and the unbound control processes something was stuck.

I removed one of the blocklists and restarted unbound and I'm back to 40% in the dashboard. unbound processes are HUGE now, but they are smaller with smaller blocklists. And the processes that start/stop them don't spike the CPU.

There is another thread where unbound and blocklists is causing CPU spikes and +20C more heat in some smaller boxes.
Title: Re: Memory getting saturated
Post by: j_s on February 18, 2022, 02:22:48 pm
I'm not using any blocklists.  The feature is disabled in my WebGUI.

However a quick update.  I've updated to 22.1.1_3 (no reboot required for that update, so none was performed) but in less than 12 hours, I'm now at 2.8GB of RAM used by the system, and 1.8GB for unbound.  So the system is using about 500MB of RAM more than it was less than 12 hours ago, and nearly 100% of that is for unbound exclusively.

Something seems very amiss with Unbound.  But I'm not an unbound expert, so I'm not even sure where to go to start troubleshooting this.  One thing I've changed relatively recently is that I added a domain override to another DNS server at our primary site.  Basically each site has it's own DNS, and each site is supposed to do DNS lookups to other site's DNS to resolve IPs within the company.  After removing the override and restarting unbound to apply the changes, unbound is now using 203MB of RAM.  So I'll monitor for the next 24-48 hours.

Another thing I find a bit odd is that when I try to look at the logging for Unbound DNS under Services -> Unbound DNS -> Log File nothing actually lists.  Basically I have two options... clear log and download log.  The place where the logging normally appears just sits there and shows "Loading..." and I see that each time I load the page another python 3.8 process loads and goes to 100%.  After I enabled the multiselect option and chose all of them, I had a number of processes at 100% on the router.  Each one spawned when I checked another one of the level of logging to display, and they all seemed to get stuck and run at 100%.

root    17645   0.0  0.3  32328 22024  -  Is   06:01       0:03.33 |-- /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.8)
root    26456   0.0  0.3  65236 27904  -  I    06:01       0:03.74 | `-- /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.8)
root    59161 100.0  0.2  25036 13692  -  R    08:06       1:56.32 |   |-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 0 --offset 0 --f                                                            ilter --module core --filename resolver --severity Emergency,Alert,Critical,Error,Warning (python3.8)
root    74015 100.0  0.2  25292 14020  -  R    08:04       3:56.40 |   |-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 20 --offset 0 --                                                            filter --module core --filename resolver --severity Emergency,Alert,Critical,Error,Warning (python3.8)
root    98532 100.0  0.2  25036 13692  -  R    08:07       0:59.04 |   |-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 20 --offset 0 --                                                            filter --module core --filename resolver --severity Warning (python3.8)
root    98575 100.0  0.2  25036 13692  -  R    08:07       0:58.54 |   |-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 20 --offset 0 --                                                            filter --module core --filename resolver --severity Emergency,Warning (python3.8)
root    98721 100.0  0.2  25036 13692  -  R    08:07       0:58.18 |   |-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 20 --offset 0 --                                                            filter --module core --filename resolver --severity Emergency,Alert,Warning (python3.8)
root    98773 100.0  0.2  25036 13692  -  R    08:07       0:57.84 |   |-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 20 --offset 0 --                                                            filter --module core --filename resolver --severity Emergency,Alert,Critical,Warning (python3.8)
root    99094 100.0  0.2  25036 13692  -  R    08:07       0:57.44 |   `-- /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 20 --offset 0 --                                                            filter --module core --filename resolver --severity Emergency,Alert,Critical,Error,Warning (python3.8)


I started writing this post, and then the WebGUI returned this:

{"errorMessage":"/usr/local/opnsense/mvc/app/controllers/OPNsense/Base/ApiControllerBase.php:96: Error at /usr/local/opnsense/mvc/app/controllers/OPNsense/Diagnostics/Api/LogController.php:105 - Trying to access array offset on value of type null (errno=8)","errorTrace":"#0 /usr/local/opnsense/mvc/app/controllers/OPNsense/Diagnostics/Api/LogController.php(105): OPNsense\\Base\\ApiControllerBase->APIErrorHandler(8, 'Trying to acces...', '/usr/local/opns...', 105, Array)\n#1 [internal function]: OPNsense\\Diagnostics\\Api\\LogController->__call('coreAction', Array)\n#2 [internal function]: Phalcon\\Dispatcher\\AbstractDispatcher->callActionMethod(Object(OPNsense\\Diagnostics\\Api\\LogController), 'coreAction', Array)\n#3 [internal function]: Phalcon\\Dispatcher\\AbstractDispatcher->dispatch()\n#4 /usr/local/opnsense/www/api.php(26): Phalcon\\Mvc\\Application->handle('/api/diagnostic...')\n#5 {main}","errorTitle":"An API exception occured"}

The opnsense box became very sluggish as I maxed out all 8 CPUs on the system, and load averages went over 10.  As its 6AM here, I do not have the option of rebooting the opnsense box.  This box was intended as a stop-gap while we got our HA system up and running, so a reboot takes that site down.  As nobody is onsite, and a few workloads will go haywire if the internet goes down, I'm stuck waiting for either someone to arrive at the office to assist me during the opnsense reboot or the python processes eventually timeout or otherwise terminate.  Right now they've been running for almost 15 minutes, so I'm getting a little concerned that they aren't going to do either and keep running at 100%.
Title: Re: Memory getting saturated
Post by: Fright on February 18, 2022, 04:06:03 pm
@j_s
have you tried to reset logs?
System: Settings: Logging
Title: Re: Memory getting saturated
Post by: cookiemonster on February 18, 2022, 04:14:26 pm
I am not certain is the place to look but I see /var/log/resolver/resolver_{date}.log that seems to be unbound logs.
Title: Re: Memory getting saturated
Post by: j_s on February 18, 2022, 05:18:50 pm
@j_s
have you tried to reset logs?
System: Settings: Logging

I couldn't find it earlier.  I was up at 3AM because I couldn't sleep.  However, I just reset the logging, and now the logging works in Unbound properly.  Thanks for that!

I am not certain is the place to look but I see /var/log/resolver/resolver_{date}.log that seems to be unbound logs.

Yeah, I am gonna let this run for 24 hours or so with the settings changes.  Once that is determined to be fine, I'll try re-enabling the stuff I disabled (we really do need working DNS cross-site, but not in the short term) and then I'll monitor the logging.  I tried looking for unbound.log and others, but not that particular file.  As the logs are reset, I don't have them to look at right now.  But if my next post involved "the problem is back" you can bet I'll be looking at that log file. ;)

Thanks everyone.  Currently going to monitor over the weekend and wait.
Title: Re: Memory getting saturated
Post by: j_s on February 19, 2022, 03:56:22 am
Okay, so it's been less than 12 hours and unbound is already over 1GB of RAM used.

The resolver logs are clearly identifying the issue:

Code: [Select]
<165>1 2022-02-18T21:04:51-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9890"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.94.
<165>1 2022-02-18T21:04:51-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9891"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.96.
<165>1 2022-02-18T21:04:51-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9892"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.98.
<165>1 2022-02-18T21:04:51-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9893"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.97.
<165>1 2022-02-18T21:04:51-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9894"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.99.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9895"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.137.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9896"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.141.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9897"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.131.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9898"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.122.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9899"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.129.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9900"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.138.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9901"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.128.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9902"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.124.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9903"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.109.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9904"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.125.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9905"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.121.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9906"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.110.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9907"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.127.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9908"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.133.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9909"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.101.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9910"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.108.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9911"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.107.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9912"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.134.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9913"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.102.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9914"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.35.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9915"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.89.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9916"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.95.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9917"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.93.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9918"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.90.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9919"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.91.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9920"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.117.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9921"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.116.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9922"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.115.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9923"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.140.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9924"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.139.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9925"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.142.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9926"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.132.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9927"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.135.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9928"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.113.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9929"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.126.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9930"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.112.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9931"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.118.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9932"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.111.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9933"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.119.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9934"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.120.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9935"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.123.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9936"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.136.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9937"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.130.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9938"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.104.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9939"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.105.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9940"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.100.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9941"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.114.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9942"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.47.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9943"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.103.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9944"] dhcpd entry changed WhatsMiner.site.business @ 172.18.10.6.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9945"] dhcpd entry changed WhatsMiner.site.business @ 172.18.10.17.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9946"] dhcpd entry changed WhatsMiner.site.business @ 172.18.10.4.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9947"] dhcpd entry changed WhatsMiner.site.business @ 172.18.10.7.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9948"] dhcpd entry changed WhatsMiner.site.business @ 172.18.10.34.
<165>1 2022-02-18T21:04:54-05:00 tempfirewall.site.business unbound 44293 - [meta sequenceId="9949"] dhcpd entry changed WhatsMiner.site.business @ 172.18.12.94.


The log from the last 12ish hours is already 75MB, and it's just chocked full of 450,000 ines of this stuff.  If I grep out everything with "Whatsminer" I get about 100 lines of logging from that 75MB file.  So the redundant name (or those devices themselves) seem to be a problem somehow.  I think I might rename one of them and see if it disappears from the logs.


We have two /24 subnets (172.18.12.x and 172.17.10.x) and a bunch of whatsminers being tested.

For our testing, I had dhcp leases default to 1200 seconds and maximum of 1800 seconds.  We went short because we were testing things with DHCP and DNS, and having a shorter renewal time allowed us to test stuff faster.  However, if you look at the above snippet, the same IP is at the top and bottom... just 3 seconds apart.  The logs are nothing but these entries, with never more than 3-4 seconds between them.

We also kept the default name because they're simply for testing, and will get a final name when they go to their final location.

In a future production environment we really want to have the leases with a relatively short time.  I was thinking an hour or two to coincide with the old RFC1541 spec although there is no minimum anymore from what I remember.

So is the problem that all these devices have the same hostname and unbound isn't enjoying that at all?  Naturally, if I do a DNS lookup for that device name, I get all of the devices.  Useless, for actual DNS resolution, but not something I'm upset about as we're just testing and they will eventually be renamed to their final names.

I did grab one device that was showing up nonstop in the list, and I changed its name to something unique, and the log entries for that specific IP stopped immediately.  So i'm inclined to think having 80+ devices with the same name is "not good to do", even if we're just lazy and trying to do testing.

Anyone agree or disagree with this, or offer any insight into this?  I'd love to understand this behavior better.
 
Title: Re: Memory getting saturated
Post by: Fright on February 19, 2022, 06:45:10 am
Quote
having 80+ devices with the same name is "not good to do", even if we're just lazy and trying to do testing
for this particular case (test environment, short lease time, same hosts names etc), does it even make sense to register leases on unbound?
imho this only gives an excessive load on the server and the log.
Title: Re: Memory getting saturated
Post by: j_s on February 19, 2022, 12:42:54 pm
Quote
having 80+ devices with the same name is "not good to do", even if we're just lazy and trying to do testing
for this particular case (test environment, short lease time, same hosts names etc), does it even make sense to register leases on unbound?
imho this only gives an excessive load on the server and the log.

So I would say both yes and no.

Yes, because in our production environment, we now know we need to add a singular device and change the name, then go to the next device instead of powering on a ton of devices with identical names thinking that at some point "in the future" we'll rename them and then walking away for long periods of time.  Quite literally, we simply didn't rename them because we were lazy and figured this would give us pre-production time to iron out any bugs we might have with the systems and network infrastructure along with building construction that is ongoing.  We also may or may not have control of the hostnames in some cases, so this would clearly be a serious problem if we didn't have access to change the hostnames.  For me, internally, I now know about this issue and I need to be able to raise a red flag if this does happen.

At the same time, "no", because this really shouldn't have happened.  Generally, it's an obvious bad idea to even have 2 devices with the same hostname. 

I'm generally a fan of "do it once and do it right" rather than doing it twice (or more than twice).

I do really wish that you could tell dhcpd not to give out dhcp leases to two devices with the same name, but I asked that https://forum.opnsense.org/index.php?topic=25612 and nobody responded, so either my question is "you idiot, why are you wanting this" (entirely possible, I do stupid things sometimes) or there really isn't a way.  I did try to read up the docs on dhcpd, and I didn't see any obvious way to simply deny duplicate hostnames.  But back to the issue with this thread.

This also breaks viewing of the logs and could cause production down situations.  Unbound (and the WebGUI) doesn't expect to parse files sized in 10s of MB or more.  Like I mentioned above, trying to view the logs in the WebGUI ultimately took the router's CPU to 100%.  While that didn't seem to have any consequences "this time" I may not be so lucky the next time.

So at this stage, it is important for me to consider that duplicate hostnames are bad and need to be weeded out as a high priority in a production environment.  Additionally, having a VLAN that is a "test environment" will also not avoid potential catastrophe for the production environment unless I'm going to have independent dhcp and/or dns that is somehow isolated from the production systems so as to avoid problems.

Quick data point as well:  I did a "drill whatsminer.site.business" and I got these fun 2 lines at the end of the query...

;; WARNING: The answer packet was truncated; you might want to
;; query again with TCP (-t argument), or EDNS0 (-b for buffer size)

That to me is Unbound laughing at me for being dumb. ;)
Title: Re: Memory getting saturated
Post by: Fright on February 19, 2022, 03:23:21 pm
i have noticed one call on your logs:
Quote
/usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 0 --offset 0
did you really request an unbond logs without limiting the number of rows returned? then I think this can also be a problem (and maybe its better to either remove this option or limit resources for queryLog.py script)
Title: Re: Memory getting saturated
Post by: exuvo on February 19, 2022, 06:57:44 pm
I have had the same problem after the new release and in my case the culprit was firewall logs were growing to 700MB per day (in /var/log/filter which is tmpfs).

Not sure if log settings got reset or something but i unchecked "Log packets matched from the default pass rules put in the ruleset" which was responsible for the majority of log messages and also lowered "Preserve logs (Days)" to 3 days (from default 31 days).

Attached the last 60 hour memory graph, empty space is from lost data as my router ran out of memory and started killing all processes. Could not even ssh in but saw the OOM messages on the connected monitor.
Title: Re: Memory getting saturated
Post by: cookiemonster on February 19, 2022, 10:39:30 pm
Different problems.
OP's is specific to his host naming, or rather lack of, causing Unbound to report many notices, as it should, amplified by the number of hosts.
Title: Re: Memory getting saturated
Post by: j_s on February 21, 2022, 12:18:28 pm
i have noticed one call on your logs:
Quote
/usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 0 --offset 0
did you really request an unbond logs without limiting the number of rows returned? then I think this can also be a problem (and maybe its better to either remove this option or limit resources for queryLog.py script)

Technically, yes.  When you first open the logging page, you have to view the page to make any changes or filter the results.  For me, just the fact of opening the page created the first thread, which was basically never going to finish.

Obviously there was nothing I could do about the first page load, but that seemed to create additional problems even if I added filters.

I do think the queryLog.py should have some kind of resource limit, as there is no way I could have avoided the first thread from getting "stuck" no matter what I did.  I'd assume if your log file is of some arbitrary sufficient size, that no amount of filters will probably "save" you from a dead WebGUI.
Title: Re: Memory getting saturated
Post by: Fright on February 21, 2022, 06:31:07 pm
Quote
Obviously there was nothing I could do about the first page load, but that seemed to create additional problems even if I added filters.
please try to not use "All" in rows count select some time  ;)
if its already there you can try to clear browser store (deleting history or something like
https://www.leadshook.com/help/how-to-clear-local-storage-in-google-chrome-browser/
)
Title: Re: Memory getting saturated
Post by: Fright on February 22, 2022, 06:52:18 am
@j_s
@AdSchellevis merged the pr. can you try?
Code: [Select]
opnsense-patch 4b5a074?
gui should no longer be blocked by requesting a huge number of log rows.
Title: Re: Memory getting saturated
Post by: j_s on February 22, 2022, 08:28:57 pm
@j_s
@AdSchellevis merged the pr. can you try?
Code: [Select]
opnsense-patch 4b5a074?
gui should no longer be blocked by requesting a huge number of log rows.


So for me, the default was "critical".  Each day the resolver log file is around 200MB.  As before, just going to Services -> Unbound DNS -> Log File does cause a single thread to go crazy and the WebGUI sits at "Loading...".

I killed the stuck python process from above and applied your patch.

I then reloaded the lobby and went to Services -> Unbound DNS -> Log File and again a thread went to 100% and ran for at least 90 seconds (that's when I came back here to report my findings).  The UI again sat at "Loading..."  However this time I was able to "Clear log" without issues.  The "stuck" python thread did eventually terminate itself about 10 seconds after I cleared the logs.  I still couldn't ever view them since my 1GB+ of log files from the last few days was clearly too much to parse.
Title: Re: Memory getting saturated
Post by: comet on February 23, 2022, 12:17:12 am
We are also seeing an issue with memory usage, prior to the last upgrade it never got above about 4% but in this new version it just keeps climbing over time.  Now when we reboot it shows 4 to 5% but over the space of several hours it just keeps climbing, so after 24 hours it's up to 13%.  It's a remote system (relative) so we can't let it get so high that it crashes, but the weird part is that no one is even home this week so there should only be very minimal traffic through the router.  And we are not doing any really advanced type stuff, it's basically just acting as a router.

We have no idea where to even start to look for the problem, but again, we're not advanced users so some of this discussion has been completely over our heads.  We do see quite a bit of firewall traffic (about a page full of entries in a minute and a half), and a lot of it is labelled "Default deny rule" but we don't know what to make of that.

Has anyone actually figured out what the problem is, and is there a fix in the works?
Title: Re: Memory getting saturated
Post by: Fright on February 23, 2022, 03:59:32 pm
@j_s
sorry for the delay - generated 1GB+ logs  ;)
from what I see: with the patch installed and "All" rows count settings in browser storage, the script does not request more than 5000 lines, the execution does not take so much time, and the gui is not blocked (with requests of 20 -1000 lines everything works fine and smooth).
although imho a reasonable max limit for the number of simultaneously requested and displayed rows (rows count select) is somewhere around 1000.
if you really constantly need to log such huge data and search in them with a large chunks, then I would look towards using dbs
Title: Re: Memory getting saturated
Post by: j_s on February 25, 2022, 01:36:58 pm
@Fright

I definitely don't need those kinds of logs.  The log problem only came to light because of my Unbound "excessive RAM usage" problem.  I plan to get the duplicate hostname issue taken care of in the near future, at which point my logging should go quiet again.

@comet

I don't think anyone has a fix for the memory problem.  Do you have a bunch of hosts with the same hostname?  Do your logs show anything?

I'm managing about a dozen Opnsense systems, and I'm only seeing this issue on 1 machine and that machine also has tons of logging due to having like 100 devices with all the same hostname.

I'm somewhat convinced that unbound has a memory leak for some kind of edge case.  I just don't know how me saying what I think would help the unbound team, and I suspect even if I offered ssh access to the system in question, I don't know that I'd expect the unbound team to respond.  I just don't think this problem is freuquenty enough to say that it really is a problem that isn't a "me" problem.