Memory getting saturated

Started by mmaridev, February 10, 2022, 12:20:01 PM

Different problems.
The OP's problem is specific to his host naming, or rather the lack of it, which causes Unbound to log many notices (as it should), amplified by the number of hosts.

Quote from: Fright on February 19, 2022, 03:23:21 PM
I noticed one call in your logs:
Quote: /usr/local/bin/python3 /usr/local/opnsense/scripts/systemhealth/queryLog.py --limit 0 --offset 0
Did you really request the Unbound logs without limiting the number of rows returned? If so, I think this can also be a problem (and maybe it's better to either remove this option or limit resources for the queryLog.py script).

Technically, yes.  When you first open the logging page, you have to view the page to make any changes or filter the results.  For me, just the fact of opening the page created the first thread, which was basically never going to finish.

Obviously there was nothing I could do about the first page load, but that seemed to create additional problems even if I added filters.

I do think queryLog.py should have some kind of resource limit, as there was no way I could have kept the first thread from getting "stuck" no matter what I did.  I'd assume that once your log file reaches some sufficient size, no amount of filtering will "save" you from a dead WebGUI.
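For illustration, one way a log reader can stay bounded no matter how large the file grows is to read backwards from the end of the file and stop once enough rows have been collected. This is only a sketch and not the actual queryLog.py code; the function name `tail_lines`, its parameters, and the default of 1000 rows are assumptions made for the example.

```python
import os


def tail_lines(path, max_rows=1000, block_size=64 * 1024):
    """Return at most max_rows of the last lines in a file.

    Hypothetical helper: reads fixed-size blocks backwards from the end,
    so the work done depends on max_rows rather than on the file size.
    """
    if max_rows <= 0:
        # In this sketch a non-positive limit means "nothing", never "everything".
        return []
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        data = b""
        # Keep reading until more than max_rows newlines are buffered, which
        # guarantees the first line we keep is complete, or until the start
        # of the file is reached.
        while pos > 0 and data.count(b"\n") <= max_rows:
            step = min(block_size, pos)
            pos -= step
            fh.seek(pos)
            data = fh.read(step) + data
    lines = data.splitlines()
    return [ln.decode("utf-8", errors="replace") for ln in lines[-max_rows:]]
```

With a reader like this, opening the log page on a 200MB file would only touch the last few blocks of the file instead of parsing all of it.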

Quote: Obviously there was nothing I could do about the first page load, but that seemed to create additional problems even if I added filters.
Please try not to select "All" in the row count selector for a while  ;)
If it's already set, you can try clearing the browser's local storage (by deleting history, or something like https://www.leadshook.com/help/how-to-clear-local-storage-in-google-chrome-browser/ ).

@j_s
@AdSchellevis merged the PR. Can you try it?
opnsense-patch 4b5a074
The GUI should no longer be blocked by requesting a huge number of log rows.

Quote from: Fright on February 22, 2022, 06:52:18 AM
@j_s
@AdSchellevis merged the PR. Can you try it?
opnsense-patch 4b5a074
The GUI should no longer be blocked by requesting a huge number of log rows.


So for me, the default was "critical".  Each day the resolver log file is around 200MB.  As before, just going to Services -> Unbound DNS -> Log File does cause a single thread to go crazy and the WebGUI sits at "Loading...".

I killed the stuck python process from above and applied your patch.

I then reloaded the lobby and went to Services -> Unbound DNS -> Log File and again a thread went to 100% and ran for at least 90 seconds (that's when I came back here to report my findings).  The UI again sat at "Loading...".  However, this time I was able to "Clear log" without issues.  The "stuck" python thread did eventually terminate itself about 10 seconds after I cleared the logs.  I still couldn't ever view them, since my 1GB+ of log files from the last few days was clearly too much to parse.

We are also seeing an issue with memory usage.  Prior to the last upgrade it never got above about 4%, but in this new version it just keeps climbing over time.  Now when we reboot it shows 4 to 5%, but over the space of several hours it keeps climbing, so after 24 hours it's up to 13%.  It's a (relatively) remote system, so we can't let it get so high that it crashes.  The weird part is that no one is even home this week, so there should only be very minimal traffic through the router.  And we are not doing any really advanced type stuff; it's basically just acting as a router.

We have no idea where to even start to look for the problem, but again, we're not advanced users so some of this discussion has been completely over our heads.  We do see quite a bit of firewall traffic (about a page full of entries in a minute and a half), and a lot of it is labelled "Default deny rule" but we don't know what to make of that.

Has anyone actually figured out what the problem is, and is there a fix in the works?
I'm a home user of OPNsense, not a networking expert.  I'd much appreciate it if you'd keep that in mind if replying to something I posted.  Many thanks!

@j_s
Sorry for the delay - I had to generate 1GB+ of logs  ;)
From what I see: with the patch installed and the "All" row count setting in browser storage, the script does not request more than 5000 lines, the execution does not take so much time, and the GUI is not blocked (with requests of 20-1000 lines everything works fine and smooth).
Although IMHO a reasonable max limit for the number of simultaneously requested and displayed rows (the row count selector) is somewhere around 1000.
If you really need to log such huge amounts of data constantly and search through them in large chunks, then I would look towards using a database.
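As a rough illustration of the row cap described above, a requested row count can be clamped before the log is read. This is a sketch only and does not reproduce how the patch does it; the name `clamp_limit` is hypothetical, the 5000 figure is the one observed above, and treating 0 as "All" is inferred from the `--limit 0` call quoted earlier in the thread.

```python
MAX_ROWS = 5000  # cap observed with the patch applied; how the patch enforces it is not shown here


def clamp_limit(requested, max_rows=MAX_ROWS):
    """Return a safe row count for a log query (hypothetical helper).

    Assumption: a requested value of 0 (or less) stands for "All" rows,
    so it is replaced by the cap instead of being passed through.
    """
    if requested <= 0 or requested > max_rows:
        return max_rows
    return requested
```

Lowering `max_rows` to 1000 would match the smaller limit suggested above.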

@Fright

I definitely don't need those kinds of logs.  The log problem only came to light because of my Unbound "excessive RAM usage" problem.  I plan to get the duplicate hostname issue taken care of in the near future, at which point my logging should go quiet again.

@comet

I don't think anyone has a fix for the memory problem.  Do you have a bunch of hosts with the same hostname?  Do your logs show anything?

I'm managing about a dozen OPNsense systems, and I'm only seeing this issue on 1 machine, and that machine also has tons of logging due to having something like 100 devices all with the same hostname.
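To check whether a network is in the same duplicate-hostname situation, the hostnames can be grouped and any name used by more than one address reported. The sketch below assumes you already have (hostname, address) pairs from somewhere, for example copied out of a lease list; the function name and the input format are made up for illustration and are not an OPNsense or Unbound API.

```python
from collections import defaultdict


def find_duplicate_hostnames(leases):
    """Map each hostname used by more than one address to its addresses.

    `leases` is an iterable of (hostname, address) pairs; this input
    format is an assumption for the example. Hostnames are compared
    case-insensitively, as DNS names are.
    """
    by_name = defaultdict(set)
    for hostname, address in leases:
        by_name[hostname.strip().lower()].add(address)
    return {
        name: sorted(addrs)
        for name, addrs in by_name.items()
        if len(addrs) > 1
    }
```

An empty result means every hostname in the input maps to a single address.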

I'm somewhat convinced that Unbound has a memory leak for some kind of edge case.  I just don't know how me saying what I think would help the Unbound team, and even if I offered ssh access to the system in question, I don't know that I'd expect the Unbound team to respond.  I just don't think this problem is frequent enough to say that it really is a general problem rather than a "me" problem.