Unbound errors in log file

Started by lar.hed, December 31, 2023, 12:28:43 PM

Previous topic - Next topic
Quote from: lar.hed on January 04, 2024, 12:11:43 PM
Quote from: cookiemonster on January 03, 2024, 11:59:10 PM
right. Then OP as per this, maybe you could do this when you hit the problem:
$diff -u /var/unbound/root.hints /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints
That will compare them and report differences.

Absolutely! No problem what so ever - but I think I might need to re-enable the blocklist part for this? Or?

I leave the blocklist disabled for now, until I get "other orders" from you  ;D

Edit - for the moment the two files are identical, just so we all know.
I can only suggest :)
We have two different theories, one that the problem is the hints file. The other that is the blocklist(s). Can't rule out a combination though.
I suggest to leave a suitable time with the lists disabled. If the problem occurs, compare the hints file. Report what are the contents and the differences if you can.
If enough time has passed, re-enable the lists and do the same thing when it reoccurs.
This will leave us to continue looking at what is the use of blocklist doing IMHO.

Quote from: Fright on January 04, 2024, 03:29:57 PM
looks like a 'template generation' vs 'unbound start' racing for me
@lar.hed how many "generate template OPNsense/Unbound/*" messages in Backend log you see in time when this happens?
Glad to have you around on the discussion @Fright, sorry typing at the same time. I'll take a step back to allow your query to blossom. Most likely you'll get to the bottom of it.

see:

https://forum.opnsense.org/index.php?topic=37912.msg185803#new

o unbound: update root hints

might this be related in any way?

Running unbound with latest OPN and DoT/ block lists without problems...
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

@cookiemonster Hi) Don't worry - this is just a guess for now.
I can't reproduce the root hints error, but if I let the unbound work (accumulate cache?), and then execute (can be done several times in a row):
configctl -dq unbound restart ; configctl -dq unbound restart ; configctl -dq unbound restart ; configctl -dq unbound restart ; configctl -dq unbound restart
On my test VM the Python module is somehow 'breaks':
[34585:0] fatal error: failed to setup modules
[34585:0] error: pythonmod: function init is missing in dnsbl_module.py

which in my opinion looks similar
perhaps the issue reproducibility is highly hardware dependent (test vm is quite slow in performance and does not have an SSD or some)

Quote from: CJ on January 04, 2024, 02:35:36 PM
Quote from: lar.hed on January 03, 2024, 08:23:10 PM
Question: Why should one use Unbound if I can not use Block Lists, Block Countries and DoT?

Not sure what Block Countries is, but I use several DNSBL and DoT with Unbound and have no problems.

Ah So Sorry, my mistake - Block Countries is actually part of firewall packet filtering, Aliases - again I am sorry for the mix up!

Quote from: Fright on January 04, 2024, 03:29:57 PM
looks like a 'template generation' vs 'unbound start' racing for me
@lar.hed how many "generate template OPNsense/Unbound/*" messages in Backend log you see in time when this happens?

Hmmm Zero.

I have none so far in my backend log - although I have to admit I am not used to the backend log, when does it rotate and so on might be part of me not finding anything? Even so I have 2023-12-22 as the lowest date with "unbound" records (they are "unbound cache dump" - so most likely something else? since I had the error after that date but no records at thoose dates in backend log even if I only search for "unbound")

January 04, 2024, 08:11:58 PM #36 Last Edit: January 04, 2024, 08:14:19 PM by lar.hed
Btw I got some errors in the Unbound Log file again - however no 100% CPU or anything, just this error:

2024-01-04T18:13:30 Notice unbound Backgrounding unbound logging backend.
2024-01-04T18:13:30 Notice unbound daemonize unbound dhcpd watcher.
2024-01-04T18:13:30 Notice unbound [26817:0] notice: init module 2: iterator
2024-01-04T18:13:30 Notice unbound [26817:0] notice: init module 1: validator
2024-01-04T18:13:30 Notice unbound [26817:0] notice: init module 0: python
TypeError: an integer is required (got type NoneType)
os.write(self._pipe_fd, res.encode())
File "dnsbl_module.py", line 227, in log_entry
logger.log_entry(query)
File "dnsbl_module.py", line 548, in operate
2024-01-04T18:13:29 Error unbound [35261:7] error: pythonmod: python error: Traceback (most recent call last):
2024-01-04T18:13:29 Error unbound [35261:7] error: pythonmod: Exception occurred in function operate, event: module_event_moddone
2024-01-04T18:13:29 Notice unbound Closing logger


And just to be clear: Block lists are still disabled. even though I can see in the log above a reference to dnsbl_module.py. That said, I also have safe search enabled (the row below the Enable) - but I can not see why safe search should be active if blocklist is disabled. Well, just to add to the confusion I guess?

QuoteWell, just to add to the confusion I guess?
yes, a little  ;), but not because of the dnsbl_module.py module reference - the name of the module is a bit misleading: in fact, this is not a block list module per se, but a Python module (pyunbound) in general, including it's involved in the stats work (Reporting: Unbound DNS) and may be more in business edition.
Something else is confusing:
how is it possible that unbound (re)starts without templates generation (some custom procedures used to control unbound state?)?
and TypeError in py-module. it looks like logger closing pipe and dnsbl_module should catch this with BrokenPipeError exception (just "Logging backend closed connection" message in unbound log).
So far I have not been able to reproduce this but I'll try again )

Ah - I made a fool of my self then, falling into that very well known "Assumption is the mother of all f*ckups" - right  8)

About your confusing thoughts: No I have nothing special custom procedure or anything (I have not come that far yet, give me another couple of months?). However there might be one thing, I do have my own "URLs of Blocklists" setup - can not really see a problem with that, but who knows, I just proves I should not make any assumptions...

I will attach a few screen dumps, from all settings, so it easier to validate what I have (if one has not already grabbed that from my config copy above).


Oh just remembered that I do have Monit setup for monitoring Unbound now days, so yes there is something hidden there - can not say if it is what you might be looking for?

The stop script, that is called is just one line (well except for the first line then):

#!/bin/csh

pgrep "unbound" | grep -v "$$" | xargs kill -9


It is the only way I can stop Unbound when it hits 100% CPU on one core - nothing else works.

January 06, 2024, 07:09:52 PM #41 Last Edit: January 06, 2024, 07:18:11 PM by Fright
played a little more.
It turns out that the TypeError-error can be reproduced on unbound restart if a queue is artificially created in the log_entry function (like event.wait(0.5)) (it reproduced with or without https://github.com/opnsense/core/commit/862381660d05ef1f69d69be9bf542f0d03ea1b03 patch. I think the lock should be set earlier in the log_entry function. but not tested it)
However, it seems to me that this error is not a cause, but a consequence: something is accumulating a queue and the service is consuming CPU time. The question is which of this is the cause and which is the effect.
IMHO you could try disabling statistics collection (Reporting: Settings -> unset the "Enables local gathering of statistics."), restart unbound and observe - maybe we can reduce the number of variables in the equation.

About monit settings - i'm not sure that the '/usr/sbin/service' is the right method for unbound (re)start. What config unbound uses if started with this command?
but this explains the lack of 'templates' entries in the backend log, yes  :)

Quote from: Fright on January 06, 2024, 07:09:52 PM
played a little more.
It turns out that the TypeError-error can be reproduced on unbound restart if a queue is artificially created in the log_entry function (like event.wait(0.5)) (it reproduced with or without https://github.com/opnsense/core/commit/862381660d05ef1f69d69be9bf542f0d03ea1b03 patch. I think the lock should be set earlier in the log_entry function. but not tested it)
However, it seems to me that this error is not a cause, but a consequence: something is accumulating a queue and the service is consuming CPU time. The question is which of this is the cause and which is the effect.
IMHO you could try disabling statistics collection (Reporting: Settings -> unset the "Enables local gathering of statistics."), restart unbound and observe - maybe we can reduce the number of variables in the equation.

Done - or should I re-enable blocking lists also?

Quote from: Fright on January 06, 2024, 07:09:52 PMAbout monit settings - i'm not sure that the '/usr/sbin/service' is the right method for unbound (re)start. What config unbound uses if started with this command?
but this explains the lack of 'templates' entries in the backend log, yes  :)

Aahhhh yea I have found a few different ways to stop/start unbound, and in the end this is the one I so to speak happend to choose - it seems to be the wrong one then? May I ask for the correct version to start Unbound?

QuoteDone - or should I re-enable blocking lists also?
hm. It would be interesting either way. but if it works stably with BL turned on, then the second one won't be needed )
Quotecorrect version to start Unbound?
'/usr/local/sbin/pluginctl -c unbound_start' i think

Done, and changed. Now we wait...... 8)