So I have, still, Unbound problemes....
With no clue what so ever it just hangs on one core (100%) and stops responding to DNS requests. This has made me, on top of other challenges (like filter rules and live view), a bit confused. So I have decided to see if anyone could provide any information on my log file (attached)?
What I have done so far is change from SMPT server name to SMTP IP adress so I can get emails from Monit which monitors if a) Unbound hits 100% on one core or b) the CPU goes very high (which so far only happened when Unbound hits 100%, but I think it is a good indicator that something might not be okay so it is a bit more of a general Monit script so to speak), if can email me (since no longer depending on Unbound to resolve SMTP name) and I have been starting to develop a script that will do kill -9 (yes, this is what seems to be needed when this occurs - I know it is last resort, but I can not have a firewall that hangs - need to fix it somehow) and then pluginctl unbound start. This will be completed next time this happens - which I never knows when it will happen, funny enough...
So, anyone that can bring light into my challenge? Or do I go the hard way (kill -9)?
Can you try this ? Post the output i it fails somewhere.
https://forum.opnsense.org/index.php?topic=35527.msg176809#msg176809 (https://forum.opnsense.org/index.php?topic=35527.msg176809#msg176809)
I for sure will - next time, it may just take a few days....
Right, so this morning Unbound hit 100% CPU on one core - I copy&pasted the command from the other thread, and here is the result:
Quoteroot@OPNsense:~ # service unbound stop && rm -v /var/unbound/root.hints & cp -v /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints /var/unbound/root.hints && service unbound onestart
[1] 13720
/usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints -> /var/unbound/root.hints
Cannot 'stop' unbound. Set unbound_enable to YES in /etc/rc.conf or use 'onestop' instead of 'stop'.
Obtaining a trust anchor../var/unbound/root.hints
.
Starting unbound.
[1] + Done ( service unbound stop && rm -v /var/unbound/root.hints )
And Unbound is still killing one core with 100% CPU - so that did not do any difference I say.
A bit curios though about the "cannot stop" - since well I have to use Kill -9 as I said - and this time I got some more messages, the one with "unbound_enable" to YES makes me most curios though since Unbound IS enabled. And running, just 100% CPU running and because of that not responding to any requests....
Use onestop. Or use configctl unbound <stop|start> - OPNs does not use service normally, hence no unbound_enable.
I have tried:
- the configctl stop - does not stop the Unbound process
- onestop - does not stop the Unbound process
The only way so far is:
pgrep "unbound" | grep -v "$$" | xargs kill -9
Or just kill the process manually of course - but I did a small script for this...
Does not stop does not help diagnosing anything. So, what is returned when running those commands?
Sorry for not being clear about that.
Now I can not remember any return at all, except that the process just keeps running 100% in one core. I will double check this next time this happens, and as I wrote before, I never know or understand why Unbound goes 100% CPU on one core. For example the error output from the log for this morning occurrence is only this:
Quote2024-01-03T10:28:26 Notice unbound Backgrounding unbound logging backend.
2024-01-03T10:28:26 Notice unbound daemonize unbound dhcpd watcher.
2024-01-03T10:28:26 Critical unbound [38957:2] fatal error: Could not initialize thread
2024-01-03T10:28:26 Error unbound [38957:2] error: Could not set root or stub hints
2024-01-03T10:28:26 Error unbound [38957:2] error: reading root hints /root.hints 2:2: Syntax error, could not parse the RR's TTL
2024-01-03T10:28:26 Notice unbound [38957:0] notice: init module 2: iterator
2024-01-03T10:28:26 Notice unbound [38957:0] notice: init module 1: validator
2024-01-03T10:28:15 Notice unbound [38957:0] notice: init module 0: python
2024-01-03T10:28:12 Notice unbound Closing logger/quote]
EDIT: CHanged to "Notice" on Log output.... It may give more of a hint?
If you don't mind me butting in, from your log and previous posts on the other thread, my thinking is that you have something in your additional Unbound settings that tickle the problem and eventually crashes it.
I imagine you are still using either blocklists or DoT or both and there is something in either that triggers it.
The other thread I always thought it was strange why just in your case you'd have a problem with the root.hints file. Script to get around it is masking the problem. Finding the root of the problem is the real pickle.
So, indulge my thinking. Are you using blocklists, DoT?
Be prepared, It might be that you'd want to consider we'd have to go to a mailing list https://nlnetlabs.nl/support/mailing-lists/ to get to it.
Yes, I use blocklists and DoT.
No Worries!
Here is the config part of Unbound (or UnboundPlus as it is called) - feel free :-)
<unboundplus version="1.0.8">
<general>
<enabled>1</enabled>
<port>53</port>
<stats>1</stats>
<active_interface/>
<dnssec>1</dnssec>
<dns64>0</dns64>
<dns64prefix>64:ff9b::/96</dns64prefix>
<noarecords>0</noarecords>
<regdhcp>1</regdhcp>
<regdhcpdomain/>
<regdhcpstatic>1</regdhcpstatic>
<noreglladdr6>0</noreglladdr6>
<noregrecords>0</noregrecords>
<txtsupport>0</txtsupport>
<cacheflush>0</cacheflush>
<local_zone_type>transparent</local_zone_type>
<outgoing_interface/>
<enable_wpad>0</enable_wpad>
</general>
<advanced>
<hideidentity>0</hideidentity>
<hideversion>0</hideversion>
<prefetch>0</prefetch>
<prefetchkey>0</prefetchkey>
<dnssecstripped>0</dnssecstripped>
<serveexpired>0</serveexpired>
<serveexpiredreplyttl/>
<serveexpiredttl/>
<serveexpiredttlreset>0</serveexpiredttlreset>
<serveexpiredclienttimeout/>
<qnameminstrict>0</qnameminstrict>
<extendedstatistics>0</extendedstatistics>
<logqueries>0</logqueries>
<logreplies>0</logreplies>
<logtagqueryreply>0</logtagqueryreply>
<logservfail>0</logservfail>
<loglocalactions>0</loglocalactions>
<logverbosity>1</logverbosity>
<valloglevel>0</valloglevel>
<privatedomain/>
<privateaddress>0.0.0.0/8,10.0.0.0/8,100.64.0.0/10,169.254.0.0/16,172.16.0.0/12,192.0.2.0/24,192.168.0.0/16,198.18.0.0/15,198.51.100.0/24,203.0.113.0/24,233.252.0.0/24,::1/128,2001:db8::/32,fc00::/8,fd00::/8,fe80::/10</privateaddress>
<insecuredomain/>
<msgcachesize/>
<rrsetcachesize/>
<outgoingnumtcp/>
<incomingnumtcp/>
<numqueriesperthread/>
<outgoingrange/>
<jostletimeout/>
<cachemaxttl/>
<cachemaxnegativettl/>
<cacheminttl/>
<infrahostttl/>
<infrakeepprobing>0</infrakeepprobing>
<infracachenumhosts/>
<unwantedreplythreshold/>
</advanced>
<acls>
<default_action>allow</default_action>
</acls>
<dnsbl>
<enabled>1</enabled>
<safesearch>1</safesearch>
<type>atf,oisd0,oisd1,oisd2,aa,ag,bla0,bla,blc,bld,blf,blf0,blg,blm,blp,blp0,blp1,blr,blr0,bls,blt,blt0,blt1,el,ep,nc,pa,pt,sa,st,sb,ws,yy</type>
<lists>https://raw.githubusercontent.com/larhedse/hostnamelistan/master/BaraHostLista.txt</lists>
<whitelists>dl-debug.dropbox.com</whitelists>
<blocklists/>
<wildcards/>
<address/>
<nxdomain>0</nxdomain>
</dnsbl>
<forwarding>
<enabled>0</enabled>
</forwarding>
<dots>
<dot uuid="f6bbdefd-289e-4e72-9ca0-fa089405a36c">
<enabled>1</enabled>
<type>dot</type>
<domain/>
<server>1.1.1.1</server>
<port>853</port>
<verify>cloudflare-dns.com</verify>
</dot>
<dot uuid="c9ec320c-281c-4183-bfbf-70169696bc92">
<enabled>1</enabled>
<type>dot</type>
<domain/>
<server>1.0.0.1</server>
<port>853</port>
<verify>cloudflare-dns.com</verify>
</dot>
</dots>
<hosts/>
<aliases/>
<domains/>
</unboundplus>
Where did you get this config file from? i wonder so compare mine with the UI.
Meantime the inner exception i see the common denominator a failure in python processing dnsbl_module.py which made me suspect the blocklist element, and makes me think the "integer required" is either the problem or a symptom. From there I'd be thinking something in the list can't be handled.
I'd remove it and observe if makes a difference. Ideally parse the list for anything obvious.
I don't use said blocklists on Unbound. As I've said in the past, I've seen them as causing difficult to diagnose problems. Instead I use AdGuardHome and lists go there.
That part of the config file comes from the backup config file (under System -> Backups -> Download configuration).
Just so I understand: Should I disable dnsbl? And move just that (blocking countries and lists) to AdGuardHome (which I know nothing about, never used so newbie on that)?
It is worth noting that I can not see any kind of pattern regarding WHEN this happens (when the CPU goes 100% on the Unbound process, on one core - I am great full that I have a bunch more cores for the rest...). It is not like x hours since last, or y number of calls or anything like "full disk" or "log full" or anything like that. It just starts to not respond and up goes the CPU% on that core (and the CPU get a bit warmer of course - and this is a passive cold CPU). And I can not get any for me useful information from the log in the GUI - is there anywhere else I could look?
My thinking is that this last screenshot of the log has been the most useful right now. Pity wasn't in code brackets, hate working off screenshots. I digress.
My thinking (which can very well be off) is that something on that blocklist(s) are what trip an Unbound process .
dnsbl_module.py seems to choke , being called by operate - module_event_moddone. The rest in this log is result of it, like the process running away and consuming cpu cycles. That's my take.
So I would disable block list and see if that resolves. Might need to wait some time.
If not, remove DoT too.
And if you must have blocklists, use AdGH as workaround.
p.s. still IMHO this is one for the Unbound vendor to help solve, not for OPN. Hence the hint to their mail lists.
Sorry for that - still adapting to this forums software - I might be a bit to old ;-) Or something :-)
Anyway, I turned off block lists - now it is just a waiting game......
Unbound crashed this time - just stopped without any for me visible reason. From Unbound Log.
2024-01-03T20:17:27 Informational monit 'unbound' trying to restart
2024-01-03T20:17:23 Error monit 'unbound' process is not running
Question: Why should one use Unbound if I can not use Block Lists, Block Countries and DoT?
We haven't proven that my thinking theory is right and the block list is the problem or a contributor. I could very well be wrong.
Albeit your question is generic, I imagine other users don't have a problem, and therefore the cause is specific. In light of this, I can't answer your question but can keep trying to help if you allow.
Again, with my thinking that needs proving, may I ask where are you getting the blocklist from and how are you adding it to the Unbound configuration? Same for whitelist you seem to have included.
Maybe someone could finally post the contents of /var/unbound/root.hints when hitting this issue.
maybe useful but looking at the log screenshot, that seems -to me-, something that fails after.
Yeah, maybe, considering unbound is a recursive resolver, unless configured otherwise, it obviously cannot work without knowing what the root servers are.
agreed but I'm looking at the actual error posted and the line appears clear and the function has:
265 def _load_dnsbl(self):
266 with open(self.dnsbl_path, 'r') as f:
267 try:
268 self.dnsbl = json.load(f)
269 log_info('dnsbl_module: blocklist loaded. length is %d' % len(self.dnsbl['data']))
270 with open(self.size_file, 'w') as sfile:
271 sfile.write(str(len(self.dnsbl['data'])))
272 if mod_env and type(self.dnsbl.get('config')) is dict:
273 mod_env['context'].set_config(self.dnsbl['config'])
274 except (json.decoder.JSONDecodeError, KeyError) as e:
275 if not self.dnsbl:
276 log_err("dnsbl_module: unable to bootstrap blocklist, this is likely due to a corrupted \
277 file. Please re-apply the blocklist settings.")
278 self.dnsbl_available = False
279 return
280 else:
281 log_err("dnsbl_module: error parsing blocklist: %s, reusing last known list" % e)
See the line 277 referenced in the inner exception and what the code is trying to do. Seems to me more relevant to blocklist, but maybe I am on the wrong track.
Well, the idea was: if the default config file was not pointing to the (malformed) one, it'd use the built-in ones just fine. Remove that line form the template and try to reproduce the error after that.
Are you referring to the root.hints file? If yes, then I don't know what line from what template you're suggesting to modify. The particular template where this class and function is, has no reference to the hints file.
# grep root.hints /usr/local/etc/inc/plugins.inc.d/unbound.inc
root-hints: /var/unbound/root.hints
https://github.com/opnsense/core/blob/master/src/etc/inc/plugins.inc.d/unbound.inc#L286
right. Then OP as per this, maybe you could do this when you hit the problem:
$diff -u /var/unbound/root.hints /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints
That will compare them and report differences.
Quote from: cookiemonster on January 03, 2024, 11:59:10 PM
right. Then OP as per this, maybe you could do this when you hit the problem:
$diff -u /var/unbound/root.hints /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints
That will compare them and report differences.
Absolutely! No problem what so ever - but I think I might need to re-enable the blocklist part for this? Or?
I leave the blocklist disabled for now, until I get "other orders" from you ;D
Edit - for the moment the two files are identical, just so we all know.
One thing is a bit important here: The previous version of Unbound (1.04?) that was in OPNsense pre-23.7, worked 100% with my config. All this started after I updated to 23.7, so what ever it is that I am having challenges with, it is part of something that changed (1.04 -> 1.08).
Quote from: lar.hed on January 03, 2024, 08:23:10 PM
Question: Why should one use Unbound if I can not use Block Lists, Block Countries and DoT?
Not sure what Block Countries is, but I use several DNSBL and DoT with Unbound and have no problems.
Quote from: doktornotor on January 03, 2024, 11:22:34 PM
Yeah, maybe, considering unbound is a recursive resolver, unless configured otherwise, it obviously cannot work without knowing what the root servers are.
Unbound shouldn't be running in recursive mode since the OP has DoT configured.
looks like a 'template generation' vs 'unbound start' racing for me
@lar.hed how many "generate template OPNsense/Unbound/*" messages in Backend log you see in time when this happens?
Quote from: lar.hed on January 04, 2024, 12:11:43 PM
Quote from: cookiemonster on January 03, 2024, 11:59:10 PM
right. Then OP as per this, maybe you could do this when you hit the problem:
$diff -u /var/unbound/root.hints /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints
That will compare them and report differences.
Absolutely! No problem what so ever - but I think I might need to re-enable the blocklist part for this? Or?
I leave the blocklist disabled for now, until I get "other orders" from you ;D
Edit - for the moment the two files are identical, just so we all know.
I can only suggest :)
We have two different theories, one that the problem is the hints file. The other that is the blocklist(s). Can't rule out a combination though.
I suggest to leave a suitable time with the lists disabled. If the problem occurs, compare the hints file. Report what are the contents and the differences if you can.
If enough time has passed, re-enable the lists and do the same thing when it reoccurs.
This will leave us to continue looking at what is the use of blocklist doing IMHO.
Quote from: Fright on January 04, 2024, 03:29:57 PM
looks like a 'template generation' vs 'unbound start' racing for me
@lar.hed how many "generate template OPNsense/Unbound/*" messages in Backend log you see in time when this happens?
Glad to have you around on the discussion @Fright, sorry typing at the same time. I'll take a step back to allow your query to blossom. Most likely you'll get to the bottom of it.
see:
https://forum.opnsense.org/index.php?topic=37912.msg185803#new
o unbound: update root hints
might this be related in any way?
Running unbound with latest OPN and DoT/ block lists without problems...
@cookiemonster Hi) Don't worry - this is just a guess for now.
I can't reproduce the root hints error, but if I let the unbound work (accumulate cache?), and then execute (can be done several times in a row):
configctl -dq unbound restart ; configctl -dq unbound restart ; configctl -dq unbound restart ; configctl -dq unbound restart ; configctl -dq unbound restart
On my test VM the Python module is somehow 'breaks':
[34585:0] fatal error: failed to setup modules
[34585:0] error: pythonmod: function init is missing in dnsbl_module.py
which in my opinion looks similar
perhaps the issue reproducibility is highly hardware dependent (test vm is quite slow in performance and does not have an SSD or some)
Quote from: CJ on January 04, 2024, 02:35:36 PM
Quote from: lar.hed on January 03, 2024, 08:23:10 PM
Question: Why should one use Unbound if I can not use Block Lists, Block Countries and DoT?
Not sure what Block Countries is, but I use several DNSBL and DoT with Unbound and have no problems.
Ah So Sorry, my mistake - Block Countries is actually part of firewall packet filtering, Aliases - again I am sorry for the mix up!
Quote from: Fright on January 04, 2024, 03:29:57 PM
looks like a 'template generation' vs 'unbound start' racing for me
@lar.hed how many "generate template OPNsense/Unbound/*" messages in Backend log you see in time when this happens?
Hmmm Zero.
I have none so far in my backend log - although I have to admit I am not used to the backend log, when does it rotate and so on might be part of me not finding anything? Even so I have 2023-12-22 as the lowest date with "unbound" records (they are "unbound cache dump" - so most likely something else? since I had the error after that date but no records at thoose dates in backend log even if I only search for "unbound")
Btw I got some errors in the Unbound Log file again - however no 100% CPU or anything, just this error:
2024-01-04T18:13:30 Notice unbound Backgrounding unbound logging backend.
2024-01-04T18:13:30 Notice unbound daemonize unbound dhcpd watcher.
2024-01-04T18:13:30 Notice unbound [26817:0] notice: init module 2: iterator
2024-01-04T18:13:30 Notice unbound [26817:0] notice: init module 1: validator
2024-01-04T18:13:30 Notice unbound [26817:0] notice: init module 0: python
TypeError: an integer is required (got type NoneType)
os.write(self._pipe_fd, res.encode())
File "dnsbl_module.py", line 227, in log_entry
logger.log_entry(query)
File "dnsbl_module.py", line 548, in operate
2024-01-04T18:13:29 Error unbound [35261:7] error: pythonmod: python error: Traceback (most recent call last):
2024-01-04T18:13:29 Error unbound [35261:7] error: pythonmod: Exception occurred in function operate, event: module_event_moddone
2024-01-04T18:13:29 Notice unbound Closing logger
And just to be clear: Block lists are still disabled. even though I can see in the log above a reference to dnsbl_module.py. That said, I also have safe search enabled (the row below the Enable) - but I can not see why safe search should be active if blocklist is disabled. Well, just to add to the confusion I guess?
QuoteWell, just to add to the confusion I guess?
yes, a little ;), but not because of the dnsbl_module.py module reference - the name of the module is a bit misleading: in fact, this is not a block list module per se, but a Python module (pyunbound) in general, including it's involved in the stats work (Reporting: Unbound DNS) and may be more in business edition.
Something else is confusing:
how is it possible that unbound (re)starts without templates generation (some custom procedures used to control unbound state?)?
and TypeError in py-module. it looks like logger closing pipe and dnsbl_module should catch this with BrokenPipeError exception (just "Logging backend closed connection" message in unbound log).
So far I have not been able to reproduce this but I'll try again )
Ah - I made a fool of my self then, falling into that very well known "Assumption is the mother of all f*ckups" - right 8)
About your confusing thoughts: No I have nothing special custom procedure or anything (I have not come that far yet, give me another couple of months?). However there might be one thing, I do have my own "URLs of Blocklists" setup - can not really see a problem with that, but who knows, I just proves I should not make any assumptions...
I will attach a few screen dumps, from all settings, so it easier to validate what I have (if one has not already grabbed that from my config copy above).
Part 2 of screenshots...
Oh just remembered that I do have Monit setup for monitoring Unbound now days, so yes there is something hidden there - can not say if it is what you might be looking for?
The stop script, that is called is just one line (well except for the first line then):
#!/bin/csh
pgrep "unbound" | grep -v "$$" | xargs kill -9
It is the only way I can stop Unbound when it hits 100% CPU on one core - nothing else works.
played a little more.
It turns out that the TypeError-error can be reproduced on unbound restart if a queue is artificially created in the log_entry function (like event.wait(0.5)) (it reproduced with or without https://github.com/opnsense/core/commit/862381660d05ef1f69d69be9bf542f0d03ea1b03 patch. I think the lock should be set earlier in the log_entry function. but not tested it)
However, it seems to me that this error is not a cause, but a consequence: something is accumulating a queue and the service is consuming CPU time. The question is which of this is the cause and which is the effect.
IMHO you could try disabling statistics collection (Reporting: Settings -> unset the "Enables local gathering of statistics."), restart unbound and observe - maybe we can reduce the number of variables in the equation.
About monit settings - i'm not sure that the '/usr/sbin/service' is the right method for unbound (re)start. What config unbound uses if started with this command?
but this explains the lack of 'templates' entries in the backend log, yes :)
Quote from: Fright on January 06, 2024, 07:09:52 PM
played a little more.
It turns out that the TypeError-error can be reproduced on unbound restart if a queue is artificially created in the log_entry function (like event.wait(0.5)) (it reproduced with or without https://github.com/opnsense/core/commit/862381660d05ef1f69d69be9bf542f0d03ea1b03 patch. I think the lock should be set earlier in the log_entry function. but not tested it)
However, it seems to me that this error is not a cause, but a consequence: something is accumulating a queue and the service is consuming CPU time. The question is which of this is the cause and which is the effect.
IMHO you could try disabling statistics collection (Reporting: Settings -> unset the "Enables local gathering of statistics."), restart unbound and observe - maybe we can reduce the number of variables in the equation.
Done - or should I re-enable blocking lists also?
Quote from: Fright on January 06, 2024, 07:09:52 PMAbout monit settings - i'm not sure that the '/usr/sbin/service' is the right method for unbound (re)start. What config unbound uses if started with this command?
but this explains the lack of 'templates' entries in the backend log, yes :)
Aahhhh yea I have found a few different ways to stop/start unbound, and in the end this is the one I so to speak happend to choose - it seems to be the wrong one then? May I ask for the correct version to start Unbound?
QuoteDone - or should I re-enable blocking lists also?
hm. It would be interesting either way. but if it works stably with BL turned on, then the second one won't be needed )
Quotecorrect version to start Unbound?
'/usr/local/sbin/pluginctl -c unbound_start' i think
Done, and changed. Now we wait...... 8)
This was fast ;D
Sorry folks, it still goes 100% CPU on one core.
Did a diff on the root.hints file - no difference at all.
The change for local gathering for statistics, seems to made no difference.
The Unbopund log (debug level this time):
2024-01-07T18:42:24 Notice unbound daemonize unbound dhcpd watcher.
2024-01-07T18:42:24 Informational unbound [37651:6] info: generate keytag query _ta-4f66. NULL IN
2024-01-07T18:42:24 Informational unbound [37651:0] info: start of service (unbound 1.19.0).
2024-01-07T18:42:24 Notice unbound [37651:0] notice: init module 2: iterator
2024-01-07T18:42:24 Notice unbound [37651:0] notice: init module 1: validator
2024-01-07T18:42:18 Informational unbound [37651:0] info: dnsbl_module: blocklist loaded. length is 3668143
2024-01-07T18:42:14 Informational unbound [37651:0] info: dnsbl_module: updating blocklist.
2024-01-07T18:42:14 Notice unbound [37651:0] notice: init module 0: python
2024-01-07T18:38:38 Informational unbound [85050:0] info: service stopped (unbound 1.19.0).
2024-01-07T18:28:13 Informational unbound [85050:4] info: generate keytag query _ta-4f66. NULL IN
2024-01-07T18:22:46 Informational unbound [85050:2] info: generate keytag query _ta-4f66. NULL IN
2024-01-07T18:14:47 Informational unbound [85050:6] info: generate keytag query _ta-4f66. NULL IN
2024-01-07T18:02:18 Notice unbound dhcpd entry changed Chromecast-Ultra.local @ 10.168.3.39.
2024-01-07T18:02:17 Notice unbound daemonize unbound dhcpd watcher.
2024-01-07T18:02:17 Informational unbound [85050:5] info: generate keytag query _ta-4f66. NULL IN
2024-01-07T18:02:17 Critical unbound [85050:3] fatal error: Could not initialize thread
2024-01-07T18:02:17 Informational unbound [85050:3] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
2024-01-07T18:02:17 Informational unbound [85050:3] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:17 Error unbound [85050:3] error: Could not set root or stub hints
2024-01-07T18:02:17 Warning unbound [85050:1] warning: root hints /root.hints:48 skipping type AAAA
2024-01-07T18:02:17 Informational unbound [85050:0] info: start of service (unbound 1.19.0).
2024-01-07T18:02:17 Error unbound [85050:3] error: reading root hints /root.hints 2:12: Syntax error, could not parse the RR's type
2024-01-07T18:02:17 Notice unbound [85050:0] notice: init module 2: iterator
2024-01-07T18:02:17 Notice unbound [85050:0] notice: init module 1: validator
2024-01-07T18:02:11 Informational unbound [85050:0] info: dnsbl_module: blocklist loaded. length is 3668143
2024-01-07T18:02:07 Informational unbound [85050:0] info: dnsbl_module: updating blocklist.
2024-01-07T18:02:07 Notice unbound [85050:0] notice: init module 0: python
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.524288 1.000000 2
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 19
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 204
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 51
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 209
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 11
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 4
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 16
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 19
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.001024 0.002048 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 50
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0399017 median[50%]=0.0628707 [75%]=0.181509
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.106474 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 7: requestlist max 2 avg 0.0733788 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 7: 1272 queries, 686 answers from cache, 586 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.524288 1.000000 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 24
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 246
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 55
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 236
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 24
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 13
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 15
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 19
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.001024 0.002048 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 57
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0388426 median[50%]=0.0628285 [75%]=0.183421
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.107350 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 6: requestlist max 3 avg 0.0998553 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 6: 1603 queries, 912 answers from cache, 691 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.524288 1.000000 2
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 27
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 225
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 47
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 204
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 21
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 6
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 10
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 14
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 53
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0405183 median[50%]=0.0649738 [75%]=0.190346
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.112311 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 5: requestlist max 2 avg 0.0901639 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 5: 1286 queries, 676 answers from cache, 610 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.524288 1.000000 2
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 22
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 238
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 58
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 200
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 13
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 10
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 9
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 22
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.001024 0.002048 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 52
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.040919 median[50%]=0.0728806 [75%]=0.189036
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.111973 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 4: requestlist max 2 avg 0.110048 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 4: 1793 queries, 1166 answers from cache, 627 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.524288 1.000000 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 23
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 251
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 49
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 210
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 21
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 4
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 22
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 17
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 54
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0397897 median[50%]=0.0652239 [75%]=0.189558
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.109149 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 3: requestlist max 2 avg 0.0996933 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 3: 1062 queries, 410 answers from cache, 652 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 21
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 215
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 58
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 219
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 10
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 7
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 12
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 16
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.001024 0.002048 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 50
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0411844 median[50%]=0.0639649 [75%]=0.182129
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.106370 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 2: requestlist max 2 avg 0.1 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 2: 1277 queries, 667 answers from cache, 610 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 1.000000 2.000000 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 28
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 261
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 56
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 205
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 20
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 3
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 11
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 19
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.001024 0.002048 5
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 52
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0415994 median[50%]=0.0836754 [75%]=0.19372
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.116842 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 1: requestlist max 2 avg 0.0996979 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 1: 1586 queries, 924 answers from cache, 662 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.524288 1.000000 1
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.262144 0.524288 22
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.131072 0.262144 240
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.065536 0.131072 43
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.032768 0.065536 203
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.016384 0.032768 18
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.008192 0.016384 4
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.004096 0.008192 9
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.002048 0.004096 13
2024-01-07T18:02:04 Informational unbound [28156:0] info: 0.000000 0.000001 59
2024-01-07T18:02:04 Informational unbound [28156:0] info: lower(secs) upper(secs) recursions
2024-01-07T18:02:04 Informational unbound [28156:0] info: [25%]=0.0408389 median[50%]=0.065536 [75%]=0.191147
2024-01-07T18:02:04 Informational unbound [28156:0] info: histogram of recursion processing times
2024-01-07T18:02:04 Informational unbound [28156:0] info: average recursion processing time 0.114222 sec
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 0: requestlist max 3 avg 0.109477 exceeded 0 jostled 0
2024-01-07T18:02:04 Informational unbound [28156:0] info: server stats for thread 0: 1048 queries, 436 answers from cache, 612 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-07T18:02:04 Informational unbound [28156:0] info: service stopped (unbound 1.19.0).
2024-01-07T17:59:48 Informational unbound [28156:6] info: generate keytag query _ta-4f66. NULL IN
I got the first Monit message at 18.06 - for high Unbound CPU.
So from 18.02 to 18.06 something happened, and as far as I can say, I have nothing else from restart yesterday, so all this is buildup until high CPU. Killed the Unbound process and everything started it self by Monit...
Hi
so how many "generate template OPNsense/Unbound/*" messages in Backend log you see around 18:02
IMHO: now it realy looks like the 'root.hints'-error causes CPU load and indicates a race:
in 18:02 something (newwanip?) triggers graceful unbound restart.
for some reason templates reloads twice (or something calls unbound restart two times in a row)
18:02:07 - unbound starts. main thread ([85050:0]) is initialized succesfully.
18:02:17 - fourth thread ([85050:3]) getting error reading root.hints. and the problems begins: thread could not be initialized.
and I think it's a template update overlay.
the following errors (high cpu, py-module error because of logging queue etc) are the consequences of the first one.
Technically, the root hints error itself can be solved by simply transferring the copying of the roots file from the templates to the start script, but in my opinion this will only mask a possible race problem
I can try to make a debug patch (setting config lock on unbound reload to handle concurrency. and couple of debug syslog messages to catch possible interference) to confirm this theory if you are ready to try, but I warn you right away that @franco is unlikely to accept it ;), so this is only for the sake of finding the reason.
do you want to continue with it?
Well no don't make that debug patch 8)
Here is the Backend log part that is relevant (timewise - nothing is filtered):
2024-01-07T18:05:07 Informational configd.py message 4dd90b66-4d22-4b19-92ff-b1d1e1531650 [] returned {"status": "ok"}
2024-01-07T18:05:05 Notice configd.py [4dd90b66-4d22-4b19-92ff-b1d1e1531650] refresh url table aliases
2024-01-07T18:05:05 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2024-01-07T18:05:05 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2024-01-07T18:05:05 Notice configd.py generate template container OPNsense/Filter
2024-01-07T18:05:05 Notice configd.py [199b3cd4-e735-4ad9-a83f-d268c00c96b8] generate template OPNsense/Filter
2024-01-07T18:05:05 Notice configd.py [95daef78-b6ce-4fb6-9737-0bf1bc0eeb6f] request pf current overall table record count and table-entries limit
2024-01-07T18:05:05 Notice configd.py [f869e948-9749-48b9-b397-becc20eb0d58] Reloading filter
2024-01-07T18:05:05 Notice configd.py [a4bad4e9-0f00-4676-8628-bbae07dbab33] Linkup stopping igb2
2024-01-07T18:02:09 Informational configd.py message ea12181a-83c5-41d2-a835-a458a224f9dd [] returned {"status": "ok"}
2024-01-07T18:02:07 Notice configd.py [ea12181a-83c5-41d2-a835-a458a224f9dd] refresh url table aliases
2024-01-07T18:02:07 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2024-01-07T18:02:07 Notice configd.py generate template container OPNsense/Filter
2024-01-07T18:02:07 Notice configd.py [13a450cc-6921-4ddc-9dd8-730900ce344c] generate template OPNsense/Filter
2024-01-07T18:02:07 Notice configd.py [df6da6c6-02e0-498b-a2d0-30d658c6e5bf] request pf current overall table record count and table-entries limit
2024-01-07T18:02:07 Notice configd.py [ba3697d2-4f99-4685-83cf-7024ebdb853d] Reloading filter
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //var/unbound/dnsbl_module.py
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //usr/local/etc/unbound_dhcpd.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //var/unbound/root.hints
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //usr/local/etc/unbound.opnsense.d/domainoverrides.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //var/unbound/private_domains.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //usr/local/etc/unbound.opnsense.d/dot.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //usr/local/etc/unbound.opnsense.d/safesearch.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //usr/local/etc/unbound/unbound-blocklists.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //var/unbound/advanced.conf
2024-01-07T18:02:07 Debug configd.py OPNsense/Unbound/* generated //usr/local/etc/unbound.opnsense.d/access_lists.conf
2024-01-07T18:02:06 Notice configd.py generate template container OPNsense/Unbound/core
2024-01-07T18:02:06 Notice configd.py [845009ff-3d47-4a02-8ecb-31dae814a7f4] generate template OPNsense/Unbound/*
2024-01-07T18:02:04 Notice configd.py [8bff90af-0236-4591-aecb-c8eee9a2d294] Unbound cache dump
2024-01-07T18:02:04 Notice configd.py [4d9a903a-6cc2-483d-8a35-87a314696807] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [b30542a5-be9f-45e7-b97b-5c11196af383] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [d45e17ac-cd7d-4729-ab0e-772a2f94a385] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [e7629d0f-173b-467e-be24-09bedb725278] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [6cea4525-a359-4cf3-bf50-41823acbc6c3] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [4d13d470-26fa-4614-9961-74f7b0c5fa2d] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [2939037b-d524-4f65-b0b7-629a7f30c203] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [4d3a183f-26b3-4ab2-b735-fa99667799c8] show system routing table
2024-01-07T18:02:03 Notice configd.py [7dd72e10-23ef-47e7-853e-ea3d4377d3da] IPsec list legacy VirtualTunnelInterfaces
2024-01-07T18:02:03 Notice configd.py [78cb954e-a948-4d41-960d-85f31c64dc3c] Linkup starting igb2
2024-01-07T18:01:00 Informational configd.py message b9bb91cc-09f0-49db-95cc-c4276d0d8252 [] returned OK
2024-01-07T18:01:00 Notice configd.py [b9bb91cc-09f0-49db-95cc-c4276d0d8252] Archive syslog files
Regarding roots.hint - since the files are 100% identical I can not see that as an issue at all. Also do remember that Unbound 1.04 worked without hickups, 1.08 "Unbound Plus" has this challenge.
I agree on your thoughts, all the things that happens are results of something else - it is not the 100% CPU on one core and so on - that is just the result of something else going wrong direction. I can use them, high CPU and high temp, as triggers in Monit to restart - but that is just a workaround kind of....
The igbd2 interface is where my PC (the one I write this on) is attached to. It goes up/down when start/stopping the PC. And in this case I think I had been working on the PC for a while, went away and it could have got into stand by maybe. However I would like to point out that this issue I have happens even when no one is home... So even though it may or may not be connected - it is well not that easy I think. I might also point into this thread where I had issues after the upgrade that started all - it got a thing about the igb2 issue:
https://forum.opnsense.org/index.php?topic=36807.msg180150#msg180150 (https://forum.opnsense.org/index.php?topic=36807.msg180150#msg180150)
It seems I go round and round on the same challenges: Unbound and link up/down - it is a bit of a challenge?
I guess I will regret this at a later date:
I think, after going over Unbound and FreeBSD issues, that this is actually an OS challenge. So I will have my bandaid running (high CPU on Unbound, kill Unbound and restart - it is by default the wrong solution I think, but I do not think we will find any OPNsense thingi to solve this, so it is what it is), and hope that when OPNsense comes out with next version of FreeBSD, this challenge will just go away by it self.
The alternative, for me, is to remove Unbound from the setup - and find other ways for DoT, DNSSEC, BlockLists and more things that Unbound can do...
I have been watching this thread for a bit. Some of the challenges you are experiencing, lar.hed, are similar to a case I just resolved. You might take a look at this thread to see if it's helpful. Not sure... it might be helpful:
https://forum.opnsense.org/index.php?topic=37159.0
Thanks for your tip enpassant.
I do notice one difference (well I have an i7 8550U CPU, while you use the i5 version) and that is you are using ZFS - I use UFS. Might not be that big a thing, except for my current theories of OS challenge deep down. I know it is a bit of a long shot, but when OPNsense gets FreeBSD 14.1-Relese sometime in the future we will know 8)
I will "any day now" reinstall and move to ZFS my self - I need the ability to roll back new versions of OPNsense - and see what and if something changes. In the end I would very much like to get Unbound to work. Just changing to another plugin/software might remove my challenges, but it has not solve the challenge, so I prefer to not change Unbound. And I do like Unbound from earlier year. And please remember that Unbound worked before I got 23.7.xxxxx with the exact same config. So No I will not give up on Unbound - this most likely is not an Unbound thing, it is more of a symptom....
Unbound started failing for me around the same time as your failure. I do think that these issues are related, so we will "wait and see." I wish you the best in getting it resolved. Try to hang in there. My experience is that you are one setting change away from your resolution. Be well!
I don't think it's an OS challenge as it's working for plenty of folks. I have DoT, DNSSEC Support, and Blocklists all enabled on my Unbound and it's working just fine. Zero issues with excessive cpu load or hanging. I am running ZFS but I don't think that would be a factor.
I know it's a huge pain, but it may be worthwhile to do a clean install and then slowly add things back in. Hopefully that will allow you to pinpoint the problem. Additionally, it's a good chance to switch over to ZFS, which I would recommend over running UFS. :)
Quote from: CJ on January 11, 2024, 02:41:14 PM
I don't think it's an OS challenge as it's working for plenty of folks. I have DoT, DNSSEC Support, and Blocklists all enabled on my Unbound and it's working just fine. Zero issues with excessive cpu load or hanging. I am running ZFS but I don't think that would be a factor.
You are most likely correct, as I wrote I might have to regret that statement - however there are some things I do not like with UFS (which is the FS I choose to run) which from time to time have it challenges on other installations I have knowledge about. But I never thought ZFS was an option with just one disc - so well, my misstake.
Quote from: CJ on January 11, 2024, 02:41:14 PMI know it's a huge pain, but it may be worthwhile to do a clean install and then slowly add things back in. Hopefully that will allow you to pinpoint the problem. Additionally, it's a good chance to switch over to ZFS, which I would recommend over running UFS. :)
Well that is most likely happen within 24 hours since I decided anyway to move to ZFS - so it will be a "bonus" with a clean install. Hmmm or are you saying a clean install and NO restore of backup file (which is more or less what I intended to do)? and build EVERYthing from ground up? Now THAT is a heck of a lot more work....
Quote from: lar.hed on January 11, 2024, 02:48:01 PM
You are most likely correct, as I wrote I might have to regret that statement - however there are some things I do not like with UFS (which is the FS I choose to run) which from time to time have it challenges on other installations I have knowledge about. But I never thought ZFS was an option with just one disc - so well, my misstake.
There are some things you lose out on when using only one disk with ZFS, but it's still beneficial over UFS. Keep in mind that your memory usage will increase as ZFS will use some of it for a cache. I'm not sure what OPNsense has it tuned for but my usage doubled. Although I'm still only using 2G out of the 8G in the system.
Quote from: lar.hed on January 11, 2024, 02:48:01 PM
Well that is most likely happen within 24 hours since I decided anyway to move to ZFS - so it will be a "bonus" with a clean install. Hmmm or are you saying a clean install and NO restore of backup file (which is more or less what I intended to do)? and build EVERYthing from ground up? Now THAT is a heck of a lot more work....
Yes. Not using the config file and doing things manually is what I'm recommending. I don't know your setup so I can't speak to the amount of work it would entail, but it would be the best method for accurately determining the source of the issue.
Depending on what you have and your time constraints, you may try doing selective imports of the config. Either import just the Unbound section and see if that causes problems by itself, or manually configure Unbound only, then once you confirm it's working correctly, slowly import sections until you recreate the issue.
So I did a clean re-install of OPNsense, this time on ZFS, and then did a full restore of config file (way to many firewall rules to rewrite everything). It has now been up for 50 hours, and counting. No Unbound 100% CPU core thingi, however Unbound decided to just vansih in one point, and Monit script for checking if Unbound is running restarted Unbound. But this was NOT due to 100% CPU - it was something else....
2024-01-14T11:38:30 Notice unbound daemonize unbound dhcpd watcher.
2024-01-14T11:38:30 Notice unbound [44346:0] notice: init module 2: iterator
2024-01-14T11:38:30 Notice unbound [44346:0] notice: init module 1: validator
2024-01-14T11:38:30 Notice unbound [44346:0] notice: init module 0: python
2024-01-14T11:37:59 Notice unbound dhcpd entry changed Galaxy-Tab-S7.local @ 10.168.3.34.
2024-01-14T11:37:54 Notice unbound daemonize unbound dhcpd watcher.
2024-01-14T11:37:54 Critical unbound [93074:4] fatal error: Could not initialize thread
2024-01-14T11:37:54 Critical unbound [93074:1] fatal error: Could not initialize thread
2024-01-14T11:37:54 Error unbound [93074:1] error: Could not set root or stub hints
2024-01-14T11:37:54 Error unbound [93074:1] error: reading root hints /root.hints 34:29: Syntax error, could not parse the RR's class
2024-01-14T11:37:54 Error unbound [93074:4] error: Could not set root or stub hints
2024-01-14T11:37:54 Error unbound [93074:4] error: reading root hints /root.hints 2:16: Syntax error, could not parse the RR's type
2024-01-14T11:37:54 Notice unbound [93074:0] notice: init module 2: iterator
2024-01-14T11:37:54 Notice unbound [93074:0] notice: init module 1: validator
2024-01-14T11:37:54 Notice unbound [93074:0] notice: init module 0: python
Is this the root.hints stuff that some have run into?
So just a moment ago I got the same 100% CPU stuff again, so running UFS vs ZFS seems of course not to matter at all. Here is the relevant part of Unbound log - any ideas? or comments?
2024-01-16T19:32:02 Informational unbound [63234:7] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T19:32:02 Notice unbound daemonize unbound dhcpd watcher.
2024-01-16T19:32:01 Informational unbound [63234:0] info: start of service (unbound 1.19.0).
2024-01-16T19:32:01 Notice unbound [63234:0] notice: init module 2: iterator
2024-01-16T19:32:01 Notice unbound [63234:0] notice: init module 1: validator
2024-01-16T19:32:01 Notice unbound [63234:0] notice: init module 0: python
2024-01-16T19:26:48 Informational unbound [79537:3] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T19:26:47 Notice unbound daemonize unbound dhcpd watcher.
2024-01-16T19:26:47 Critical unbound [79537:1] fatal error: Could not initialize thread
2024-01-16T19:26:47 Informational unbound [79537:0] info: start of service (unbound 1.19.0).
2024-01-16T19:26:47 Informational unbound [79537:1] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
2024-01-16T19:26:47 Informational unbound [79537:1] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:47 Error unbound [79537:1] error: Could not set root or stub hints
2024-01-16T19:26:47 Error unbound [79537:1] error: reading root hints /root.hints 2:37: Syntax error, could not parse the RR's type
2024-01-16T19:26:47 Notice unbound [79537:0] notice: init module 2: iterator
2024-01-16T19:26:47 Notice unbound [79537:0] notice: init module 1: validator
2024-01-16T19:26:47 Notice unbound [79537:0] notice: init module 0: python
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 13
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 23
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 26
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 10
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.002048 0.004096 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.032768 median[50%]=0.0604948 [75%]=0.116825
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.085532 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 7: requestlist max 2 avg 0.170455 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 7: 94 queries, 6 answers from cache, 88 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 15
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 9
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 24
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.002048 0.004096 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.0395947 median[50%]=0.0600747 [75%]=0.148548
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.094048 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 6: requestlist max 1 avg 0.1 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 6: 71 queries, 11 answers from cache, 60 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 16
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 9
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 20
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 7
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.002048 0.004096 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 5
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.0275017 median[50%]=0.0565248 [75%]=0.149504
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.088550 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 5: requestlist max 2 avg 0.223881 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 5: 79 queries, 12 answers from cache, 67 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 13
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 9
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 19
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.002048 0.004096 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 6
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.01536 median[50%]=0.0517389 [75%]=0.120149
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.082014 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 4: requestlist max 2 avg 0.112903 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 4: 69 queries, 7 answers from cache, 62 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 19
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 12
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 31
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 5
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 6
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 6
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.0335608 median[50%]=0.0565512 [75%]=0.139695
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.087598 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 3: requestlist max 2 avg 0.218391 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 3: 93 queries, 6 answers from cache, 87 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 15
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 13
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 27
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 9
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 6
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.001024 0.002048 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 6
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.0277618 median[50%]=0.0540065 [75%]=0.119729
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.081742 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 2: requestlist max 1 avg 0.185185 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 2: 347 queries, 266 answers from cache, 81 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 20
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 16
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 22
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 3
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.000000 0.000001 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.0435665 median[50%]=0.075776 [75%]=0.155648
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.101185 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 1: requestlist max 1 avg 0.101449 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 1: 81 queries, 12 answers from cache, 69 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.262144 0.524288 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.131072 0.262144 21
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.065536 0.131072 12
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.032768 0.065536 19
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.016384 0.032768 7
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.008192 0.016384 2
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.004096 0.008192 4
2024-01-16T19:26:46 Informational unbound [71805:0] info: 0.002048 0.004096 1
2024-01-16T19:26:46 Informational unbound [71805:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:26:46 Informational unbound [71805:0] info: [25%]=0.0379419 median[50%]=0.0709973 [75%]=0.168521
2024-01-16T19:26:46 Informational unbound [71805:0] info: histogram of recursion processing times
2024-01-16T19:26:46 Informational unbound [71805:0] info: average recursion processing time 0.099173 sec
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 0: requestlist max 2 avg 0.235294 exceeded 0 jostled 0
2024-01-16T19:26:46 Informational unbound [71805:0] info: server stats for thread 0: 84 queries, 16 answers from cache, 68 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:26:46 Informational unbound [71805:0] info: service stopped (unbound 1.19.0).
2024-01-16T19:00:44 Informational unbound [71805:2] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T19:00:43 Notice unbound daemonize unbound dhcpd watcher.
2024-01-16T19:00:43 Informational unbound [71805:0] info: start of service (unbound 1.19.0).
2024-01-16T19:00:43 Notice unbound [71805:0] notice: init module 2: iterator
2024-01-16T19:00:43 Notice unbound [71805:0] notice: init module 1: validator
2024-01-16T19:00:43 Notice unbound [71805:0] notice: init module 0: python
2024-01-16T19:00:43 Informational unbound [37599:0] info: 4.000000 8.000000 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 2.000000 4.000000 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 1.000000 2.000000 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.524288 1.000000 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 19
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 217
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 121
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 260
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 117
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 56
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 77
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 38
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 6
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 99
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.0130194 median[50%]=0.0471355 [75%]=0.124031
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.088653 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 7: requestlist max 5 avg 0.364173 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 7: 1374 queries, 358 answers from cache, 1016 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.524288 1.000000 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 27
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 221
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 123
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 272
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 130
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 56
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 80
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 36
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 8
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000128 0.000256 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 99
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.0140069 median[50%]=0.0469233 [75%]=0.123746
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.075034 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 6: requestlist max 4 avg 0.393365 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 6: 1696 queries, 641 answers from cache, 1055 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 1.000000 2.000000 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.524288 1.000000 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 33
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 258
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 134
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 303
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 141
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 72
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 80
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 53
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 3
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000064 0.000128 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 141
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.0111502 median[50%]=0.0456373 [75%]=0.125203
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.075539 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 5: requestlist max 3 avg 0.370915 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 5: 1631 queries, 407 answers from cache, 1224 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.524288 1.000000 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 31
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 224
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 125
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 252
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 119
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 59
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 85
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 38
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 4
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000032 0.000064 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 129
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.00951105 median[50%]=0.0456411 [75%]=0.125567
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.076967 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 4: requestlist max 4 avg 0.4 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 4: 1668 queries, 598 answers from cache, 1070 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.524288 1.000000 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 30
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 204
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 93
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 247
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 107
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 55
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 87
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 40
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 4
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 3
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000256 0.000512 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000128 0.000256 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000016 0.000032 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 131
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.00741517 median[50%]=0.0424525 [75%]=0.120149
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.073688 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 3: requestlist max 3 avg 0.366435 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 3: 1369 queries, 362 answers from cache, 1007 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 25
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 251
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 110
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 274
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 112
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 64
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 70
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 52
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 3
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000256 0.000512 2
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000128 0.000256 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 127
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.010368 median[50%]=0.0464014 [75%]=0.132639
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.075480 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 2: requestlist max 4 avg 0.411172 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 2: 1732 queries, 640 answers from cache, 1092 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 24
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 231
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 104
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 266
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 132
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 64
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 78
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 53
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 4
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 4
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000128 0.000256 1
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 99
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.01152 median[50%]=0.0444709 [75%]=0.12477
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.073098 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 1: requestlist max 3 avg 0.401887 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 1: 1433 queries, 373 answers from cache, 1060 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.262144 0.524288 26
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.131072 0.262144 240
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.065536 0.131072 132
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.032768 0.065536 272
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.016384 0.032768 112
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.008192 0.016384 54
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.004096 0.008192 92
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.002048 0.004096 39
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.001024 0.002048 4
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000512 0.001024 3
2024-01-16T19:00:43 Informational unbound [37599:0] info: 0.000000 0.000001 134
2024-01-16T19:00:43 Informational unbound [37599:0] info: lower(secs) upper(secs) recursions
2024-01-16T19:00:43 Informational unbound [37599:0] info: [25%]=0.00895052 median[50%]=0.0467426 [75%]=0.125611
2024-01-16T19:00:43 Informational unbound [37599:0] info: histogram of recursion processing times
2024-01-16T19:00:43 Informational unbound [37599:0] info: average recursion processing time 0.074082 sec
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 0: requestlist max 4 avg 0.382671 exceeded 0 jostled 0
2024-01-16T19:00:43 Informational unbound [37599:0] info: server stats for thread 0: 1454 queries, 346 answers from cache, 1108 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-01-16T19:00:43 Informational unbound [37599:0] info: service stopped (unbound 1.19.0).
2024-01-16T18:53:15 Informational unbound [37599:3] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T18:53:15 Informational unbound [37599:5] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T18:00:20 Informational unbound [37599:5] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T17:03:33 Informational unbound [37599:2] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T16:38:50 Informational unbound [37599:5] info: generate keytag query _ta-4f66. NULL IN
Note: /var/unbound/root.hints is identical to /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints - although do note that the Unbound process was killed -9 and the restarted.
I for one fail to see why the choice of filesystem would have an impact unless there were unclean shutdowns that could corrupt files, something UFS is not as good dealing with.
At least you moved to ZFS, so all in all is better. Restoring config just brought the problem across if is config-related though.
My suggestion still is to use Unbound on its own. No blocklists, no whitelists and no DoT. Let it run for a few days to see if it re-occurs the 100% cpu. And if you must have them, move them to AdGuardHome. Ideally not, and introduce the additions one at the time over time.
Oh, I see I need to do one major clarification: I did not move to ZFS (from UFS) to solve this Unbound challenge!
I did that move to be able to do snapshots ( https://forum.opnsense.org/index.php?topic=25540.msg122731 ) so that what happened to me, when I upgraded from a fully functioning OPNsense with a fully working Unbound to 23.7.x version which killed Unbound, is possible to recover from. As well known OPNsense can only backup the config file. Even if one happen to have an old config file, there is just nu way to download an old version with old versions of plugins (se below) to get that old config to work again. It is a one-way road with OPNsense. So to get around that I first thought I go ProxMox route (just to never figure out how to configure (use) my 8 networks ports with my somewhat odd setup) - but I then remembered (I used to work as Unix SysAdmin and DBA) I should be able to do snapshot. UFS snapshot is not included in OPNsense, however ZFS has what is needed included in OPNsense distribution. And from now I will be able to get a snapshot backup solution - I just need a stable OPNsense/Unbound.
Plugins: I currently only have CrowdSec and UDP Broadcast Relay as the only plugins. I would like to keep plugins at as few as ever possible. And this is only due to the fact that one can never (at least not any easy way) download an old version onto a new verison of OPNsense or something. One either runs latest&greatest or one does not do upgrades on working versions (read: Do a snapshot before any upgrade to be able to go back). This is a very harsh and hard way to solve problems. But it is either that or Monit scripts that restarts everything when it get stuck into something. Neither solution is perfect. BandId kind of solution, both they are...
When it comes to AdGuardHome: It would either be a plugin over external source, or a stand alone installation on a pi, or one could then opt for a pi-hole also. Either solution of the three are adding complexity. I do prefer a simpler, as in fewer, option.
Also, Unbound worked prior to 23.7 - and it seems to work for other people. Even with blocklists.
And then for the idea that I have that this is an OS thingi: I have this odd feeling that it is a memory issue. I have this since roots.hints file is never changed from what I can see. Even so thet roots file is part of the challenge. From the latest log I can see that it at 19.00 or so seems to download something, and restarts (19:00:43 first and last row of 19.00.43). That restart seems to stop since at 19:26:46 It stops again. Why is that?
And this is when it's gets fuggly... The 19:26 is the high CPU I kill-9 - So what ever reason that started at 19:00:43, is the reason I get 100% CPU. Now I have a Monit script that kills this and restarts it again.
This restart, the 2nd within 33 minutes or so, this one get's the root or stub error. It workd half o hour ago, but not this time. AND when my Monit script kills -9 and then restarts Unbound, it just works again.
So the restart at 19:00:43 is the reason this goes wrong. I just can not find WHY it does the restart at 19:00:43. We find out that - we find the challenge.
I still think the best path forward is to try with a clean install and no config, or at the very least only a partial config restore. All of the different moving parts makes things like this harder to troubleshoot.
2nd day in a row: 19:07 - Unbound 100%, and it is the same as yeasterday:
2024-01-17T19:07:32 Informational unbound [58736:0] info: service stopped (unbound 1.19.0).
However that PID is not the expected one... Because the PID that got kill -9 was 59852, and the PID before that was 22451 - and the PID that is now running after kill -9 is 85003. 4 Different Unbound processes.
The thing that bothers me is just that 22451 might very well have been running at around 19.00 and then for some reason Unbound decides to start another process, 59852 - and we still have the 58736 PID....
Why that is I do not know. What is Unbound downloading everyday?
Add an delay on startup as suggested before might very well solve this.
And another thing, this is from the Unbound log yesterday:
2024-01-16T19:32:02 Informational unbound [63234:7] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T19:32:02 Notice unbound daemonize unbound dhcpd watcher.
2024-01-16T19:32:01 Informational unbound [63234:0] info: start of service (unbound 1.19.0).
2024-01-16T19:32:01 Notice unbound [63234:0] notice: init module 2: iterator
2024-01-16T19:32:01 Notice unbound [63234:0] notice: init module 1: validator
2024-01-16T19:32:01 Notice unbound [63234:0] notice: init module 0: python
2024-01-16T19:26:48 Informational unbound [79537:3] info: generate keytag query _ta-4f66. NULL IN
2024-01-16T19:26:47 Notice unbound daemonize unbound dhcpd watcher.
2024-01-16T19:26:47 Critical unbound [79537:1] fatal error: Could not initialize thread
2024-01-16T19:26:47 Informational unbound [79537:0] info: start of service (unbound 1.19.0).
And this is from the "General" Log under system:
2024-01-16T19:32:01 Error opnsense /usr/local/sbin/pluginctl: The command '/bin/kill -'TERM' '79537''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 79537: No such process'
It is not strange that the kill does not work since the correct PID has changed - this got to be an error, the last startup has not updated the PID in time before getting killed, or trying to get killed...
Something is going on - let's see if I can monitor this closer tomorrow - it seems to do something at 19:00 (local time) so there is some sort of trigger starting all this.....
Regarding yesterdays 100% CPU, I did dig around in the System -> General log to see if I could find anything usefull. After all, what is triggering the restart of Unbound, right?
2024-01-17T19:10:53 Notice dhclient Creating resolv.conf
2024-01-17T19:09:43 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : unbound_configure_do())
2024-01-17T19:09:43 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : dnsmasq_configure_do())
2024-01-17T19:09:43 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(igb2)
2024-01-17T19:09:42 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt2(igb2)
2024-01-17T19:09:42 Error configctl error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 66, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2024-01-17T19:09:28 Notice kernel <6>igb2: link state changed to UP
2024-01-17T19:07:48 Notice kernel <6>igb2: link state changed to DOWN
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : unbound_configure_do())
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : dnsmasq_configure_do())
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
2024-01-17T19:07:42 Notice kernel <6>igb2: link state changed to UP
2024-01-17T19:07:42 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(igb2)
It seems to be rc.linkup that triggers the restart(s)...
Normal. What you see there are the routines called to setup services needed on an interface when it comes up.
You have this pc that you dedicate an interface to on your firewall , so when it wakes up or boot up, it "wakes" the interface (or so was my understaning, apologies if that's incorrect). Not saying it should bring anything down i.e. "break things" when you bring an interface up but the messages and config of unbound is what I'd expect.
Yes exactly - it is an "IP renew event", triggered by the interface.
Just as karlsson2k in the Unbound crashing thread: https://forum.opnsense.org/index.php?topic=35527.0
Although karlsson2k had WAN IP refresh, in my case I have PC dedicated to that interface (always the same IP address).
So IP is refreshed/renewed, and the:
2024-01-17T19:09:43 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : unbound_configure_do())
Does the restart of Unbound, and if one is lucky, this works - well most of the time anyway.
I do not think that OPNsense on smaller / slower boxes has this issue, but what do I know....?
What is the "Network Interfaces" option for Unbound set to?
Quote from: Patrick M. Hausen on January 18, 2024, 02:58:52 PM
What is the "Network Interfaces" option for Unbound set to?
Default = "All (recommended)"
Excellent - tell me more!
Or: It is the default, what should I replace that with - what and why?
Sorry, the other way round. "Dang, that's correct and not a possible source of problems."
When you explicitly set it to certain interfaces only and these interfaces go down, problems arise.
Ah, no worries then ;D
I'll keep looking around, I still have a few odd ideas what to look for.... I'll will give up when I am out of ideas... Maybe...
Quote from: lar.hed on January 18, 2024, 02:43:13 PM
Yes exactly - it is an "IP renew event", triggered by the interface.
Just as karlsson2k in the Unbound crashing thread: https://forum.opnsense.org/index.php?topic=35527.0
Although karlsson2k had WAN IP refresh, in my case I have PC dedicated to that interface (always the same IP address).
So IP is refreshed/renewed, and the:
2024-01-17T19:09:43 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : unbound_configure_do())
Does the restart of Unbound, and if one is lucky, this works - well most of the time anyway.
I do not think that OPNsense on smaller / slower boxes has this issue, but what do I know....?
Not quite the same.
Your particular case has one client on that interface, most others will have a switch on it with many clients instead, each will need a new IP lease.
The interface state is "down" when your PC is off because it is the only one and hence, will make the interface be down or up, triggering this 'interface up', which contains "unbound_configure". As I said it shouldn't break things but trying to point out that is not an ip address renewal triggering it.
Still thinking a basic setup first is the way to diagnose....
I appreciate your help!!
Sorry that I am stubborn then, but the interface goes up, down and then again up within seconds. I do not turn my PC on/off/on that fast - I would love to be able to do just that, but Windows does not seem to like that.... ???
And let's not forget that this exact config worked pre 23.7.
I have not changed anything inside Unbound or interfaces, not even a tiny bit - it is the same as before.
I have added two lines in my kill script. Copy root.hints and then fstat -u unbound. Anything else I could add that will be executed by automagic (Monit) if Unbound seems to hung again, anything that may or may not help out?
> but the interface goes up, down and then again up within seconds. I do not turn my PC on/off/on that fast - I would love to be able to do just that, but Windows does not seem to like that
:) yes, the up has a bit of a sequencing that makes it look that way. Cosmetic. Short of it is this: pc goes ON, interface will go UP albeit messages will show up, down, up.
> I have added two lines in my kill script. Copy root.hints and then fstat -u unbound. Anything else I could add that will be executed by automagic (Monit) if Unbound seems to hung again, anything that may or may not help out?
With much respect, I think your use of non-standard stops/starts/restarts i.e. work around the problem, could be making things worse.
You might or not still have that first mixture of (I think) DoT or DoH + blocklists (if I read correctly the size had over X million lines) + some whitelist(s), being got from what seemed your own curation and hosted on github + some Monit kill and start not using templates + maybe something else.
Hence just IMHO adding more layers to this is probably not helpful to a diagnostic. I apologise in advance if this is all vanilla now and I'm making the wrong assumptions.
Even if I was though, my stance is diagnose the problem instead of putting tactical fixes that just mask the problem. What do I know though :)
If I had to start all over as in zero config done (like suggested), I would go another route completely and install ProxMox, enable that firewall (filtering, no DPI or anything, however one could always add a virtual host with that and so on), and most likely install pi-hole in a separate virtual server and so on. I would most likely not give OPNsense another shot at the title...
Either we all fix this once and for all, or I'll leave it for ever. It is an OPNsense issue for sure, karlsson2k already stated this, and it did not occur with the exact same config of Unbound in 23.1 - it is a 23.7 introduced issue.
Oh and by the way, not being able to trust either restore of config or the upgrade process for a fully working unbound config is not something I consider a good solution. Yes my kill script is really bad to be honest, it is one of the worst work-arounds I have ever done (and I am an Unix Sysadm, and DBA, and a lot more regarding performance & tuning on really big systems - but I am not network admin), so it is even worst. However this is up to now the only solution that actually kind of works - just in a wrong way.... :-X
@lar.hed
Hi.
I'm sorry I'm missing (I don't have enough time right now to play around with OPNsense :( ).
It turned out that it is much easier (but still quite difficult) to reproduce this behavior on a bare-metal OPNsense. I made a test setup on i7-4770@3.40GHz(4 cores, 8 threads) with SSD.
I ran for several days with a cron job calling the "/usr/local/etc/rc.newwanip" at different intervals.
And although a race is possible in the OPN unbound startup procedure (plus there is a pretty wide window when the root hints file can be replaced), I could not force a race in the startup procedure to provoke an error reading the root's file.
In my opinion, there is a race in the threads creating procedure in the unbound itself:
- I don't understand why but the Unbound devs decided to read the root-hints file on each(!) thread init.
https://github.com/NLnetLabs/unbound/blob/352245160058e9419565f922d62ce01634280b9d/daemon/worker.c#L2278
-If you look at the unbound log, you can see that the thread creation processes overlap (the log shows messages from the threads not in the order in which they were launched)
-If you look at the error message itself in the format "reading root hints /root.hints 2:6: Syntax error, could not parse the RR's type", then the first number ("2") is the line number in the file + 1 ( in fact, we are talking about line No. 1), the second number ("6") is the offset.
https://github.com/NLnetLabs/unbound/blob/352245160058e9419565f922d62ce01634280b9d/iterator/iter_hints.c#L337
You may notice that there is an attempt, for example, to parse a comment line, which does not make sense. I also saw a record skip error due to a mismatch between the record type (AAAA) and the IP address format.
In my opinion, this indicates that due to the thread init race, a situation is possible when the record parsing procedure at some point is getting a line pointer from the other thread...
I didn't find anything in the changes in the last 6 months that could cause this behavior (but I'm very weak in the C language).
Looking at the thread startup procedure, I don't see the sources of other problems (each thread also loads forwarders settings, but they are read from memory, so imho this should not be a problem).
Therefore, IMHO a workaround at this moment could be to simply not to use the root-hints file:
You can try commenting out (adding a semicolon) all the lines in /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints file - in this case, unbound will use the hard-coded roots (they updated them and they actually match the roots in the file).
I'll try to dig deeper into the unbound code, but as I said, I'm very weak in the C and don't have much time right now (
Hi @Fright!
Thanks for helping out.
I can add this that I wrote in the other Unbound thread:
Quote from: lar.hed on January 23, 2024, 10:44:08 AM
I need to be more precis I think...
So, my current setup is OPNsense 23.7.11-amd64.
On this I have the two patches earlier referenced:
opnsense-patch a086f40b
opnsense-patch 845fbd384fe
The I have removed a two plugins: mDNS and IGMP Proxy - and is only running UDP Broadcast Relay: https://forum.opnsense.org/index.php?topic=38114.0
Also, since in my case there seem to be some kind of connection to IP adress changes or something I decided to uncheck "Register DHCP Leases" and "Register DHCP Static Mappings".
So in all 6 changes. I can not say that each change has anything to do with this challenge I have with Unbound, however, the changes above has made Unbound stable from 100% CPU Bound. Which one I would vote for? Patches all day long....
I have had one Unbound stop which I have no reference to why. Monit restarted Unbound directly and since I'm not at home where the OPNsense is installed, I have not been able to check anything....
I have not had any more 100% CPU on one core since I changed the above. Currently I do not know exactly which one that is most likely to have solved this. Although I have to say that removing the extra plugins should not be the reason....
Did you find any solution ? iam running in the same issue since 24.
Errors are:
2024-02-15T01:58:19 Error unbound [47314:1] error: reading root hints /root.hints 7:8: Syntax error, could not parse the RR's type
2024-02-15T01:58:19 Error unbound [47314:3] error: reading root hints /root.hints 2:13: Syntax error, could not parse the RR's type
i didnt change any unbound config but it randomly stopps working now.
General log also shows:
/usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '47314''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 47314: No such process'
I am running 24.1 nowdays, with the same challenge I might add - you can follow that in the new thread:
https://forum.opnsense.org/index.php?topic=38839.0