Unbound errors in log file

Started by lar.hed, December 31, 2023, 12:28:43 PM

Previous topic - Next topic
So I have, still, Unbound problemes....

With no clue what so ever it just hangs on one core (100%) and stops responding to DNS requests. This has made me, on top of other challenges (like filter rules and live view), a bit confused. So I have decided to see if anyone could provide any information on my log file (attached)?

What I have done so far is change from SMPT server name to SMTP IP adress so I can get emails from Monit which monitors if a) Unbound hits 100% on one core or b) the CPU goes very high (which so far only happened when Unbound hits 100%, but I think it is a good indicator that something might not be okay so it is a bit more of a general Monit script so to speak), if can email me (since no longer depending on Unbound to resolve SMTP name) and I have been starting to develop a script that will do kill -9 (yes, this is what seems to be needed when this occurs - I know it is last resort, but I can not have a firewall that hangs - need to fix it somehow) and then pluginctl unbound start. This will be completed next time this happens - which I never knows when it will happen, funny enough...

So, anyone that can bring light into my challenge? Or do I go the hard way (kill -9)?


I for sure will - next time, it may just take a few days....

Right, so this morning Unbound hit 100% CPU on one core - I copy&pasted the command from the other thread, and here is the result:

Quoteroot@OPNsense:~ # service unbound stop && rm -v /var/unbound/root.hints & cp -v /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints /var/unbound/root.hints && service unbound onestart
[1] 13720
/usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints -> /var/unbound/root.hints
Cannot 'stop' unbound. Set unbound_enable to YES in /etc/rc.conf or use 'onestop' instead of 'stop'.
Obtaining a trust anchor../var/unbound/root.hints
.
Starting unbound.
[1]  + Done                          ( service unbound stop && rm -v /var/unbound/root.hints )

And Unbound is still killing one core with 100% CPU - so that did not do any difference I say.

A bit curios though about the "cannot stop" - since well I have to use Kill -9 as I said - and this time I got some more messages, the one with "unbound_enable" to YES makes me most curios though since Unbound IS enabled. And running, just 100% CPU running and because of that not responding to any requests....

January 03, 2024, 11:00:54 AM #4 Last Edit: January 03, 2024, 11:03:06 AM by doktornotor
Use onestop. Or use configctl unbound <stop|start> - OPNs does not use service normally, hence no unbound_enable.

I have tried:
- the configctl stop - does not stop the Unbound process
- onestop - does not stop the Unbound process

The only way so far is:

pgrep "unbound" | grep -v "$$" | xargs kill -9

Or just kill the process manually of course - but I did a small script for this...

Does not stop does not help diagnosing anything. So, what is returned when running those commands?

Sorry for not being clear about that.

Now I can not remember any return at all, except that the process just keeps running 100% in one core. I will double check this next time this happens, and as I wrote before, I never know or understand why Unbound goes 100% CPU on one core. For example the error output from the log for this morning occurrence is only this:
Quote2024-01-03T10:28:26   Notice   unbound   Backgrounding unbound logging backend.   
2024-01-03T10:28:26   Notice   unbound   daemonize unbound dhcpd watcher.   
2024-01-03T10:28:26   Critical   unbound   [38957:2] fatal error: Could not initialize thread   
2024-01-03T10:28:26   Error   unbound   [38957:2] error: Could not set root or stub hints   
2024-01-03T10:28:26   Error   unbound   [38957:2] error: reading root hints /root.hints 2:2: Syntax error, could not parse the RR's TTL   
2024-01-03T10:28:26   Notice   unbound   [38957:0] notice: init module 2: iterator   
2024-01-03T10:28:26   Notice   unbound   [38957:0] notice: init module 1: validator   
2024-01-03T10:28:15   Notice   unbound   [38957:0] notice: init module 0: python   
2024-01-03T10:28:12   Notice   unbound   Closing logger/quote]

EDIT: CHanged to "Notice" on Log output.... It may give more of a hint?

If you don't mind me butting in, from your log and previous posts on the other thread, my thinking is that you have something in your additional Unbound settings that tickle the problem and eventually crashes it.
I imagine you are still using either blocklists or DoT or both and there is something in either that triggers it.
The other thread I always thought it was strange why just in your case you'd have a problem with the root.hints file. Script to get around it is masking the problem. Finding the root of the problem is the real pickle.
So, indulge my thinking. Are you using blocklists, DoT?
Be prepared, It might be that you'd want to consider we'd have to go to a mailing list https://nlnetlabs.nl/support/mailing-lists/ to get to it.

Yes, I use blocklists and DoT.

No Worries!

Here is the config part of Unbound (or UnboundPlus as it is called) - feel free :-)

    <unboundplus version="1.0.8">
      <general>
        <enabled>1</enabled>
        <port>53</port>
        <stats>1</stats>
        <active_interface/>
        <dnssec>1</dnssec>
        <dns64>0</dns64>
        <dns64prefix>64:ff9b::/96</dns64prefix>
        <noarecords>0</noarecords>
        <regdhcp>1</regdhcp>
        <regdhcpdomain/>
        <regdhcpstatic>1</regdhcpstatic>
        <noreglladdr6>0</noreglladdr6>
        <noregrecords>0</noregrecords>
        <txtsupport>0</txtsupport>
        <cacheflush>0</cacheflush>
        <local_zone_type>transparent</local_zone_type>
        <outgoing_interface/>
        <enable_wpad>0</enable_wpad>
      </general>
      <advanced>
        <hideidentity>0</hideidentity>
        <hideversion>0</hideversion>
        <prefetch>0</prefetch>
        <prefetchkey>0</prefetchkey>
        <dnssecstripped>0</dnssecstripped>
        <serveexpired>0</serveexpired>
        <serveexpiredreplyttl/>
        <serveexpiredttl/>
        <serveexpiredttlreset>0</serveexpiredttlreset>
        <serveexpiredclienttimeout/>
        <qnameminstrict>0</qnameminstrict>
        <extendedstatistics>0</extendedstatistics>
        <logqueries>0</logqueries>
        <logreplies>0</logreplies>
        <logtagqueryreply>0</logtagqueryreply>
        <logservfail>0</logservfail>
        <loglocalactions>0</loglocalactions>
        <logverbosity>1</logverbosity>
        <valloglevel>0</valloglevel>
        <privatedomain/>
        <privateaddress>0.0.0.0/8,10.0.0.0/8,100.64.0.0/10,169.254.0.0/16,172.16.0.0/12,192.0.2.0/24,192.168.0.0/16,198.18.0.0/15,198.51.100.0/24,203.0.113.0/24,233.252.0.0/24,::1/128,2001:db8::/32,fc00::/8,fd00::/8,fe80::/10</privateaddress>
        <insecuredomain/>
        <msgcachesize/>
        <rrsetcachesize/>
        <outgoingnumtcp/>
        <incomingnumtcp/>
        <numqueriesperthread/>
        <outgoingrange/>
        <jostletimeout/>
        <cachemaxttl/>
        <cachemaxnegativettl/>
        <cacheminttl/>
        <infrahostttl/>
        <infrakeepprobing>0</infrakeepprobing>
        <infracachenumhosts/>
        <unwantedreplythreshold/>
      </advanced>
      <acls>
        <default_action>allow</default_action>
      </acls>
      <dnsbl>
        <enabled>1</enabled>
        <safesearch>1</safesearch>
        <type>atf,oisd0,oisd1,oisd2,aa,ag,bla0,bla,blc,bld,blf,blf0,blg,blm,blp,blp0,blp1,blr,blr0,bls,blt,blt0,blt1,el,ep,nc,pa,pt,sa,st,sb,ws,yy</type>
        <lists>https://raw.githubusercontent.com/larhedse/hostnamelistan/master/BaraHostLista.txt</lists>
        <whitelists>dl-debug.dropbox.com</whitelists>
        <blocklists/>
        <wildcards/>
        <address/>
        <nxdomain>0</nxdomain>
      </dnsbl>
      <forwarding>
        <enabled>0</enabled>
      </forwarding>
      <dots>
        <dot uuid="f6bbdefd-289e-4e72-9ca0-fa089405a36c">
          <enabled>1</enabled>
          <type>dot</type>
          <domain/>
          <server>1.1.1.1</server>
          <port>853</port>
          <verify>cloudflare-dns.com</verify>
        </dot>
        <dot uuid="c9ec320c-281c-4183-bfbf-70169696bc92">
          <enabled>1</enabled>
          <type>dot</type>
          <domain/>
          <server>1.0.0.1</server>
          <port>853</port>
          <verify>cloudflare-dns.com</verify>
        </dot>
      </dots>
      <hosts/>
      <aliases/>
      <domains/>
    </unboundplus>

Where did you get this config file from? i wonder so compare mine with the UI.
Meantime the inner exception i see the common denominator a failure in python processing dnsbl_module.py which made me suspect the blocklist element, and makes me think the "integer required" is either the problem or a symptom. From there I'd be thinking something in the list can't be handled.
I'd remove it and observe if makes a difference. Ideally parse the list for anything obvious.
I don't use said blocklists on Unbound. As I've said in the past, I've seen them as causing difficult to diagnose problems. Instead I use AdGuardHome and lists go there.

That part of the config file comes from the backup config file (under System -> Backups -> Download configuration).

Just so I understand: Should I disable dnsbl? And move just that (blocking countries and lists) to AdGuardHome (which I know nothing about, never used so newbie on that)?

It is worth noting that I can not see any kind of pattern regarding WHEN this happens (when the CPU goes 100% on the Unbound process, on one core - I am great full that I have a bunch more cores for the rest...). It is not like x hours since last, or y number of calls or anything like "full disk" or "log full" or anything like that. It just starts to not respond and up goes the CPU% on that core (and the CPU get a bit warmer of course - and this is a passive cold CPU). And I can not get any for me useful information from the log in the GUI - is there anywhere else I could look?

My thinking is that this last screenshot of the log has been the most useful right now. Pity wasn't in code brackets, hate working off screenshots. I digress.
My thinking (which can very well be off) is that something on that blocklist(s) are what trip an Unbound process .
dnsbl_module.py seems to choke , being called by operate - module_event_moddone. The rest in this log is result of it, like the process running away and consuming cpu cycles. That's my take.
So I would disable block list and see if that resolves. Might need to wait some time.
If not, remove DoT too.
And if you must  have blocklists, use AdGH as workaround.

p.s. still IMHO this is one for the Unbound vendor to help solve, not for OPN. Hence the hint to their mail lists.

Sorry for that - still adapting to this forums software - I might be a bit to old ;-) Or something :-)

Anyway, I turned off block lists - now it is just a waiting game......