Unbound errors in log file

Started by lar.hed, December 31, 2023, 12:28:43 PM

> but the interface goes up, down and then again up within seconds. I do not turn my PC on/off/on that fast - I would love to be able to do just that, but Windows does not seem to like that
:) Yes, bringing the interface up involves a bit of sequencing that makes it look that way. It's cosmetic. The short of it is this: the PC goes ON and the interface goes UP, although the log messages will show up, down, up.

> I have added two lines in my kill script. Copy root.hints and then fstat -u unbound. Anything else I could add that will be executed automagically (by Monit) if Unbound seems to hang again, anything that may or may not help out?
With much respect, I think your use of non-standard stops/starts/restarts, i.e. working around the problem, could be making things worse.

You might or might not still have that first mixture of (I think) DoT or DoH + blocklists (if I read correctly, the size was over X million lines) + some whitelist(s), fetched from what seemed to be your own curation hosted on GitHub + some Monit kill and start not using templates + maybe something else.
Hence, just IMHO, adding more layers to this is probably not helpful for a diagnosis. I apologise in advance if this is all vanilla now and I'm making the wrong assumptions.
Even if I am, though, my stance is to diagnose the problem instead of putting in tactical fixes that just mask it. What do I know though :)

If I had to start all over with zero config done (as suggested), I would go another route completely: install Proxmox, enable its firewall (filtering only, no DPI or anything, although one could always add a virtual host for that), and most likely install pi-hole in a separate virtual server. I would most likely not give OPNsense another shot at the title...

Either we all fix this once and for all, or I'll leave it forever. It is an OPNsense issue for sure, karlsson2k already stated this, and it did not occur with the exact same Unbound config in 23.1, so it is an issue introduced in 23.7.

Oh and by the way, not being able to trust either a config restore or the upgrade process to preserve a fully working Unbound config is not something I consider a good solution. Yes, my kill script is really bad to be honest, it is one of the worst workarounds I have ever done (and I am a Unix sysadmin, and DBA, and a lot more regarding performance & tuning on really big systems, but I am not a network admin), so it is even worse. However, up to now this is the only solution that actually kind of works, just in the wrong way.... :-X

@lar.hed
Hi.
I'm sorry I've been absent (I don't have enough time right now to play around with OPNsense :( ).
It turned out that it is much easier (but still quite difficult) to reproduce this behavior on a bare-metal OPNsense. I made a test setup on an i7-4770 @ 3.40GHz (4 cores, 8 threads) with an SSD.
I ran it for several days with a cron job calling "/usr/local/etc/rc.newwanip" at different intervals.
And although a race is possible in the OPNsense Unbound startup procedure (plus there is a pretty wide window in which the root hints file can be replaced), I could not force a race in the startup procedure that provoked an error reading the root hints file.
In my opinion, there is a race in the thread-creation procedure in Unbound itself:
- I don't understand why, but the Unbound devs decided to read the root-hints file on each(!) thread init.
https://github.com/NLnetLabs/unbound/blob/352245160058e9419565f922d62ce01634280b9d/daemon/worker.c#L2278
- If you look at the Unbound log, you can see that the thread creation processes overlap (the log shows messages from the threads out of the order in which they were launched).
- If you look at the error message itself, in the format "reading root hints /root.hints 2:6: Syntax error, could not parse the RR's type", the first number ("2") is the line number in the file + 1 (so in fact we are talking about line 1), and the second number ("6") is the offset.
https://github.com/NLnetLabs/unbound/blob/352245160058e9419565f922d62ce01634280b9d/iterator/iter_hints.c#L337
You may notice that in some cases there is an attempt to parse a comment line, which does not make sense. I also saw a record-skip error due to a mismatch between the record type (AAAA) and the IP address format.
In my opinion, this indicates that, due to the thread init race, the record parsing procedure can at some point get a line pointer from another thread...
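
To make the two position fields described above easier to pick out of a long log, here is a minimal Python sketch that extracts them from lines of this kind. The function name and the regular expression are illustrative assumptions inferred from the log lines quoted in this thread, not anything shipped with Unbound or OPNsense, and the "reported line minus one" reading is taken from the explanation above rather than verified independently.

```python
import re

# Hypothetical pattern, inferred only from the error lines quoted in this thread.
HINTS_ERROR = re.compile(
    r"reading root hints (?P<path>\S+) (?P<line>\d+):(?P<offset>\d+): (?P<message>.+)"
)


def parse_hints_error(log_line):
    """Return the path, position fields and message, or None if the line does not match."""
    match = HINTS_ERROR.search(log_line)
    if match is None:
        return None
    reported = int(match.group("line"))
    return {
        "path": match.group("path"),
        "reported_line": reported,
        # Assumption from the post above: the reported number is the file line + 1.
        "file_line": reported - 1,
        "offset": int(match.group("offset")),
        "message": match.group("message"),
    }


if __name__ == "__main__":
    sample = ("reading root hints /root.hints 2:6: "
              "Syntax error, could not parse the RR's type")
    print(parse_hints_error(sample))
```

Running this over the whole Unbound log and grouping by `file_line` could show whether the failures cluster on particular lines or are scattered, which is what a cross-thread race would suggest.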
I didn't find anything in the changes in the last 6 months that could cause this behavior (but I'm very weak in the C language).
Looking at the thread startup procedure, I don't see the sources of other problems (each thread also loads forwarders settings, but they are read from memory, so imho this should not be a problem).

Therefore, IMHO, a workaround at this moment could be simply not to use the root-hints file:
You can try commenting out (by prefixing with a semicolon) all the lines in the /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints file. In that case, Unbound will use the hard-coded roots (they updated them, and they actually match the roots in the file).
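
If you would rather not edit the file by hand, the commenting-out step can be sketched in Python as below. This is only an illustration under stated assumptions: the function name is made up, it works on text you pass in rather than touching the live template, and you should keep a backup of the original file before replacing anything.

```python
def comment_out_hints(text):
    """Prefix every non-empty line that is not already a comment with a semicolon."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(";"):
            out.append(";" + line)  # semicolon starts a comment in zone-file syntax
        else:
            out.append(line)  # blank lines and existing comments stay as they are
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample content for illustration only, not the actual OPNsense template.
    sample = (
        "; root hints\n"
        ".  3600000  NS  A.ROOT-SERVERS.NET.\n"
        "A.ROOT-SERVERS.NET.  3600000  A  198.41.0.4\n"
    )
    print(comment_out_hints(sample), end="")
```

Read the template into a string, pass it through the function, and write the result to a copy first so you can compare it with the original before swapping it in.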
I'll try to dig deeper into the Unbound code, but as I said, I'm very weak in C and don't have much time right now :(

Hi @Fright!

Thanks for helping out.

I can add what I wrote in the other Unbound thread:
Quote from: lar.hed on January 23, 2024, 10:44:08 AM
I need to be more precise I think...

So, my current setup is OPNsense 23.7.11-amd64.

On this I have the two patches earlier referenced:
opnsense-patch a086f40b
opnsense-patch 845fbd384fe


Then I removed two plugins, mDNS and IGMP Proxy, and am now only running UDP Broadcast Relay: https://forum.opnsense.org/index.php?topic=38114.0

Also, since in my case there seems to be some kind of connection to IP address changes or something, I decided to uncheck "Register DHCP Leases" and "Register DHCP Static Mappings".

So that is 6 changes in all. I cannot say that each change has anything to do with the challenge I have with Unbound; however, the changes above have made Unbound stable, with no more 100% CPU-bound episodes. Which one would I vote for? The patches, all day long....

I have had one Unbound stop for which I have no explanation. Monit restarted Unbound directly, and since I'm not at home where the OPNsense box is installed, I have not been able to check anything....

I have not had any more 100% CPU on one core since I made the changes above. Currently I do not know exactly which one is most likely to have solved this, although I have to say that removing the extra plugins should not be the reason....

Did you find any solution? I am running into the same issue since 24.

Errors are:
2024-02-15T01:58:19 Error unbound [47314:1] error: reading root hints /root.hints 7:8: Syntax error, could not parse the RR's type
2024-02-15T01:58:19 Error unbound [47314:3] error: reading root hints /root.hints 2:13: Syntax error, could not parse the RR's type


I didn't change any Unbound config, but it randomly stops working now.

General log also shows:

/usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '47314''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 47314: No such process'
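
That message means the PID recorded in /var/run/unbound.pid no longer belonged to a running process when the kill was attempted. A minimal Python sketch of that kind of check is shown below. The function names are hypothetical, and this is not how rc.linkup itself works; it only illustrates how a monitoring script could tell a stale pid file from a live process.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists (signal 0 sends nothing)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # same condition as "kill: No such process"
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def read_pid(path):
    """Read an integer PID from a pid file, or return None if missing or malformed."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pidfile_status(path):
    """Classify a pid file as 'no-pidfile', 'running' or 'stale'."""
    pid = read_pid(path)
    if pid is None:
        return "no-pidfile"
    return "running" if pid_is_alive(pid) else "stale"


if __name__ == "__main__":
    print(pidfile_status("/var/run/unbound.pid"))
```

A "stale" result right after an interface event would match the log line above: the old Unbound process had already exited before the script tried to terminate it.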

I am running 24.1 nowadays, with the same challenge I might add. You can follow that in the new thread:
https://forum.opnsense.org/index.php?topic=38839.0