Unbound just hit 100% CPU on one core again...

Started by lar.hed, February 14, 2024, 03:36:50 PM

Previous topic - Next topic
So I was happy that this version (24.1) had somehow magically solved the issue with 100% CPU Unbound on one core - it has NOT been solved. It is the same as before:

2024-02-14T15:19:55 Critical unbound [71574:6] fatal error: Could not initialize thread
2024-02-14T15:19:55 Error unbound [71574:6] error: Could not set root or stub hints
2024-02-14T15:19:55 Error unbound [71574:6] error: reading root hints /root.hints 2:8: Syntax error, could not parse the RR's type


So I still need to run "kill -9 <unbound pid>" to kill the process and then restart. It is just in 23.7 - no no difference at all. And this is after 5 days and a bit over 5 hours - however the thing I have to say is that this one could have been kicked of by me starting my PC which is connected direct / native to one of the interfaces. If that can be of any help. Oh and the roots.hint file was/is identical - it is not what is inside the file that is trouble, it is the access to it maybe? as in locked for another process to read it?

Edit: running bare metal with OPNsense 24.1.1-amd64

You could try to use truss or ktrace on the process to find out what the heck it is doing  ;)
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Thanks - I will keep that in mind to next time, which we all of course wish will never happen, but then again, where is that egg timer?


Quote from: Fright on February 14, 2024, 06:36:29 PM
https://forum.opnsense.org/index.php?topic=37973.msg188912#msg188912

Quote from: Fright on February 04, 2024, 08:16:00 AM
still root-hint file read error?
best i can offer for now:
https://github.com/opnsense/core/commit/2e2294c0642cdc537cccd785464059edea4948a6
opnsense-patch -a kulikov-a 2e2294c
then enable "Use built-in root hints" in Services: Unbound DNS: General (with advanced mode "on") and Apply

Although I do not mind running any patches, they have a tendency to never be included in any updates, so in the end I will need to re-apply them after each upgrade. That might be okay also, however since this has been the case since 23.7.x, I would like either this patch or something else (sorry don't know what) more permanent.

Here is the thing: I have a restart Unbound from Monit service enabled with high CPU and/or high CPU temp (actually this time I got the high temp version that kicked in - it should have been the high CPU however my fingers seems to have modded the monit script to not work - my bad). This kind of works as a work-around, just like the patches?

I am seeing the exact same thing.

I tried applying the patches within the 23.7 unbound thread here - https://forum.opnsense.org/index.php?topic=35527.msg187426#msg187426

These haven't appeared to make any real difference. There was a small hotfix update after i applied them so not sure if they get overwritten each time or depending on what the update was.

I have just applied the linked patch and enabled so will see how I go

Could you try
cd /tmp
ktrace -p <pid of misbehaving unbound>
# wait a couple of seconds
ktrace -C
kdump


This will catch all system calls the process performs in that time and state. It will not catch if it's calculating "something" internally. But frequently this gives hints about problems. E.g. server processes trying to open a logfile in a nonexistent directory so they cannot log why the fail to start etc. For file accesses you want to look for NAMI calls, for example.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Patrick, I will do that "next time" since it is running just fine for the moment.

I have to be a more specific on the setup also: I am back to my old config so to speak, so I am NOT using any of the old patches from 23.7. I have also re-enabled DHCP registration ("Registrer DHCP Leases" and "Register DHCP Static Mappings") - and the reason for mentioning this is that my old 23.7 OPNsense installation was  a bit more stable without DHCP stuff enabled.... Remember there might, in my case with bare metal and direct attached LAN PC, something that triggers when I wake up my LAN attached PC...?

And just to be perfectly clear, I have applied the patch mentioned above:
opnsense-patch -a kulikov-a 2e2294c

I have done so to see if that makes any difference - I need to see that, so I am now just waiting for either nothing or, as I think, a  new 100% CPU Unbound... Eggtimer 8)

February 15, 2024, 10:11:45 AM #8 Last Edit: February 15, 2024, 10:16:36 AM by Sensler3000
Iam running in the same issues since 24.1 is there any way to help debug this? It looks like it appears every few days but couldnt find any pattern:

This night it appeard again since monit showed failing DNS resolution in the night. However once my PC was running again it seems to work again without any intervention. So iam not sure if this has some relation or if it just was some random event.

Error when it stopped working was this:

2024-02-15T01:58:19 Error unbound [47314:1] error: reading root hints /root.hints 7:8: Syntax error, could not parse the RR's type
2024-02-15T01:58:19 Error unbound [47314:3] error: reading root hints /root.hints 2:13: Syntax error, could not parse the RR's type
TypeError: an integer is required (got type NoneType)
os.write(self._pipe_fd, res.encode())
File "dnsbl_module.py", line 227, in log_entry
logger.log_entry(query)
File "dnsbl_module.py", line 548, in operate


Suspiciously the time i shutdown my PC was around 2PM and it looks like after that unbnound started failing. Booting it up again in the morning at around 08:50 and unbound worked again ? Could this have something to do with an specific interface going up or down ? Its the first time i see the link between client going up and down influencing unbound but maybe its just random. Error seems to be the same as from the Thread creator:

2024-02-15T08:46:24 Informational unbound [84952:0] info: generate keytag query _ta-4f66. NULL IN
2024-02-15T08:46:23 Informational unbound [84952:0] info: start of service (unbound 1.19.0).
2024-02-15T08:46:23 Notice unbound [84952:0] notice: init module 2: iterator
2024-02-15T08:46:23 Notice unbound [84952:0] notice: init module 1: validator
2024-02-15T08:46:23 Notice unbound [84952:0] notice: init module 0: python
2024-02-15T08:46:20 Notice unbound Closing logger
2024-02-15T01:58:19 Notice unbound Backgrounding unbound logging backend.
2024-02-15T01:58:19 Notice unbound daemonize unbound dhcpd watcher.
2024-02-15T01:58:19 Error unbound [47314:0] error: str: syscall error with errno No error: 0
2024-02-15T01:58:19 Notice unbound [47314:0] notice: failed connection from 127.0.0.1 port 44860
2024-02-15T01:58:19 Informational unbound [47314:0] info: start of service (unbound 1.19.0).
2024-02-15T01:58:19 Critical unbound [47314:1] fatal error: Could not initialize thread
2024-02-15T01:58:19 Critical unbound [47314:3] fatal error: Could not initialize thread
2024-02-15T01:58:19 Informational unbound [47314:3] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
2024-02-15T01:58:19 Informational unbound [47314:3] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-02-15T01:58:19 Informational unbound [47314:1] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
2024-02-15T01:58:19 Informational unbound [47314:1] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2024-02-15T01:58:19 Error unbound [47314:1] error: Could not set root or stub hints
2024-02-15T01:58:19 Error unbound [47314:3] error: Could not set root or stub hints
2024-02-15T01:58:19 Error unbound [47314:1] error: reading root hints /root.hints 7:8: Syntax error, could not parse the RR's type
2024-02-15T01:58:19 Error unbound [47314:3] error: reading root hints /root.hints 2:13: Syntax error, could not parse the RR's type


Also the General log shows this for the morning:

2024-02-15T08:46:20 Error opnsense /usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '47314''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 47314: No such process'
2024-02-15T08:46:10 Error opnsense /usr/local/etc/rc.linkup: The command `/sbin/ifconfig 'bridge0' addm 'igb1'' failed to execute

Quote from: Sensler3000 on February 15, 2024, 10:11:45 AM
Suspiciously the time i shutdown my PC was around 2PM and it looks like after that unbnound started failing. Booting it up again in the morning at around 08:50 and unbound worked again ? Could this have something to do with an specific interface going up or down ? Its the first time i see the link between client going up and down influencing unbound but maybe its just random.

So you have your PC connected directly to the firewall LAN interface, without a switch? That will cause tons of issues if that's the case.

Iam using a MiniPC with multiple ports and connected all Devices directly to it. Its running this way since 2 years so i dont see any issue with that ? I just saw the log so i assumed its worth posting this. Why would a switch between the PC and the Firewall change any of this ? i Dont need a switch the Firewall has enough ports to cover all devices. i dont have any issues with the devices whatsoever only unbound started causing the mentioned issues since the recent update.

Quote from: doktornotor on February 15, 2024, 10:28:46 AM
Quote from: Sensler3000 on February 15, 2024, 10:11:45 AM
Suspiciously the time i shutdown my PC was around 2PM and it looks like after that unbnound started failing. Booting it up again in the morning at around 08:50 and unbound worked again ? Could this have something to do with an specific interface going up or down ? Its the first time i see the link between client going up and down influencing unbound but maybe its just random.

So you have your PC connected directly to the firewall LAN interface, without a switch? That will cause tons of issues if that's the case.

I also have one of my PCs direct connected to OPNsense bare metal - I do not have a "ton" of issues with that, and pre23.7 even Unbound worked without ANY issues. There is something introduced in 23.7, and it is still present in 24.1 obviously, that gives OPNsense challenges at some point - interface up/down is most likely not part of Unbound issue - maybe DHCP registration of name/ip is though, I will have to test that at a later date when my current setup fails (if ever).

So, you bridge all the ports on the firewall and connect ALL devices directly to it?

Well, long story short - you absolutely DO need a switch. Total waste of time debugging similar setups. Anything you unplug disrupts the bridge, disrupts unbound which binds to all the interfaces by default - and generally fires up scripts on interface (un)plug events that disrupt things even further.

Just read the logs:

2024-02-15T08:46:10 Error opnsense /usr/local/etc/rc.linkup: The command `/sbin/ifconfig 'bridge0' addm 'igb1'' failed to execute


Not to mention the horrible performance penalty using SW bridge vs. HW switching with ASIC.

Do not do this. Ever.


Since OP has no bridge but the same error it seems to be unrelated. Also it runs like this for 2 years+ (with no  penalty affecting me) so maybe i was on the wrong track here and the interface has nothing to do with it. Pretty sure its still something with unbound.