Unbound just hit 100% CPU on one core again...

Started by lar.hed, February 14, 2024, 03:36:50 PM

He doesn't bridge them; instead he uses a port for a single device (a PC), which I've been telling him for some time to avoid for the reason you explained. He switches the device on, scripts fire, disruption ensues.
I've stated that this is the expected behaviour and that Unbound should not keel over, but it is adding to the mix.

February 15, 2024, 10:47:00 AM #16 Last Edit: February 15, 2024, 10:49:50 AM by doktornotor
Quote from: lar.hed on February 15, 2024, 10:37:26 AM
I have an 8-port setup, no bridge needed.

The other guy clearly has one (at least).

Once again, the port will go down when you unplug the device from it. Unbound listens on that interface. Unbound will get disrupted. Do NOT do this. You can get an unmanaged 8-port switch pretty much for free from a dumpster, instead of setting up 8 different subnets and routing packets between them for no good reason at all.

SMH.

P.S. The Unbound interfaces are configurable. Exclude anything not permanently connected from its configuration. With similar broken-by-design setups, you will most likely need to listen on WAN only and point any internal devices to the WAN IP address for DNS.

February 15, 2024, 10:47:42 AM #17 Last Edit: February 15, 2024, 10:51:44 AM by Sensler3000
I understand it fires a script when changes on the interface happen. But something still seems to cause the issue when the script fires (at least I assume this based on your answer), so shouldn't we try to debug this instead of saying never bring an interface up and down to avoid running the script?

Also, I assume it's a pretty normal setup for people to connect devices to the firewall itself without a switch?

Also, I'm pretty sure this problem also happened when no device changes were happening (so everything was running). So maybe it's just a coincidence? @lar.hed did you ever see this error triggered when one of your devices connected to or disconnected from the interface?

Quote from: Sensler3000 on February 15, 2024, 10:47:42 AM
I understand it fires a script when changes on the interface happen. But something still seems to cause the issue when the script fires (at least I assume this based on your answer), so shouldn't we try to debug this instead of saying never bring an interface up and down to avoid running the script?

Also, I assume it's a pretty normal setup for people to connect devices to the firewall itself without a switch?
Doubt it. Reason switches exist. People do it, sure. Want to live with the necessary disruption? I don't.

So you're saying the Unbound error is directly connected to not using a switch? Or is this just an assumption? I know it's not best practice, but it worked flawlessly for years, so why did this error suddenly appear with some changes?

February 15, 2024, 10:54:23 AM #20 Last Edit: February 15, 2024, 10:56:10 AM by doktornotor
Quote from: cookiemonster on February 15, 2024, 10:50:43 AM
Quote from: Sensler3000 on February 15, 2024, 10:47:42 AM
I understand it fires a script when changes on the interface happen. But something still seems to cause the issue when the script fires (at least I assume this based on your answer), so shouldn't we try to debug this instead of saying never bring an interface up and down to avoid running the script?

Also, I assume it's a pretty normal setup for people to connect devices to the firewall itself without a switch?
Doubt it. Reason switches exist. People do it, sure. Want to live with the necessary disruption? I don't.

Pretty much. Except for things like a dedicated management port with a /30 subnet or similar, configured carefully so that you can plug your laptop into it with a statically configured IP in order to fix issues on a headless box, and with no unneeded services bound to that interface if possible. SSH (and possibly the web GUI) only.

OK, understood. Since I have also had this issue over the last weeks without any device connecting or disconnecting, I still think it's at least not only related to not using a switch, so I'll try to gather additional data.

Quote from: doktornotor on February 15, 2024, 10:54:23 AM
Pretty much.

I do not run bridged ports, I do have a directly connected PC, and I have ALSO had this challenge with Unbound when my directly connected LAN PC has NOT been used (as in: I am not home, nothing started that particular PC, so no interface up/down). So no, your assumption is not correct in this Unbound case. There have been other things related to interface up/down, but that is a totally different story. Do also note that there is at least ONE virtual installation that has been experiencing this Unbound challenge.

Development!!!! Yihaa - or maybe not, hang on ::)

So I just had another 100% CPU Unbound happening. This time NO root.hints error. So that in a way is great, or is it?

Well, there is still that 100% Unbound CPU usage, I just do not get the root.hints error anymore - and this might be related to the patch mentioned earlier:
opnsense-patch -a kulikov-a 2e2294c

So this removes an "error" about the root.hints file that no one could relate to or anything - simply put, it did not make anyone happier.

So I am still left with the challenge that something gives Unbound trouble at some point in time. My current test approach is to see if I can get it more stable. Do remember that it was a lot more stable at the end of my 23.7 testing, with two patches (not applied) and me removing a few plugins (most likely not involved in anything related to Unbound); the final thing I did then was to NOT update Unbound on DHCP. So now I will uncheck:
1) Register DHCP Leases
2) Register DHCP Static Mappings

This will of course sabotage my name resolution on my intranet - however out of pure luck I have never used name resolution at all on my intranet - everything goes over IP addresses... So the impact for me is slim to none, so this is an easy way to test/validate if the update of any IP address from DHCP might be involved in this Unbound challenge...... Stay tuned.... 8) Starting that egg timer, yet again ::)

Quote from: Patrick M. Hausen on February 14, 2024, 11:24:02 PM
Could you try
cd /tmp
ktrace -p <pid of misbehaving unbound>
# wait a couple of seconds
ktrace -C
kdump


This will catch all system calls the process performs in that time and state. It will not catch if it's calculating "something" internally. But frequently this gives hints about problems, e.g. server processes trying to open a logfile in a nonexistent directory, so they cannot log why they fail to start. For file accesses you want to look for NAMI calls, for example.

At my latest 100% CPU Unbound happening, I added those two commands to my kill script so that it creates something at each restart - I might not be home and all that, so it was an easy way. Well, the files from the two ktrace commands are empty - maybe I am doing something wrong?

pgrep "unbound" | grep -v "$$" | xargs ktrace -p > /home/lars/ktracep_`date +'%y%m%d_%T'`
sleep 5
ktrace -C > /home/lars/ktraceC_`date +'%y%m%d_%T'`
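
A possible explanation for the empty files, offered as a sketch rather than a confirmed diagnosis: ktrace does not print anything to stdout. It writes binary records to a trace file (ktrace.out in the current directory, unless -f names another file), and kdump decodes that file afterwards. Redirecting stdout with > would therefore produce empty files either way. The variant below assumes FreeBSD's ktrace, kdump and pgrep; trace_dir, trace_path and trace_unbound are hypothetical names, and /tmp is only a placeholder directory.

```shell
#!/bin/sh
# Sketch only: trace_dir, trace_path and trace_unbound are hypothetical names.
# ktrace writes to a trace file (-f), not to stdout, so nothing is redirected.

trace_dir="${TRACE_DIR:-/tmp}"

# Build a per-run, per-PID trace file name.
# $1 = timestamp, $2 = pid
trace_path() {
    printf '%s/ktrace_unbound_%s_%s.out\n' "$trace_dir" "$2" "$1"
}

trace_unbound() {
    stamp=$(date +'%y%m%d_%H%M%S')
    # -x matches the exact process name, so this script itself is not matched
    pids=$(pgrep -x unbound)

    # Start tracing each Unbound PID into its own file
    for pid in $pids; do
        ktrace -f "$(trace_path "$stamp" "$pid")" -p "$pid"
    done

    sleep 5
    ktrace -C    # stop tracing again

    # Decode the binary trace files into readable text next to them
    for pid in $pids; do
        out=$(trace_path "$stamp" "$pid")
        kdump -f "$out" > "${out%.out}.txt"
    done
}
```

Calling trace_unbound from the kill script, before the kill itself, should leave one .out and one .txt file per Unbound PID. Grepping the .txt file for NAMI shows the file accesses. If the process issues no system calls at all during those 5 seconds, the decoded file will still be empty, which is a finding in itself.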

Quote from: Patrick M. Hausen on February 14, 2024, 11:24:02 PM
Could you try
cd /tmp
ktrace -p <pid of misbehaving unbound>
# wait a couple of seconds
ktrace -C
kdump


I could catch a crash live today and tried these commands, but the output was empty. To make sure I did it correctly, I tested it on another process and got plenty of output. I tried multiple times, waiting more than 5 minutes, but the output was still empty, so I guess the process is just dead? So the unbound process just sits there at 97% usage and DNS resolution does not work anymore until I kill it.

Output was again:

2024-02-15T19:34:08 Critical unbound [18464:3] fatal error: Could not initialize thread
2024-02-15T19:34:08 Error unbound [18464:3] error: Could not set root or stub hints
2024-02-15T19:34:08 Error unbound [18464:3] error: reading root hints /root.hints 8:14: Syntax error, could not parse the RR's class
TypeError: an integer is required (got type NoneType)
os.write(self._pipe_fd, res.encode())
File "dnsbl_module.py", line 227, in log_entry
mod_env['logger'].log_entry(
File "dnsbl_module.py", line 379, in cache_cb
logger.close()
File "dnsbl_module.py", line 444, in deinit

Not dead but mostly dead  ;D

So the process is occupying one core with just internal calculations of *whatever* without issuing any system calls. Weird ...

DTrace would be the next larger gun.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I might have a way to trigger it now, but I'm still testing. Could you tell me how to use DTrace? Then I'm happy to help with debugging.

No, sorry. I know it exists and I did use it on one occasion or two but I am definitely not fluent.
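
For anyone who wants to try DTrace on a spinning process, a common starting point is to sample its user-level stacks for a few seconds and see where the CPU time goes. The sketch below is untested on OPNsense and assumes the kernel has DTrace support available (on stock FreeBSD the modules load with kldload dtraceall) and that it is run as root. profile_prog and profile_unbound are hypothetical helper names; the 997 Hz sample rate and the 10 second duration are arbitrary choices.

```shell
#!/bin/sh
# Sketch only: profile_prog and profile_unbound are hypothetical helper names.

# Build a D program that samples user-level stacks of the target process
# at $1 Hz and exits after $2 seconds.
profile_prog() {
    printf 'profile-%s /pid == $target/ { @[ustack()] = count(); } tick-%ss { exit(0); }\n' "$1" "$2"
}

# Attach to one PID; the aggregated stacks are printed when dtrace exits.
# $1 = pid of the misbehaving unbound
profile_unbound() {
    dtrace -n "$(profile_prog 997 10)" -p "$1"
}
```

Running profile_unbound with the PID of the stuck process prints the sampled stacks sorted by count, so the most frequent stack at the bottom of the output hints at the loop Unbound is stuck in. Readable function names depend on symbols being available for the binary.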

February 16, 2024, 12:34:15 AM #29 Last Edit: February 16, 2024, 12:38:12 AM by meyergru
I wonder if the problem persists when you revert to a "default" configuration. I can imagine such behaviour when you create an internal loop like a CNAME pointing to itself.

I know that unbound does some magic with appending the default domain when none is given, that could probably lead to something similar (like when there is a machine that has the same name as a "local" domain, such as home.home). Whatever, I would check what happens when you remove all of your overrides.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+