Unbound crashing

Started by seed, August 22, 2023, 08:18:48 AM

Thanks again for the help.
I believe I now have everything set up as required, and will hopefully see fewer issues.

It would appear that our issue may not get any real love based on this reply....

might be the nail in the coffin for opnsense for me....
Its been a good few years


Well as I wrote in my own little thread, there is an extra restart of Unbound that does not behave: https://forum.opnsense.org/index.php?topic=37840.msg186884#msg186884

I can manage through the Monit script.

But I would love to solve this. It used to work before 23.7; now it, well, does not behave as well.

Quote from: joshndroid on January 17, 2024, 07:34:24 AM
It would appear that our issue may not get any real love based on this reply....

might be the nail in the coffin for opnsense for me....
Its been a good few years

Did you test the patch?

Which one?

Quote
opnsense-patch 7406a5067f8
opnsense-patch a086f40b
opnsense-patch 845fbd384fe

One thing I would love to see, though, is an added delay of, say, 3 seconds, for debugging purposes, inside the command:
/usr/local/sbin/pluginctl -c unbound_start

With an added delay, one might see whether this is a collision between different Unbound processes (the stop has not finished before the newly started process is up and running, so they collide). Could anyone who knows how this start command works provide a patch?
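Until such a patch exists, the idea can be tried by hand from a root shell. The following is a minimal sketch, assuming that a plain stop, pause, start sequence is close enough to what the GUI triggers (the GUI path may well do more). PLUGINCTL and DELAY are hypothetical variables for illustration only, not OPNsense settings.

```shell
#!/bin/sh
# Sketch only: stop Unbound, wait, then start it again.
# PLUGINCTL and DELAY are illustrative variables, not OPNsense settings.
PLUGINCTL="${PLUGINCTL:-/usr/local/sbin/pluginctl}"
DELAY="${DELAY:-3}"

restart_unbound_delayed() {
    "$PLUGINCTL" -c unbound_stop
    # Give the old process time to exit before the new one starts.
    sleep "$DELAY"
    "$PLUGINCTL" -c unbound_start
}
```

Calling restart_unbound_delayed after sourcing the file runs the sequence. If the crashes disappear with a delay and return without one, that would point towards a collision between the old and new process.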

I don't know which patch.  I don't have this issue so I haven't been tracking all of the developments.  I just disagree with Josh's perception of things.  It appears the OPNsense team is attempting to fix the issue but aren't being provided enough information and testing support to be able to get it fixed.  Therefore anyone who has this problem should test the provided patches and provide feedback so the investigation can continue.

I understand. However, there are more than likely 10+ installations that have this issue.

January 18, 2024, 10:37:03 AM #112 Last Edit: January 18, 2024, 10:56:45 AM by lar.hed
Anyone that has this issue with Unbound and 100% CPU on one core: May I ask if each and everyone of you could tell me (and everyone else) which CPU type / Bare metal / Virtualization you are running on? Reason: wonder if it could be a performance kind of thing that is part of this....

I'm on Intel i7-8550, 8 threads and 4 cores (yea I know I say 8 cores all the time - but that is another story...). Baremetal, 16GB.

Edit: And also, let me know if any of the interfaces has a direct connection to the OPNsense, for example a PC connected direct to LAN interface (the one used for setup for example) without any switch or anything between?

I apologize in advance, this is a really long post. But I've tried to give as much config detail and test results as I can for @franco to work with :), and I'm willing to do some detailed troubleshooting if that's what it takes.

I am experiencing Unbound crashes as well. Unbound crashes every 1-2 days, causing me to lose all DNS on my home network. I am able to restart Unbound from the Dashboard after logging into the GUI using the router's IP address. I do not recall seeing 100% CPU (but I will confirm next time it happens). The only error I receive is

2024-01-21T01:57:34 Notice kernel: <6>pid 10957 (unbound), jid 0, uid 59: exited on signal 11

because I don't yet have more logging enabled.

Hardware: Deciso DEC740 (AMD Ryzen Embedded V1500B, 4 cores / 8 threads; 4 GB RAM; 128 GB internal storage)
Firmware: Opnsense Business 23.10.1_2, Commit 23fed1bcf
Unbound version: 1.19.0

Connectivity Audit: Pass, looks normal
Health Audit: Pass, looks normal
Security Audit: 1 Problem found: openssl111-1.1.1w, OpenSSL -- DoS in DH generation, CVE-2023-5678

Plugins installed:
os-OPNBEcore 1.2_1
os-git-backup 1.0_3
os-mdns-repeater 1.1_1
os-wireguard 2.5_2
os-wol 2.4_2


I have used Opnsense Business for approximately two years, and I did not experience this issue until I upgraded from Opnsense Business 23.10 to Opnsense Business 23.10.1 approximately two weeks ago.

My use case isn't very complicated: I am using Unbound without DnsCrypt-proxy or Dnsmasq. The only mildly complicated part of my configuration is my several VLANs and the Mullvad VPN client on my client machine. This is a home network and there is exactly one user: me :)

In answer to @lar.hed's network architecture question, I have one Deciso DEC740 RJ-45 port connected to a Netgear GS105E switch; and a Ubiquiti Unifi AC-Lite AP, which is how I use the network day-to-day (that is, wirelessly). The only mildly complicated part is the one LAGG, 5 VLANs, and perhaps the DHCP Options 121 and 249. I'm not a network engineer so there might be a few bugs in this part of the Opnsense/switch/AP configuration. I can provide more details if needed, up to and including the OS image (privately).

I am a 20+ year Linux user, and my day job is engineering hardware, firmware, and software for embedded devices. I do not know my way around FreeBSD very well though, unfortunately, but I am willing to continue troubleshooting if @franco has any more requests or ideas.

I have read through this thread in its entirety, but I have not yet attempted the patches 7406a50, a086f40, and 845fbd3. I will try each of them, but in the meantime, here is what I have examined:

I have looked through the logs, and I do not think my Unbound is restarting upon receiving a new DHCP lease. The logs in /var/log/system/ suggest DHCP renews from my ISP every 12 hours. I am not totally positive though.

I have hashed the root hints files, as @franco suggested previously:

root@OPNsense:~ # shasum -a 256 /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints /var/unbound/root.hints /root/named.root
a003be56acb66b2c9f77fb4685919bba36094f631b8b2f9bb6599220ebe31219  /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints
a003be56acb66b2c9f77fb4685919bba36094f631b8b2f9bb6599220ebe31219  /var/unbound/root.hints
f91549a77840b2d306fd49ad01facda1f4d4de0795f9f60844d6aea87a156429  /root/named.root

root@OPNsense:~ # md5sum /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints /var/unbound/root.hints /root/named.root
d090610a892c2e476d93042dc70dc393  /usr/local/opnsense/service/templates/OPNsense/Unbound/core/root.min.hints
d090610a892c2e476d93042dc70dc393  /var/unbound/root.hints
d22f17ab89749f32679cb1810d4b6109  /root/named.root


The root.min.hints file and the root.hints file match. However, the /root/named.root I downloaded from https://www.internic.net/domain/named.root does not match. Not only are the dates different, but the IPv{4,6} addresses of the B server have also changed:

root@OPNsense:~ # diff -u /var/unbound/root.hints /root/named.root
--- /var/unbound/root.hints 2024-01-21 17:16:17.563320000 +0000
+++ /root/named.root 2024-01-21 17:36:08.329604000 +0000
@@ -8,10 +8,10 @@
;           file                /domain/named.cache
;           on server           FTP.INTERNIC.NET
;       -OR-                    RS.INTERNIC.NET
+;
+;       last update:     December 20, 2023
+;       related version of root zone:     2023122001
;
-;       last update:     July 09, 2018
-;       related version of root zone:     2018070901
-;
; FORMERLY NS.INTERNIC.NET
;
.                        3600000      NS    A.ROOT-SERVERS.NET.
@@ -21,8 +21,8 @@
; FORMERLY NS1.ISI.EDU
;
.                        3600000      NS    B.ROOT-SERVERS.NET.
-B.ROOT-SERVERS.NET.      3600000      A     199.9.14.201
-B.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:200::b
+B.ROOT-SERVERS.NET.      3600000      A     170.247.170.2
+B.ROOT-SERVERS.NET.      3600000      AAAA  2801:1b8:10::b
;
; FORMERLY C.PSI.NET
;


I am not knowledgeable enough about DNS to know: if the B server IP addresses are different or wrong, could that cause any sort of problem related to the possible parsing issue? Or will the protocol just use all the other root servers? I would also think that all Opnsense installations would be observing repeatable Unbound failures if this were the cause.

ICMP (ping) is responsive for all four IPv{4,6} addresses for the B server (the Unbound version and the authoritative version from InterNIC). I attempted DNS lookups direct to each, but apparently 'dig' and 'nslookup' are not installed on Opnsense, and I don't see packages for them :(

Finally some general Opnsense and FreeBSD questions:

- What is "DoT"?
- What is "so-reuseport"?
- How do I restart the Unbound service from the command line?
- Which CLI text editors are installed by default? I installed vim using 'pkg install vim', because I could not find any of my usual text editors.
- Which command-line tools are available to query DNS?

I'll try the three patches so far and report back. Please let me know if there is anything else I should try or any other information I should try to collect.

Thanks!

Let's start with the questions in the end:

DoT => DNS over TLS - https://en.wikipedia.org/wiki/DNS_over_TLS

"so-reuseport" => I don't know exactly where this option is set in the GUI (or perhaps it goes directly into the Unbound config file?), but I think it has to do with parallel handling, and it might give better UDP performance. One can read a bit about it at: https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html

- How do I restart the Unbound service from the command line?

/usr/local/sbin/pluginctl -c unbound_stop
/usr/local/sbin/pluginctl -c unbound_start


- Which CLI text editors are installed by default?

The only one I know, and the one I use, is vi - it is very basic and not for each and everyone, but it always seems to be on every Linux/Unix installation by default :-)


Then about your challenge: since you can restart from the GUI, you are not hit by the high CPU load issue that some of us others are seeing. When Unbound sits at 100% CPU on one core, you have to kill it with kill -9, and you don't need to do that....

Do you have Monit set up to auto-restart Unbound? If not, do so and you have a band-aid solution... See my attached screendump.
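For anyone without Monit configured, the same band-aid can be sketched as a plain watchdog script. This is not the Monit configuration from the screendump, just an illustration of the idea: look for an unbound process and start the service if none is found. PLUGINCTL and PGREP are hypothetical variables used only to keep the sketch easy to try out.

```shell
#!/bin/sh
# Sketch only: start Unbound when no unbound process is found.
# PLUGINCTL and PGREP are illustrative variables, not OPNsense settings.
PLUGINCTL="${PLUGINCTL:-/usr/local/sbin/pluginctl}"
PGREP="${PGREP:-pgrep}"

unbound_watchdog() {
    # pgrep -x matches the process name exactly.
    if "$PGREP" -x unbound >/dev/null 2>&1; then
        echo "unbound is running"
    else
        echo "unbound is not running, starting it"
        "$PLUGINCTL" -c unbound_start
    fi
}
```

Note that a check like this only catches the case where Unbound has exited. It would not notice a process that is still alive but stuck at 100% CPU.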

Regarding patches, I would only install:
opnsense-patch a086f40b
opnsense-patch 845fbd384fe


One last thing: until recently I had both mDNS and UDP Broadcast Relay installed (and even one more in the same "group"). When I removed mDNS (and the third one, whose name I cannot even recall), Unbound actually became a lot more stable. I cannot see why this happened, but it did. So maybe consider removing mDNS (as in removing it completely from the plugins, not just disabling it) and running only UDP Broadcast Relay?

Quote from: CJ on January 17, 2024, 05:22:13 PM
I don't know which patch.  I don't have this issue so I haven't been tracking all of the developments.  I just disagree with Josh's perception of things.  It appears the OPNsense team is attempting to fix the issue but aren't being provided enough information and testing support to be able to get it fixed.  Therefore anyone who has this problem should test the provided patches and provide feedback so the investigation can continue.

There appear to be others here who understand under the hood a lot more than I. I am unsure on what other logs are required, apart from the one within unbound? I can happily provide.


Quote from: lar.hed on January 18, 2024, 10:37:03 AM
Anyone that has this issue with Unbound and 100% CPU on one core: May I ask if each and everyone of you could tell me (and everyone else) which CPU type / Bare metal / Virtualization you are running on? Reason: wonder if it could be a performance kind of thing that is part of this....

I'm on Intel i7-8550, 8 threads and 4 cores (yea I know I say 8 cores all the time - but that is another story...). Baremetal, 16GB.

Edit: And also, let me know if any of the interfaces has a direct connection to the OPNsense, for example a PC connected direct to LAN interface (the one used for setup for example) without any switch or anything between?

I was running on a AMD 2700, 16GB on an SSD... Plenty of horsepower.

I decided to try and reduce power consumption around my place so I have today moved to a Intel 8500T dell micro setup with an m.2 to intel ethernet setup, 16gb ram. Also may as well try a switch from AMD to Intel to see if that makes any difference at all.

LAN setup has always been from the router into a switch

@lar.hed: Thanks for the answers.

I checked and I have so-reuseport: yes in unbound.conf, which is a sane default. I'll consider adjusting this after I've tried both patches.

But also after re-reading: earlier in the thread, @karlson2k had Unbound crashes with both "yes" and "no" settings. I took away that there was perhaps some small correlation to Unbound's rate of failure, but not enough to say with any confidence that this setting is adjacent to the cause (plus the small sample size).

I definitely did not check for vi (whoops!) but now I have vim so I am set :)

Since I don't see 100% CPU, perhaps that suggests multiple bugs with similar symptoms?

I do not have Monit set up to auto-restart DNS. I was unaware of the feature, and I might try it if the problem becomes worse, but for now I will just re-set it manually because that makes it easier for me to monitor. For my use case and symptoms, it's annoying, but not a huge impact... yet. And now that I'm trying to fix it, I actually WANT the problem to recur frequently so I can catch it in the act ;D

I downloaded the patch files to /root. Next time the problem occurs, I will apply a086f40 before re-starting Unbound.

Unbound is on UDP/53 and mDNS is UDP/5353 so I don't think there would be a conflict. I didn't know about the udp-broadcast-relay plugin though, I'll look at it and consider switching.

Hopefully my DNS will crash again soon...





Quote from: joshndroid on January 23, 2024, 06:39:33 AM
There appear to be others here who understand under the hood a lot more than I. I am unsure on what other logs are required, apart from the one within unbound? I can happily provide.

Based on everything upthread, I think this is one of those frustrating bugs where system logs unfortunately don't help much. Cranking the log level up changes the code path so much that the bug doesn't happen anymore, or happens a lot less frequently :(

Have you tested the patches offered upthread? One of the best ways you can help troubleshoot is to apply them one at a time, or in various combinations. I'm not an OPNsense developer so I don't really know for sure, but from upthread I think the first patch was unsuccessful, so I recommend trying the second and third patches:

The second patch, reply # 46:
Quote from: franco on September 14, 2023, 02:07:55 PM
Here is the promised patch:

https://github.com/opnsense/core/commit/a086f40b

# opnsense-patch a086f40b


Cheers,
Franco

The third patch, reply # 83:
Quote from: karlson2k on October 25, 2023, 07:08:43 AM
Quote
https://github.com/opnsense/core/commit/845fbd384fe

# opnsense-patch 845fbd384fe
This patch significantly changed the situation.
Unbound is not crashing anymore, while without this patch Unbound was crashing daily.
I'm testing it for several days. The settings were chosen to trigger crash as much as possible (no debugging logging, parallel threads).

Probably without this patch the file is created in parallel with normal Unbound startup.
With this patch the file is created always before the start of Unbound.

Even though the author reported a crash two weeks later (reply # 84), the patch still definitely made a difference.

You can download a patch directly onto your OPNsense device by using its GitHub URL and adding ".patch" to the end:

root@OPNsense:~ # cd /root
root@OPNsense:~ # /usr/local/bin/curl https://github.com/opnsense/core/commit/7406a5067f8.patch -o 7406a5067f8.patch
root@OPNsense:~ # /usr/local/bin/curl https://github.com/opnsense/core/commit/a086f40b.patch -o a086f40b.patch
root@OPNsense:~ # /usr/local/bin/curl https://github.com/opnsense/core/commit/845fbd384fe.patch -o 845fbd384fe.patch


To apply, use the patch command:

root@OPNsense:~ # /usr/bin/patch --dry-run --backup --directory /usr/local --strip 2 --unified --version-control numbered < 7406a5067f8.patch
root@OPNsense:~ # /usr/bin/patch --dry-run --backup --directory /usr/local --strip 2 --unified --version-control numbered < a086f40b.patch
root@OPNsense:~ # /usr/bin/patch --dry-run --backup --directory /usr/local --strip 2 --unified --version-control numbered < 845fbd384fe.patch


I intentionally added the --dry-run argument to prevent breakage from blindly copying and pasting things from the Internet ;) Only if the patch command succeeds after the dry run should you remove the --dry-run argument, which will write the changes to the file you are patching. A dry run will look something like:

root@OPNsense:~ # patch --dry-run --backup --directory /usr/local --strip 2 --unified --version-control numbered < 845fbd384fe.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|From 845fbd384fe564a8b436a5a6475952f90183c188 Mon Sep 17 00:00:00 2001
|From: Franco Fichtner <franco@opnsense.org>
|Date: Fri, 13 Oct 2023 12:54:09 +0200
|Subject: [PATCH] unbound: diagnose tool for strange unbound issue
|
|PR: https://forum.opnsense.org/index.php?topic=36425.0
|---
| src/etc/inc/plugins.inc.d/unbound.inc | 6 +++++-
| 1 file changed, 5 insertions(+), 1 deletion(-)
|
|diff --git a/src/etc/inc/plugins.inc.d/unbound.inc b/src/etc/inc/plugins.inc.d/unbound.inc
|index f74ba58e78b..0b77f131c13 100644
|--- a/src/etc/inc/plugins.inc.d/unbound.inc
|+++ b/src/etc/inc/plugins.inc.d/unbound.inc
--------------------------
Patching file etc/inc/plugins.inc.d/unbound.inc using Plan A...
Hunk #1 succeeded at 143.
Hunk #2 succeeded at 287.
done
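The "dry run first, then apply" routine can also be wrapped in a small helper so the real run only happens when the dry run succeeds. This is a minimal sketch using the same patch options as above; apply_if_clean and PATCH_BIN are hypothetical names made up for this example.

```shell
#!/bin/sh
# Sketch only: apply a downloaded patch file only if the dry run succeeds.
# apply_if_clean and PATCH_BIN are illustrative names.
PATCH_BIN="${PATCH_BIN:-/usr/bin/patch}"

apply_if_clean() {
    patchfile="$1"
    # Same options as the commands above, collected in one place.
    set -- --backup --directory /usr/local --strip 2 --unified \
        --version-control numbered
    if "$PATCH_BIN" --dry-run "$@" < "$patchfile"; then
        "$PATCH_BIN" "$@" < "$patchfile"
    else
        echo "dry run failed for $patchfile, nothing changed" >&2
        return 1
    fi
}
```

For example, apply_if_clean /root/845fbd384fe.patch would run the dry run and, only if every hunk succeeds, write the change. The --backup option still leaves a numbered copy of the original file in case it needs to be restored.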


Quote from: joshndroid on January 23, 2024, 06:39:33 AM
I decided to try and reduce power consumption around my place so I have today moved to a Intel 8500T dell micro setup with an m.2 to intel ethernet setup, 16gb ram. Also may as well try a switch from AMD to Intel to see if that makes any difference at all.

I would be really, really surprised if this were a CPU-related problem. In my experience, this kind of issue will be in upstream Unbound, OPNsense's patches to the upstream, or the FreeBSD kernel... or even some interaction between more than one of these!

I need to be more precise, I think...

So, my current setup is OPNsense 23.7.11-amd64.

On this I have the two patches earlier referenced:
opnsense-patch a086f40b
opnsense-patch 845fbd384fe


Then I removed two plugins, mDNS and IGMP Proxy, and am now running only UDP Broadcast Relay: https://forum.opnsense.org/index.php?topic=38114.0

Also, since in my case there seems to be some kind of connection to IP address changes or something similar, I decided to uncheck "Register DHCP Leases" and "Register DHCP Static Mappings".

So that is 6 changes in all. I cannot say that each change has anything to do with the challenge I have with Unbound; however, together the changes above have stopped Unbound from getting stuck at 100% CPU. Which one would I vote for? Patches all day long....

I have had one Unbound stop which I have no reference to why. Monit restarted Unbound directly and since I'm not at home where the OPNsense is installed, I have not been able to check anything....

Unbound isn't crashing as often as I had seen before (a week passed without a crash)... but I did see a crash yesterday.

I installed FreeBSD ports and compiled and installed Unbound without stripping debug symbols. I also configured the kernel to write a core file at the next crash, so hopefully the crashes continue with the un-stripped Unbound :)
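For anyone who wants to capture a core file as well, the kernel side might look roughly like the sketch below. These are standard FreeBSD sysctls, but the exact settings used above were not posted, so the core file directory, the function name, and the SYSCTL variable are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: FreeBSD sysctls commonly involved in getting a core file.
# The function name, SYSCTL variable, and /var/tmp location are assumptions.
SYSCTL="${SYSCTL:-/sbin/sysctl}"

enable_unbound_coredumps() {
    # Allow core dumps at all (normally already the default).
    "$SYSCTL" kern.coredump=1
    # Also dump processes that changed credentials, as a daemon does when it drops root.
    "$SYSCTL" kern.sugid_coredump=1
    # Core file name pattern: %N is the process name, %P the PID.
    "$SYSCTL" kern.corefile=/var/tmp/%N.%P.core
}
```

Values set this way do not survive a reboot, and whether the chosen directory is writable from Unbound's runtime environment would need to be checked on the box itself.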

Two other things of note:


  • Apparently the first two patches (a086f40b and 7406a5067f8) are already present in my installation, probably because I use OPNsense Business Edition. The third patch (845fbd384fe) is not installed, but I will try it after I get a core dump or two.
  • Signal 11 is SIGSEGV, aka a "segfault". That at least gives a small clue that Unbound is trying to access memory it should not (a NULL pointer dereference or something similar), and so it is killed by the kernel.
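As a small aside, the signal number in a kernel log line like the one posted earlier can be translated on the command line. This is a minimal sketch; signal_from_log is a hypothetical helper name, and the log line is the one from the earlier post.

```shell
#!/bin/sh
# Sketch only: pull the signal number out of a kernel "exited on signal" line.
# signal_from_log is an illustrative helper name.
signal_from_log() {
    printf '%s\n' "$1" | sed -n 's/.*exited on signal \([0-9][0-9]*\).*/\1/p'
}

line='2024-01-21T01:57:34 Notice kernel: <6>pid 10957 (unbound), jid 0, uid 59: exited on signal 11'
sig="$(signal_from_log "$line")"
# kill -l with a number prints the signal name, usually without the SIG prefix.
echo "signal $sig is $(kill -l "$sig")"
```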