Unbound crashing

Started by seed, August 22, 2023, 08:18:48 AM

Or a third option: the file is being created in parallel with the start of Unbound. In that case the file content itself would end up OK, but Unbound may read it while it is still incomplete and fail.
I summarised possible options here: https://forum.opnsense.org/index.php?topic=35527.msg177257#msg177257
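
For reference, the usual way to make such a parallel write safe is to build the file under a temporary name and rename it into place; a minimal sketch in shell (paths and the fetch source are illustrative, not necessarily what OPNsense actually does):

# build the new hints file under a temporary name
TMP=$(mktemp /var/unbound/root.hints.XXXXXX)
fetch -q -o "$TMP" https://www.internic.net/domain/named.root
# rename() is atomic on the same filesystem, so a reader sees either
# the old complete file or the new complete file, never a partial one
mv "$TMP" /var/unbound/root.hints

If the generation script instead writes the file in place, the window described above would exist.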

We're talking about a very narrow window of opportunity here that seems to be 100% reproducible? I'm still a bit sceptical. But you could just try to make a copy of the file and change the source code for the unbound.conf to point to that one that is not touched...
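
Something like this, assuming the generated file lives at /var/unbound/root.hints and the config uses the standard root-hints directive (both assumptions on my part):

# make a copy that the reload scripts never touch
cp /var/unbound/root.hints /var/unbound/root.hints.static

# then have the generated unbound.conf point at the untouched copy:
# server:
#     root-hints: "/var/unbound/root.hints.static"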

These theories are very easy to test when you can reproduce. If not it's impossible.


Cheers,
Franco

October 11, 2023, 08:44:56 AM #77 Last Edit: October 12, 2023, 10:54:25 AM by nerf

The attached screenshot from the system log MAY be relevant. Going back to DNSMasq until I can figure out what is actually happening.



Quote from: franco on October 10, 2023, 12:52:32 PM
We're talking about a very narrow window of opportunity here that seems to be 100% reproducible?
It happens one time out of 150-200 starts. So it is far from 100% reproducible.
Quote from: franco on October 10, 2023, 12:52:32 PM
I'm still a bit sceptical. But you could just try to make a copy of the file and change the source code for the unbound.conf to point to that one that is not touched...

These theories are very easy to test when you can reproduce. If not it's impossible.

I tried to isolate the Unbound-related changes from the other changes in OPNsense between 23.1 and 23.7, but I failed.
There are too many changes, and I'm not sure all of them could be reverted without touching other components or frameworks.

Could you, please, point me to the easiest way to test OPNsense changes for Unbound isolated from other changes?


Tried with the "diagnose tool" patch.

After ~100 restarts, Unbound silently crashed.
Had to (re-)start it manually.

The log:

2023-10-18T11:00:00 Informational unbound 24371 [24371:1] info: generate keytag query _ta-4f66. NULL IN
2023-10-18T11:00:00 Informational unbound 24371 [24371:0] info: start of service (unbound 1.18.0).
2023-10-18T11:00:00 Notice unbound 24371 [24371:0] notice: init module 2: iterator
2023-10-18T11:00:00 Notice unbound 24371 [24371:0] notice: init module 1: validator
2023-10-18T11:00:00 Notice unbound 24378 daemonize unbound dhcpd watcher.
2023-10-18T11:00:00 Notice unbound 24371 [24371:0] notice: init module 0: python
2023-10-18T10:52:02 Notice unbound 23736 [23736:0] notice: init module 2: iterator
2023-10-18T10:52:02 Notice unbound 23736 [23736:0] notice: init module 1: validator
2023-10-18T10:52:02 Notice unbound 23743 daemonize unbound dhcpd watcher.
2023-10-18T10:52:02 Notice unbound 23736 [23736:0] notice: init module 0: python


Relevant record from the system log:
2023-10-18T10:52:02 Notice kernel <6>pid 23736 (unbound), jid 0, uid 59: exited on signal 11
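
For anyone who wants to check their own box for the same pattern, something like this should find it (the log locations are a guess and may differ between releases):

# search the system log for unbound exiting on a signal
grep -ih 'unbound.*exited on signal' /var/log/system/*.log /var/log/messages 2>/dev/null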

I'm sorry to say the root.hints theory then appears to be a straw man and Unbound crashes for other reasons.


Cheers,
Franco

The behaviour is different.
Unbound just crashed and I was able to simply restart it in the GUI.
Without the test patch Unbound used 100% CPU and had to be killed with 'kill -9' in the CLI.

So something has changed.

October 25, 2023, 07:08:43 AM #83 Last Edit: October 26, 2023, 01:21:37 PM by karlson2k
Quote
https://github.com/opnsense/core/commit/845fbd384fe

# opnsense-patch 845fbd384fe
This patch significantly changed the situation.
Unbound is not crashing anymore, while without this patch Unbound was crashing daily.
I've been testing it for several days with settings chosen to trigger the crash as often as possible (no debug logging, multiple threads).

Probably, without this patch the file is created in parallel with normal Unbound startup.
With this patch the file is always created before Unbound starts.

Only after two weeks of running with # opnsense-patch 845fbd384fe did I get a freeze of Unbound again.

The last lines in the log:
2023-11-01T16:06:07 Notice unbound 17147 [17147:0] notice: init module 0: python
2023-11-01T16:03:33 Informational unbound 15198 [15198:0] info: service stopped (unbound 1.18.0).
2023-11-01T15:53:34 Informational unbound 15198 [15198:1] info: generate keytag query _ta-4f66. NULL IN

Note: at 16:06 I manually killed the unbound process with kill -9.

The system log:
2023-11-01T16:06:07 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : webgui_configure_do(,opt1))
2023-11-01T16:06:07 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : vxlan_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : unbound_configure_do(,opt1))
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : openssh_configure_do(,opt1))
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : opendns_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : ntpd_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : dnsmasq_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (,opt1)


Observations: Unbound wasn't actually stopped.

Probably the problem is not with the Unbound start, but with the Unbound stop procedure.

OPNsense thinks that Unbound is stopped (as reported in the Unbound log), but Unbound is actually still running. Starting a new Unbound in parallel, while another copy is still running, results in various problems.

The previously reported high CPU load was produced by the Unbound process that was being stopped, not by the freshly started one.
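
If that is what happens, the stop step would need to wait until the old PID is really gone before a new instance is started. A rough sketch of the idea, with the pid file location as an assumption on my part:

# hypothetical wrapper around the stop step
PIDFILE=/var/run/unbound.pid
if [ -f "$PIDFILE" ]; then
    OLDPID=$(cat "$PIDFILE")
    kill "$OLDPID" 2>/dev/null
    # wait up to ~30 seconds for the old process to actually exit
    i=0
    while kill -0 "$OLDPID" 2>/dev/null && [ "$i" -lt 30 ]; do
        sleep 1
        i=$((i + 1))
    done
fi
# only now is it safe to start the new instance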

November 08, 2023, 05:19:26 AM #85 Last Edit: November 08, 2023, 05:24:02 AM by LOTRouter
I've been troubleshooting this issue for months and didn't fully realize it was Unbound until reading this post.  Now when my internet goes down, top shows the unbound process at 100%.  I block DoH, intercept all DNS, and forward all Unbound requests to DNSCrypt-Proxy, which means I have no fallback when Unbound goes stupid.  When this setup works, it works wonderfully.  However, when it stops, all DNS queries on my network go unanswered.  Even with this patch my unbound process would end up at 100% utilization nightly and I had to kill -9 <pid> to get it to recover.

For now I've switched to using DNSCrypt-Proxy natively, and I haven't seen a problem in almost two weeks.  However, I'm willing to go back to Unbound in the name of testing and troubleshooting for any developer who is willing to look into fixing this. I just need hints on what to collect for when it goes stupid.

I do have "Flush DNS Cache during reload" enabled, and I now wonder whether that exacerbated this issue.
Topton 4 x i225-v (Core i5-1135G7 * 32GB * 512SSD)
Xfinity Gigabit (1.2G Down * 200M Up)

Quote from: LOTRouter on November 08, 2023, 05:19:26 AM
I do have "Flush DNS Cache during reload" enabled, and I now wonder whether that exacerbated this issue.
BTW, I have it enabled as well.
Maybe "disabled" just slows down startup enough to mask the issue?

November 10, 2023, 05:35:06 PM #87 Last Edit: November 10, 2023, 05:38:39 PM by zentoo
I'm in the same boat: I spent two days trying to understand what was happening with Unbound on one of our OPNsense servers, since the client servers using it as a DNS resolver regularly got timeouts on DNS requests.

When the problem arose, one CPU was stuck at 100% by the unbound process.
Stopping or restarting Unbound from the web UI froze the web UI. I had to kill -9 the unbound PID in order to start it back up.

I understood later that the Unbound restart was the culprit, because I didn't have this problem before.

To explain in more detail: we have two OPNsense servers as master/slave firewalls, with the master executing a cron job every minute to synchronize the slave configuration using the "HA update and reconfigure backup" command.
A few days ago we updated OPNsense on the slave and used it as temporary master for validation before updating the master one, and for reasons outside this topic we left the slave OPNsense acting as temporary master.
8 hours later the unbound process got stuck using 100% of one CPU and Unbound started to generate timeouts for clients.
The process stayed stuck for hours while I was investigating the issue on the client side, before I understood the problem was the OPNsense unbound process.

I continued to monitor the process after killing and restarting it, and I was surprised to observe that Unbound restarted every minute even though the box is not the master OPNsense. So I understood that Unbound is restarted at each master/slave synchronization (with the Unbound service selected for synchronization), and that leads to the issue where Unbound gets stuck with one CPU at 100% after a while.

It seems there is a kind of race condition when the unbound process restarts.
I don't know enough about how services are managed on OPNsense, but it's possibly a problem with flock and/or PID detection/creation/deletion.
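
If overlapping restarts racing each other is the cause, one way to test the theory is to serialize whatever triggers the restart with FreeBSD's lockf(1); the lock file name here is just an example:

# only one restart at a time; give up after 60 seconds instead of piling up
lockf -t 60 /tmp/unbound_restart.lock pluginctl unbound_start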

I have observed another problem: when the unbound process is stuck and the sync cron tries to restart Unbound, a new mount point can appear: devfs on /var/unbound/dev (devfs).
So after a while you can see the same /var/unbound/dev mount point several times, and that needs to be cleaned up manually.
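
With Unbound stopped, the stale duplicates can be listed and removed like this (a sketch; don't unmount the one a running instance is actually using):

# show how many copies of the mount have piled up
mount -t devfs | grep /var/unbound/dev

# unmount the duplicates one by one
while mount -t devfs | grep -q /var/unbound/dev; do
    umount /var/unbound/dev || break
done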

Looks like it is not a restart problem, but it is a start problem.

The process stuck at 100% CPU is the new Unbound process. Not the old one.

Two things may significantly reduce the chances of startup problems: log level 3 or higher, and keeping "Flush DNS Cache during reload" disabled.
Both seem to just slow down the startup enough to avoid hitting the race condition.

To reproduce:
* Set Unbound log level to 1
* Enable "Flush DNS Cache during reload"
* Run as root: sh -c 'while :; do  pluginctl unbound_start; sleep 20; done'

After a few iterations the startup problem should be triggered.
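
To catch the bad iteration automatically, the loop can be extended with a quick health check; a sketch assuming drill and top from the base system, with an arbitrary query name:

i=0
while :; do
    i=$((i + 1))
    pluginctl unbound_start
    sleep 20
    # if the resolver stops answering, record which iteration broke it
    if ! drill opnsense.org @127.0.0.1 > /dev/null 2>&1; then
        echo "iteration $i: unbound not answering" >&2
        top -b -o cpu | head -n 15
        break
    fi
done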

The problem seems to be OPNsense-only.

I found no reports of similar problems on Linux, plain FreeBSD, pfSense, or anywhere else.
However, it is easy to find several similar reports for OPNsense.