Unbound crashing

Started by seed, August 22, 2023, 08:18:48 AM

Or a third option: the file is being created in parallel with the start of Unbound. In that case the file content itself would end up OK, but Unbound may read it while it is still incomplete and fail.
I summarised possible options here: https://forum.opnsense.org/index.php?topic=35527.msg177257#msg177257
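
For reference, the usual way to make such a parallel write safe is to build the file under a temporary name and rename it into place; a minimal sketch in shell (paths and the fetch source are illustrative, not necessarily what OPNsense actually does):

# build the new hints file under a temporary name
TMP=$(mktemp /var/unbound/root.hints.XXXXXX)
fetch -q -o "$TMP" https://www.internic.net/domain/named.root
# rename() is atomic on the same filesystem, so a reader sees either
# the old complete file or the new complete file, never a partial one
mv "$TMP" /var/unbound/root.hints

If the generation script instead writes the file in place, the window described above would exist.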

We're talking about a very narrow window of opportunity here that seems to be 100% reproducible? I'm still a bit sceptical. But you could just try to make a copy of the file and change the source code for the unbound.conf to point to that one that is not touched...
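
Something like this, assuming the generated file lives at /var/unbound/root.hints and the config uses the standard root-hints directive (both assumptions on my part):

# make a copy that the reload scripts never touch
cp /var/unbound/root.hints /var/unbound/root.hints.static

# then have the generated unbound.conf point at the untouched copy:
# server:
#     root-hints: "/var/unbound/root.hints.static"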

These theories are very easy to test when you can reproduce. If not it's impossible.


Cheers,
Franco

October 11, 2023, 08:44:56 AM #77 Last Edit: October 12, 2023, 10:54:25 AM by nerf

The attached screenshot from the system log MAY be relevant. Going back to DNSMasq until I can figure out what is actually happening.



Quote from: franco on October 10, 2023, 12:52:32 PM
We're talking about a very narrow window of opportunity here that seems to be 100% reproducible?
It happens one time out of 150-200 starts. So it is far from 100% reproducible.
Quote from: franco on October 10, 2023, 12:52:32 PM
I'm still a bit sceptical. But you could just try to make a copy of the file and change the source code for the unbound.conf to point to that one that is not touched...

These theories are very easy to test when you can reproduce. If not it's impossible.

I tried to isolate the Unbound-related changes from the other changes in OPNsense between 23.1 and 23.7, but I failed.
There are too many changes, and I'm not sure all of them could be reverted without touching other components or frameworks.

Could you, please, point me to the easiest way to test OPNsense changes for Unbound isolated from other changes?


Tried with the "diagnose tool" patch.

After ~100 restarts, Unbound silently crashed.
Had to (re-)start it manually.

The log:

2023-10-18T11:00:00 Informational unbound 24371 [24371:1] info: generate keytag query _ta-4f66. NULL IN
2023-10-18T11:00:00 Informational unbound 24371 [24371:0] info: start of service (unbound 1.18.0).
2023-10-18T11:00:00 Notice unbound 24371 [24371:0] notice: init module 2: iterator
2023-10-18T11:00:00 Notice unbound 24371 [24371:0] notice: init module 1: validator
2023-10-18T11:00:00 Notice unbound 24378 daemonize unbound dhcpd watcher.
2023-10-18T11:00:00 Notice unbound 24371 [24371:0] notice: init module 0: python
2023-10-18T10:52:02 Notice unbound 23736 [23736:0] notice: init module 2: iterator
2023-10-18T10:52:02 Notice unbound 23736 [23736:0] notice: init module 1: validator
2023-10-18T10:52:02 Notice unbound 23743 daemonize unbound dhcpd watcher.
2023-10-18T10:52:02 Notice unbound 23736 [23736:0] notice: init module 0: python


Relevant record from the system log:
2023-10-18T10:52:02 Notice kernel <6>pid 23736 (unbound), jid 0, uid 59: exited on signal 11
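
For anyone who wants to check their own box for the same pattern, something like this should find it (the log locations are a guess and may differ between releases):

# search the system log for unbound exiting on a signal
grep -ih 'unbound.*exited on signal' /var/log/system/*.log /var/log/messages 2>/dev/null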

I'm sorry to say the root.hints theory then appears to be a straw man and Unbound crashes for other reasons.


Cheers,
Franco

The behaviour is different.
Unbound just crashed and I was able to simply restart it in the GUI.
Without the test patch Unbound used 100% CPU and had to be killed with 'kill -9' in the CLI.

So something has changed.

October 25, 2023, 07:08:43 AM #83 Last Edit: October 26, 2023, 01:21:37 PM by karlson2k
Quote
https://github.com/opnsense/core/commit/845fbd384fe

# opnsense-patch 845fbd384fe
This patch significantly changed the situation.
Unbound is not crashing anymore, while without this patch Unbound was crashing daily.
I've been testing it for several days with settings chosen to trigger the crash as often as possible (no debug logging, multiple threads).

Probably, without this patch the file is created in parallel with normal Unbound startup.
With this patch the file is always created before Unbound starts.

Only after two weeks of running with # opnsense-patch 845fbd384fe did I get a freeze of Unbound again.

The last lines in the log:
2023-11-01T16:06:07 Notice unbound 17147 [17147:0] notice: init module 0: python
2023-11-01T16:03:33 Informational unbound 15198 [15198:0] info: service stopped (unbound 1.18.0).
2023-11-01T15:53:34 Informational unbound 15198 [15198:1] info: generate keytag query _ta-4f66. NULL IN

Note: at 16:06 I manually killed the unbound process with kill -9.

The system log:
2023-11-01T16:06:07 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : webgui_configure_do(,opt1))
2023-11-01T16:06:07 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : vxlan_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : unbound_configure_do(,opt1))
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : openssh_configure_do(,opt1))
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : opendns_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : ntpd_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : dnsmasq_configure_do())
2023-11-01T16:03:32 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (,opt1)


Observations: Unbound wasn't actually stopped.

Probably the problem is not with the Unbound start, but with the Unbound stop procedure.

OPNsense thinks that Unbound is stopped (as reported in the Unbound log), but Unbound is actually still running. Starting a new Unbound in parallel, while another copy is still running, results in various problems.

The previously reported high CPU load was produced by the Unbound process that was being stopped, not by the freshly started one.
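
If that is what happens, the stop step would need to wait until the old PID is really gone before a new instance is started. A rough sketch of the idea, with the pid file location as an assumption on my part:

# hypothetical wrapper around the stop step
PIDFILE=/var/run/unbound.pid
if [ -f "$PIDFILE" ]; then
    OLDPID=$(cat "$PIDFILE")
    kill "$OLDPID" 2>/dev/null
    # wait up to ~30 seconds for the old process to actually exit
    i=0
    while kill -0 "$OLDPID" 2>/dev/null && [ "$i" -lt 30 ]; do
        sleep 1
        i=$((i + 1))
    done
fi
# only now is it safe to start the new instance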

November 08, 2023, 05:19:26 AM #85 Last Edit: November 08, 2023, 05:24:02 AM by LOTRouter
I've been troubleshooting this issue for months and didn't fully realize it was Unbound until reading this post.  Now when my internet goes down, top shows the unbound process at 100%.  I block DoH, intercept all DNS, and forward all Unbound requests to DNSCrypt-Proxy, which means I have no fallback when Unbound goes stupid.  When this setup works, it works wonderfully.  However, when it stops, all DNS queries on my network go unanswered.  Even with this patch my unbound process would end up at 100% utilization nightly and I had to kill -9 <pid> to get it to recover.

For now I've switched to using DNSCrypt-Proxy natively, and I haven't seen a problem in almost two weeks.  However, I'm willing to go back to Unbound in the name of testing and troubleshooting for any developer who is willing to look into fixing this. I just need hints on what to collect for when it goes stupid.

I do have "Flush DNS Cache during reload" enabled, and I now wonder whether that exacerbated this issue.
Topton 4 x i225-v (Core i5-1135G7 * 32GB * 512SSD)
Xfinity Gigabit (1.2G Down * 200M Up)

Quote from: LOTRouter on November 08, 2023, 05:19:26 AM
I do have "Flush DNS Cache during reload" enabled, and I now wonder whether that exacerbated this issue.
BTW, I have it enabled as well.
Maybe "disabled" just slows down startup enough to mask the issue?

November 10, 2023, 05:35:06 PM #87 Last Edit: November 10, 2023, 05:38:39 PM by zentoo
I'm in the same boat: I spent two days trying to understand what was happening with Unbound on one of our OPNsense servers, since the client servers using it as a DNS resolver regularly got timeouts on DNS requests.

When the problem arose, one CPU was stuck at 100% by the unbound process.
Stopping or restarting Unbound from the web UI froze the web UI. I had to kill -9 the unbound PID in order to start it back up.

I understood later that the Unbound restart was the culprit, because I didn't have this problem before.

To explain in more detail: we have two OPNsense servers as master/slave firewalls, with the master executing a cron job every minute to synchronize the slave configuration using the "HA update and reconfigure backup" command.
A few days ago we updated OPNsense on the slave and used it as temporary master for validation before updating the master one, and for reasons outside this topic we left the slave OPNsense acting as temporary master.
8 hours later the unbound process got stuck using 100% of one CPU and Unbound started to generate timeouts for clients.
The process stayed stuck for hours while I was investigating the issue on the client side, before I understood the problem was the OPNsense unbound process.

I continued to monitor the process after killing and restarting it, and I was surprised to observe that Unbound restarted every minute even though the box is not the master OPNsense. So I understood that Unbound is restarted at each master/slave synchronization (with the Unbound service selected for synchronization), and that leads to the issue where Unbound gets stuck with one CPU at 100% after a while.

It seems there is a kind of race condition when the unbound process restarts.
I don't know enough about how services are managed on OPNsense, but it's possibly a problem with flock and/or PID detection/creation/deletion.
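
If overlapping restarts racing each other is the cause, one way to test the theory is to serialize whatever triggers the restart with FreeBSD's lockf(1); the lock file name here is just an example:

# only one restart at a time; give up after 60 seconds instead of piling up
lockf -t 60 /tmp/unbound_restart.lock pluginctl unbound_start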

I have observed another problem: when the unbound process is stuck and the sync cron tries to restart Unbound, a new mount point can appear: devfs on /var/unbound/dev (devfs).
So after a while you can see the same /var/unbound/dev mount point several times, and that needs to be cleaned up manually.
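
With Unbound stopped, the stale duplicates can be listed and removed like this (a sketch; don't unmount the one a running instance is actually using):

# show how many copies of the mount have piled up
mount -t devfs | grep /var/unbound/dev

# unmount the duplicates one by one
while mount -t devfs | grep -q /var/unbound/dev; do
    umount /var/unbound/dev || break
done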

Looks like it is not a restart problem, but it is a start problem.

The process stuck at 100% CPU is the new Unbound process. Not the old one.

Two things may significantly reduce the chances of startup problems: log level 3 or higher, and keeping "Flush DNS Cache during reload" disabled.
Both seem to just slow down the startup enough to avoid hitting the race condition.

To reproduce:
* Set Unbound log level to 1
* Enable "Flush DNS Cache during reload"
* Run as root: sh -c 'while :; do  pluginctl unbound_start; sleep 20; done'

After a few iterations the startup problem should be triggered.
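
To catch the bad iteration automatically, the loop can be extended with a quick health check; a sketch assuming drill and top from the base system, with an arbitrary query name:

i=0
while :; do
    i=$((i + 1))
    pluginctl unbound_start
    sleep 20
    # if the resolver stops answering, record which iteration broke it
    if ! drill opnsense.org @127.0.0.1 > /dev/null 2>&1; then
        echo "iteration $i: unbound not answering" >&2
        top -b -o cpu | head -n 15
        break
    fi
done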

The problem seems to be OPNsense-only.

I found no reports of similar problems on Linux, plain FreeBSD, pfSense, or anywhere else.
However, it is easy to find several similar reports for OPNsense.