Unbound crashing

Started by seed, August 22, 2023, 08:18:48 AM

Quote from: karlson2k on November 15, 2023, 09:04:12 AM
To reproduce:
* Set Unbound log level to 1
* Enable "Flush DNS Cache during reload"
* Run as root: sh -c 'while :; do  pluginctl unbound_start; sleep 20; done'

After a few iterations the startup problem should be triggered.
It should be easier to reproduce the issue now.
I hope the developers will take a look at it.

On my master/slave opnsense setup with a configuration synchronisation per minute (cron command: HA update and reconfigure backup) I've tried to debug further:

[System: High Availability: Settings] Unbound DNS: selected
=> unbound 100% CPU after a while on slave opnsense with an unbound restart every minute

[System: High Availability: Settings] Unbound DNS: not selected
=> unbound 100% CPU after a while on slave opnsense with an unbound restart every minute

No High Availability synchronisation between master and slave opnsense
=> no unbound restart on slave opnsense so no problem

I tried to understand which High Availability settings cause the restart of Unbound, and in fact there is no dependency logic at all. If any service is selected for synchronisation, Unbound will be restarted at synchronisation time, even if it's not needed.

I have tried to trigger the problem as you did but didn't succeed, even with an Unbound restart every 2s.

So I've explored the Unbound init script to see how it manages the PID file to avoid a duplicate Unbound process.
I didn't find a specific clue, because the PID is managed by the daemon utility.
In my previous debug session I noticed several /var/unbound/dev mount points, and I think this is due to a race condition when several Unbound instances start at the same time.
So I've setup a simple way to check this:

Monitoring:
while true; do echo "$(date) $(stat -x /var/run/unbound.pid | grep Change:) file: $(cat /var/run/unbound.pid) pid: $(pgrep unbound) mount: $(mount | grep -c /var/unbound/dev)"; sleep 0.1 ; done

Trigger (5 parallel starts of Unbound):
pluginctl unbound_start & pluginctl unbound_start & pluginctl unbound_start & pluginctl unbound_start & pluginctl unbound_start &

I've succeeded in getting several /var/unbound/dev mount point instances and eventually got a stuck Unbound at 100% CPU.

IMHO the Unbound problem can be triggered by multiple concurrent restarts too.
So the Unbound start needs to use a lock mechanism or similar to avoid several concurrent starts: launching Unbound can take time, so only checking whether a PID exists before launching the process is not enough.
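Just to illustrate the idea (this is not how OPNsense actually starts Unbound), FreeBSD's lockf(1) could serialize the starts so that a second invocation fails immediately instead of racing the first one; the lock file path here is only a hypothetical example:

lockf -k -t 0 /var/run/unbound_start.lock /usr/local/sbin/pluginctl unbound_start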

After reading this thread I can only say I think I have this issue also. When my OPNsense installation hits 100% unbound on one (of eight) cores, there are multiple unbound processes running. I will see if I can find anything more usable on my side....

Running into the same issue.
Our current workaround is a scheduled reboot each night, hoping that it works around the issue.

We notice that the number of processes rises as soon as the error message pops up.



Is it possible to set `so-reuseport: no` via GUI?

Quote from: zentoo on November 20, 2023, 02:47:22 PM
On my master/slave opnsense setup with a configuration synchronisation per minute (cron command: HA update and reconfigure backup) I've tried to debug further:

Do not do this.
Each config sync will restart the services on the slave firewalls, e.g. an ntp service will never finish its synchronisation and so on.
This will cause more trouble than it is worth.
Increase the interval to at least one hour.

Could someone explain to me what's the huge advantage of the HA DNS setup when it's causing nothing but trouble, while pretty much the same result can be achieved by simply pointing clients to multiple DNS servers? I must certainly be missing something here.

December 27, 2023, 04:53:17 PM #96 Last Edit: December 27, 2023, 04:57:55 PM by zentoo
Quote from: rene_ on December 21, 2023, 08:51:04 AM
Quote from: zentoo on November 20, 2023, 02:47:22 PM
On my master/slave opnsense setup with a configuration synchronisation per minute (cron command: HA update and reconfigure backup) I've tried to debug further:

Do not do this.
Each config sync will restart the services on the slave firewalls, e.g. an ntp service will never finish its synchronisation and so on.
This will cause more trouble than it is worth.
Increase the interval to at least one hour.

I understood that from this Unbound issue and so proceeded to extend the sync interval.

IMHO the design of the configuration synchronization is really not a good one.
It would be smarter to restart only the services whose configuration was modified by the synchronization, as usual operating systems do. It's really a problem for a system that is designed to provide high availability.

At each configuration sync, the master XML file needs to be split per service and compared to the corresponding slave service configuration, so that a service is only restarted if its configuration has been modified.
It shouldn't be so hard to implement.
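A rough, untested sketch of the idea in shell (the paths, the service list and the restart action are hypothetical, and it assumes xmllint and sha256 are available):

#!/bin/sh
# Hypothetical sketch: only act on services whose config section changed
OLD=/conf/config.xml.prev   # hypothetical copy of the previous config
NEW=/conf/config.xml
for svc in unbound ntpd dhcpd; do
    old_sum=$(xmllint --xpath "//${svc}" "$OLD" 2>/dev/null | sha256)
    new_sum=$(xmllint --xpath "//${svc}" "$NEW" 2>/dev/null | sha256)
    if [ "$old_sum" != "$new_sum" ]; then
        echo "config for ${svc} changed, restart it here"   # placeholder action
    fi
done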

Has anyone got a method to get around this, either through a properly scheduled reboot or a patch that works on the latest release?

It has been months of this issue for me; it's suuuper random, so even a daily scheduled reboot doesn't avoid it.

I am considering moving to a UniFi router, as this has made my home network so unreliable. It's really killing the family approval when I get calls etc. while I'm at work that the internet is down.

Well, I still don't know if I have the same issue. Unbound does not like my setup, or, as I think, something at the OS level is giving Unbound crap in return. So maybe it's not an Unbound issue. And as I wrote in my own long thread, I will most likely have to eat this statement sooner or later. I'm fine with that... I just would like to solve this.

I use Monit to figure out when it hits 100% CPU on one core, then I have a script that does a kill -9 and then a restart. It is a band-aid kind of solution - however, it works.

Can you share your Monit setup/script so I can run something similar?

Sure!

I have my Monit scripts in:

/usr/local/opnsense/scripts/OPNsense/Monit

For finding the Unbound process that is, in my case, pegged at 100% CPU, I use this:

#!/bin/csh

# Grab the integer CPU percentage of the unbound process; grep 100 only matches when it is pegged at 100%
set UnboundCPU=`ps auwwx | grep /usr/local/sbin/unbound | grep -v grep | awk '{print $3}' | awk -F. '{print $1}' | grep 100`

# Default to 0 so the exit status is well-defined when unbound is below 100% CPU
if ( "$UnboundCPU" == "" ) set UnboundCPU=0

exit $UnboundCPU


For killing that process when the first Monit script above reacts, I use this:

#!/bin/csh

pgrep "unbound" | grep -v "$$" | xargs kill -9


And the start is normal, so nothing special about that:
/usr/local/sbin/pluginctl -c unbound_start

I added a "Service test" under monit that tests the first scripts return, as "status > 90". The it is just normal setup of the rest under Monit.

Thanks for sharing.
I am currently trying to set this up.

I have created the two scripts in the Monit folder as necessary, with the contents as below.
I added a wait of a couple of seconds between the killing and the starting.

I am unsure how to set up the service test.
Am I executing the first script with the condition of status > 90?


I actually have TWO Monit checks to handle Unbound. It was by mistake, we can say... So the first instance checks if Unbound is running or not, and has stop/start in the setup. The second one has ONLY stop, since I have auto-restart thanks to the first one. They could easily be combined, I guess. Although I think the first one, which does a normal stop, might be good to have around. The second one, which does a kill -9, well, it is the band-aid, so maybe not that important. Anyway, when the second one kills the Unbound process, the first one will start Unbound since it is not running... A strange way to solve this maybe, but it came about by mistake, I would say. It works anyway on my setup.

The thing here is Unbound hitting >90% CPU: the first script returns the CPU usage for the Unbound process, and as long as the return value is below the threshold it does not do anything. When the value goes above 90 it will fire the kill script (action=stop in my case).

Thanks for the replies.

At this time I believe I have the service test set up correctly (I have attached an image).
I am just unsure how to set up the Monit service settings entry correctly. TBH I have never really understood how to set up Monit properly. Do you have a screenshot of your Monit service settings entry for the unbound killer etc. so I can set mine up correctly?

January 14, 2024, 01:33:17 PM #104 Last Edit: January 14, 2024, 01:36:08 PM by lar.hed
Sorry for the somewhat late reply. I think you have entered the correct information, no worries there.

And here is the attachment for the service. Do note that the word "stop" is there since the service requires (!) an argument - however it is not used. I guess I should rewrite the csh script to take an argument so it gets a bit more flexible...
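Something along these lines would probably do it (untested, just folding the two existing scripts into one that reads the argument Monit passes):

#!/bin/csh
# untested sketch: act on the start/stop argument Monit passes
if ( $#argv < 1 ) exit 1
if ( "$argv[1]" == "stop" ) then
    # same kill as the stand-alone script, excluding this script's own PID
    pgrep "unbound" | grep -v "$$" | xargs kill -9
else if ( "$argv[1]" == "start" ) then
    /usr/local/sbin/pluginctl -c unbound_start
endif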