Messages - zentoo

#1
23.7 Legacy Series / Re: Unbound crashing
December 27, 2023, 04:53:17 PM
Quote from: rene_ on December 21, 2023, 08:51:04 AM
Quote from: zentoo on November 20, 2023, 02:47:22 PM
On my master/slave OPNsense setup with a configuration synchronisation every minute (cron command: HA update and reconfigure backup), I tried to debug further:

Do not do this.
Each config sync will restart the services on the slave firewalls, e.g. an ntp service will never finish its synchronisation and so on.
This will cause more trouble than it is worth.
Increase the interval to at least one hour.

That is what I concluded from this unbound issue, so I have extended the sync interval.

IMHO the design of the configuration synchronization is not a good one.
It would be smarter to restart only the services whose configuration was actually modified by the synchronization, as usual operating systems do. Restarting everything is a real problem for a system that is designed to provide high availability.

At each configuration sync, the master XML file would need to be split per service and each part compared with the corresponding part of the slave configuration, so that a service is restarted only if its configuration has changed.
It shouldn't be so hard to implement.
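
To make the idea concrete, here is a minimal sketch of the comparison step only. It assumes the per-service split has already been done and that each service's configuration sits in its own file named <service>.conf; that layout and the function name changed_services are hypothetical and are not how OPNsense stores or syncs its configuration.

```shell
#!/bin/sh
# Sketch only: decide which services would need a restart after a config sync.
# Assumption: each service's configuration has already been extracted into its
# own file (<dir>/<service>.conf). That split step is hypothetical.

# Print the names of services whose fragment differs between two directories.
changed_services() {
    old_dir=$1
    new_dir=$2
    for new_file in "$new_dir"/*.conf; do
        [ -e "$new_file" ] || continue   # glob matched nothing
        name=$(basename "$new_file" .conf)
        # cmp -s prints nothing and returns non-zero when the files differ
        # or when the old fragment does not exist.
        if ! cmp -s "$old_dir/$name.conf" "$new_file" 2>/dev/null; then
            echo "$name"
        fi
    done
}
```

A sync job could then loop over the output, for example `for s in $(changed_services /tmp/old /tmp/new); do echo "would restart $s"; done`, and leave every other service alone.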
#2
23.7 Legacy Series / Re: Unbound crashing
November 20, 2023, 02:47:22 PM
On my master/slave OPNsense setup with a configuration synchronisation every minute (cron command: HA update and reconfigure backup), I tried to debug further:

[System: High Availability: Settings] Unbound DNS: selected
=> unbound 100% CPU after a while on slave opnsense with an unbound restart every minute

[System: High Availability: Settings] Unbound DNS: not selected
=> unbound 100% CPU after a while on slave opnsense with an unbound restart every minute

No High Availability synchronisation between master and slave opnsense
=> no unbound restart on slave opnsense so no problem

I tried to understand which High Availability settings trigger the unbound restart, and in fact there is no dependency logic at all. If any service is selected for synchronisation, unbound is restarted at synchronisation time, even when that is not needed.

I tried to trigger the problem the way you did, but did not succeed, even with an unbound restart every 2 s.

So I explored the unbound init script to see how it manages the PID file to avoid a second unbound process.
I didn't find any specific clue, because the PID is managed by the daemon utility.
In my previous debug session I noticed several /var/unbound/dev mount points, and I think they are due to a race condition between several unbound instances starting at the same time.
So I set up a simple way to check this:

Monitoring:
while true; do echo "$(date) $(stat -x /var/run/unbound.pid | grep Change:) file: $(cat /var/run/unbound.pid) pid: $(pgrep unbound) mount: $(mount | grep -c /var/unbound/dev)"; sleep 0.1 ; done

Trigger (5 parallel starts of unbound):
pluginctl unbound_start & pluginctl unbound_start & pluginctl unbound_start & pluginctl unbound_start & pluginctl unbound_start &

I succeeded in getting several /var/unbound/dev mount point instances and eventually ended up with a stuck unbound at 100% CPU.

IMHO the unbound problem can also be triggered by multiple concurrent restarts.
So the unbound start needs a lock mechanism or similar to prevent several simultaneous starts: launching unbound can take time, so only checking whether the PID exists before launching the process is not enough.
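
As an illustration of the kind of lock meant here, the following is a minimal sketch built on an atomic mkdir. The function name run_exclusive, the lock path and the exit code 75 are my own choices for the example. This is not how OPNsense starts unbound, and a real fix would also have to handle a stale lock left behind by a killed process.

```shell
#!/bin/sh
# Sketch only: serialize a start command with an atomic mkdir lock.
# Assumption: the lock path and the wrapped command are illustrative.

# run_exclusive LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created. Returns 75 when another
# caller already holds the lock, otherwise the exit status of COMMAND.
run_exclusive() {
    lockdir=$1
    shift
    # mkdir either creates the directory or fails, so when several callers
    # race, exactly one of them gets the lock.
    if ! mkdir "$lockdir" 2>/dev/null; then
        return 75
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}
```

With this wrapper, the trigger above would become five copies of `run_exclusive /var/run/unbound_start.lock pluginctl unbound_start &`, and only the first caller to create the lock directory would actually start unbound while it holds the lock. FreeBSD also ships the lockf(1) utility, which may be a better fit than a hand-written lock.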
#3
23.7 Legacy Series / Re: Unbound crashing
November 10, 2023, 05:35:06 PM
I'm in the same boat: I spent two days trying to understand what was happening with unbound on one of our OPNsense servers, because the client servers using it as their DNS resolver regularly got timeouts on DNS requests.

When the problem arose, one CPU was stuck at 100% by the unbound process.
Stopping or restarting unbound from the Web UI froze the Web UI. I needed to kill -9 the unbound PID in order to start it again.

I understood later that the unbound restart was the culprit, because I didn't have this problem before.

In more detail: we have two OPNsense servers as master/slave firewalls, with the master running a cron job every minute to synchronize the slave configuration using the command HA update and reconfigure backup.
A few days ago we updated OPNsense on the slave, then used it as temporary master for validation before updating the master, and for unrelated reasons we left the slave OPNsense in use as temporary master.
8 hours later the unbound process got stuck using 100% of one CPU and unbound started to generate timeouts for clients.
The process stayed stuck for hours while I was investigating the issue on the client side, before I understood that the problem was the OPNsense unbound process.

I continued to monitor the process after killing and restarting it, and I was surprised to observe that unbound restarted every minute, which it does not do on the master OPNsense. So I understood that unbound is restarted at each master/slave sync (with the unbound service selected for sync) and that this leads to the issue where unbound gets stuck with one CPU at 100% after a while.

It seems that there is some kind of race condition when the unbound process restarts.
I don't know enough about how services are managed on OPNsense, but it's possibly a problem with flock and/or PID detection/creation/deletion.

I have observed another problem: when the unbound process is stuck and the sync cron tries to restart unbound, a new mount point can appear: devfs on /var/unbound/dev (devfs).
So after a while you can see the same /var/unbound/dev mount point several times, and these need to be cleaned up manually.
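
For the manual cleanup, a small sketch like the one below can at least count the stacked mounts and print the umount commands that would leave a single one in place. The function names count_mounts and plan_cleanup are hypothetical, and it assumes mount prints lines of the form "devfs on /var/unbound/dev (devfs)" as quoted above. It deliberately only prints the commands, because I have not checked whether unmounting is safe while unbound is running.

```shell
#!/bin/sh
# Sketch only: detect stacked devfs mounts on one mount point.
# Assumption: mount output lines look like "devfs on /var/unbound/dev (devfs)".

# Count the mount lines on stdin that refer to the given mount point.
count_mounts() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -cF " on $1 (" || true
}

# Print the umount commands that would leave one mount in place.
# Dry run on purpose: review the output before running anything.
plan_cleanup() {
    mountpoint=$1
    count=$2
    while [ "$count" -gt 1 ]; do
        echo "umount $mountpoint"
        count=$((count - 1))
    done
}
```

Typical use would be `plan_cleanup /var/unbound/dev "$(mount | count_mounts /var/unbound/dev)"`, which prints nothing when the mount point appears only once.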