BIND (named) hanging in unresponsive state

Started by Patrick M. Hausen, March 25, 2024, 11:30:53 AM

Hi all,

I am using BIND instead of Unbound in most of my deployments. Recently the process seems to become unresponsive for no obvious reason every other day or so.

When I check the state on the firewall it looks like this:
root@opnsense:~ # ps awwux|grep named
root     4974   0.0  0.0   13488   3236  -  I    11:09       0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root    15735   0.0  0.0   13488   3244  -  I    11:09       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root    35956   0.0  0.0   13488   3236  -  I    11:11       0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root    48098   0.0  0.0   13488   3244  -  I    11:11       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root    51230   0.0  0.0   13488   3228  -  I    11:13       0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root    51746   0.0  0.0   13488   3232  -  I    11:15       0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
bind    53253   0.0  0.4  106704  33780  -  Ss   20:26       2:06.97 /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf
root    61439   0.0  0.0   13488   3236  -  I    11:13       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root    61879   0.0  0.0   13488   3236  -  I    11:17       0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root    62413   0.0  0.0   13488   3240  -  I    11:15       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root    74547   0.0  0.0   13488   3244  -  I    11:17       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root    17500   0.0  0.0   12720   2388  0  S+   11:20       0:00.00 grep named

So there are a handful of restart jobs piled up, but the restart is not really happening. The listening ports are gone already (I have BIND listen on 0.0.0.0/0 port 53):
netstat -na|fgrep .53
shows no result. When I truss the process it spends all of its time in nanosleep() calls:
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
[...]
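
To keep an eye on this without reading the process list by hand, a minimal sketch follows. It counts the piled-up restart wrappers in `ps awwux` output read from stdin. The function name `count_named_restarts` is made up for illustration, and the pattern assumes the rc script path shown in the listing above.

```shell
#!/bin/sh
# Count "rc.d/named restart" wrapper processes in `ps awwux` output on stdin.
# Hypothetical helper; the path is taken from the process listing in this thread.
count_named_restarts() {
    awk '/\/usr\/local\/etc\/rc\.d\/named restart/ { n++ } END { print n + 0 }'
}

# Typical use on the firewall (not executed here):
#   ps awwux | count_named_restarts
```

A count that stays above zero over several minutes would match the stuck state described here, since a healthy restart should finish and its wrapper should disappear.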


Does anybody have an idea what might be going on? Which actions on the firewall lead to a BIND restart, anyway?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I'd guess you have a daily cronjob for updating Blocklists and the script fails for whatever reason?

No blocklists in my BINDs - I chain AdGuard Home for that.

This seems to happen whenever I reboot my switch. When I do this, the lagg interface connected to OPNsense and carrying all my VLANs toggles. Once the switch is back up and layer 2 connectivity is restored, this is the situation on OPNsense:
root@opnsense:~ # ps awwux|grep named
root    28282   0.0  0.0   13488   3236  -  S    19:40       0:00.01 /bin/sh /usr/local/etc/rc.d/named restart
root    34584   0.0  0.0   13488   3244  -  S    19:40       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root    52578   0.0  0.0   13488   3228  -  I    19:38       0:00.01 /bin/sh /usr/local/etc/rc.d/named restart
root    61143   0.0  0.0   13488   3236  -  I    19:38       0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
bind    96171   0.0  0.6  161148  46068  -  Ss   Thu11      10:36.60 /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf


named is unresponsive and the restart processes are "piling up".
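
A quick way to confirm the "unresponsive" part after such a link flap is to send one short query and look only at the exit status. This is a rough sketch: `named_answers` is a hypothetical helper, the default probe (a `dig` query against 127.0.0.1 with a 2 second timeout and a single try) is an assumption about the setup, and it relies on `dig` exiting non-zero when it gets no reply, which is how it behaves as far as I know.

```shell
#!/bin/sh
# Print "responsive" if the probe command succeeds, "unresponsive" otherwise.
# The probe is passed as arguments so it can be swapped out; the default
# below is an assumption, not something taken from this thread.
named_answers() {
    if [ "$#" -eq 0 ]; then
        set -- dig +time=2 +tries=1 @127.0.0.1 localhost A
    fi
    if "$@" >/dev/null 2>&1; then
        echo "responsive"
    else
        echo "unresponsive"
    fi
}

# Typical use on the firewall (not executed here):
#   named_answers
#   named_answers dig +time=2 +tries=1 @192.0.2.1 example.com A
```

Run right after the switch comes back, this would show whether named stops answering at the moment the lagg toggles or only once the restart jobs start to queue up.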

Perhaps I should go back to binding to 127.0.0.1 only and use NAT port forwarding ...

Quote from: Patrick M. Hausen on April 23, 2024, 07:44:08 PM
Perhaps I should go back to binding to 127.0.0.1 only and use NAT port forwarding ...

Is there a specific reason you don't use Unbound? I don't know whether you serve a lot of domains in BIND, but I can recommend Unbound with specific "Query Forwarding" domains pointing to BIND running on localhost port 53053. That setup has been running stable for me for years.

I'm only using about 30 domains; there is of course some administrative overhead in defining those forwards.
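
For anyone maintaining such forwards in a plain unbound.conf rather than through the OPNsense GUI, the overhead can be reduced by generating the stanzas from a list of domains. The sketch below is only an illustration: `gen_forward_zones` is a made-up helper, the default target 127.0.0.1@53053 mirrors the port mentioned above, and the output uses Unbound's `forward-zone` / `name` / `forward-addr` syntax. Unbound also needs `do-not-query-localhost: no` in its `server:` section before it will forward to a localhost address; whether the GUI sets that for you is something to verify.

```shell
#!/bin/sh
# Print one Unbound forward-zone stanza per domain read from stdin.
# $1 is the forward target as ADDR@PORT (default is an assumption).
# Blank lines and lines starting with '#' are skipped.
gen_forward_zones() {
    target=${1:-127.0.0.1@53053}
    while IFS= read -r domain || [ -n "$domain" ]; do
        case $domain in
            ''|'#'*) continue ;;
        esac
        printf 'forward-zone:\n    name: "%s"\n    forward-addr: %s\n' \
            "$domain" "$target"
    done
}

# Typical use (not executed here; domains.txt is a hypothetical file):
#   gen_forward_zones < domains.txt > forwards.conf
```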

April 23, 2024, 08:21:00 PM #5 Last Edit: April 23, 2024, 10:25:53 PM by Patrick M. Hausen
I have locally maintained zones, so I need BIND, and since I am already running AdGuard Home I did not want to bring a third service into the mix.

EDIT: I just reworked all local zones into domain overrides at home. If that proves to be stable, I'll probably pick up your suggestion for the secondary zones I have at work.

Quote from: Patrick M. Hausen on April 23, 2024, 08:21:00 PM
I have locally maintained zones, so I need BIND, and since I am already running AdGuard Home I did not want to bring a third service into the mix.

Makes sense :-). How many resolvers does a man need...

Prompted by your observation, I checked my logs and also see quite a few BIND restarts at random intervals (I had never looked at this before, to be honest), but all of them are clean, with no leftover restart processes. Unfortunately I have no clue what the trigger is: saving an interface configuration? Carrier up/down on a directly connected host?