Hi all,
I am using BIND instead of Unbound in most of my deployments. Recently the process seems to become unresponsive for no obvious reason every other day or so.
When I check the state on the firewall it looks like this:
root@opnsense:~ # ps awwux|grep named
root 4974 0.0 0.0 13488 3236 - I 11:09 0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root 15735 0.0 0.0 13488 3244 - I 11:09 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root 35956 0.0 0.0 13488 3236 - I 11:11 0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root 48098 0.0 0.0 13488 3244 - I 11:11 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root 51230 0.0 0.0 13488 3228 - I 11:13 0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root 51746 0.0 0.0 13488 3232 - I 11:15 0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
bind 53253 0.0 0.4 106704 33780 - Ss 20:26 2:06.97 /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf
root 61439 0.0 0.0 13488 3236 - I 11:13 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root 61879 0.0 0.0 13488 3236 - I 11:17 0:00.02 /bin/sh /usr/local/etc/rc.d/named restart
root 62413 0.0 0.0 13488 3240 - I 11:15 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root 74547 0.0 0.0 13488 3244 - I 11:17 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root 17500 0.0 0.0 12720 2388 0 S+ 11:20 0:00.00 grep named
So there are a handful of restart jobs piled up, but the restart is not really happening. The listening ports are gone already (I have BIND listen on 0.0.0.0/0 port 53):
netstat -na|fgrep .53
shows no result. When I truss the process it spends all of its time in nanosleep() calls:
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
nanosleep({ 0.010000000 }) = 0 (0x0)
[...]
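A side note on the port check: `fgrep .53` also matches ports such as 5353 or 53053, so a stricter filter on the local-address column makes the "nothing is listening" result less ambiguous. This is a minimal sketch, assuming the usual FreeBSD `netstat -an` column layout with the local address in the fourth column. The function name `listeners_on_53` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPNsense or BIND): read `netstat -an`
# output on stdin and print only sockets whose LOCAL address ends in port 53.
# Assumed column layout: proto recv-q send-q local-address foreign-address [state]
listeners_on_53() {
    # Keep TCP sockets in LISTEN state and all UDP sockets bound to port 53.
    # "[.:]53$" rejects ports like 5353 or 53053 that a plain "fgrep .53" lets through.
    awk '($1 ~ /^(tcp|udp)/) && ($4 ~ /[.:]53$/) && ($6 == "LISTEN" || $1 ~ /^udp/)'
}

# Usage on the firewall:
#   netstat -an | listeners_on_53
```

Empty output would mean that nothing is bound to port 53 any more, which matches what the plain `fgrep` showed here.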
Does anybody have an idea what might be going on? Which actions on the firewall lead to a BIND restart, anyway?
I'd guess you have a daily cron job for updating blocklists and the script fails for whatever reason?
No blocklists in my BINDs - I chain AdGuard Home for that.
This seems to happen whenever I reboot my switch. When I do this, the lagg interface connected to OPNsense and carrying all my VLANs toggles. When the switch is back up and layer 2 connectivity is restored, this is the situation on OPNsense:
root@opnsense:~ # ps awwux|grep named
root 28282 0.0 0.0 13488 3236 - S 19:40 0:00.01 /bin/sh /usr/local/etc/rc.d/named restart
root 34584 0.0 0.0 13488 3244 - S 19:40 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
root 52578 0.0 0.0 13488 3228 - I 19:38 0:00.01 /bin/sh /usr/local/etc/rc.d/named restart
root 61143 0.0 0.0 13488 3236 - I 19:38 0:00.00 /bin/sh /usr/local/etc/rc.d/named restart
bind 96171 0.0 0.6 161148 46068 - Ss Thu11 10:36.60 /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf
named is unresponsive and the restart processes are "piling up".
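To put a number on the pile-up, the `ps` output can be filtered for the rc.d restart script. This is a minimal sketch. The function name `count_named_restarts` is hypothetical, and it only matches the exact command line seen in the listings above.

```shell
#!/bin/sh
# Hypothetical helper: read `ps awwux` output on stdin and count the
# pending "/bin/sh /usr/local/etc/rc.d/named restart" processes.
count_named_restarts() {
    # The "$" anchor makes sure only lines ending in the restart command
    # are counted; "n + 0" prints 0 instead of an empty line when none match.
    awk '/\/bin\/sh \/usr\/local\/etc\/rc\.d\/named restart$/ { n++ } END { print n + 0 }'
}

# Usage on the firewall:
#   ps awwux | count_named_restarts
```

In the listings above the restart shells appear in pairs with the same start time, so the count is probably about twice the number of restart attempts.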
Maybe I should go back to having BIND listen on 127.0.0.1 only and use NAT port forwarding ...
Quote from: Patrick M. Hausen on April 23, 2024, 07:44:08 PM
Maybe I should go back to having BIND listen on 127.0.0.1 only and use NAT port forwarding ...
Is there a specific reason you don't use Unbound? I don't know if you're using a lot of domains in BIND, but I can recommend Unbound (it has been running stable for me for years) with specific "Query Forwarding" domains pointing to BIND running on localhost port 53053.
I'm using only about 30 domains; there is of course some administrative overhead in defining those forwards.
I have locally maintained zones, so I need BIND, and since I am running AdGuard Home I did not want to bring a third service into the mix.
EDIT: I just reworked all local zones into domain overrides at home. If that proves to be stable, I'll probably pick up your suggestion for the secondary zones I have at work.
Quote from: Patrick M. Hausen on April 23, 2024, 08:21:00 PM
I have locally maintained zones, so I need BIND, and since I am running AdGuard Home I did not want to bring a third service into the mix.
Makes sense :-). How many resolvers does a man need...
Based on your observation, I noticed quite a few restarts of BIND in my logs too, at random intervals (I had never looked at it, to be honest), but all clean, without any restart zombies. Unfortunately I have no clue what the trigger is/was: saving the interface config? Carrier up/down of a directly connected host?