Unbound crashing whenever a link fails

Started by cpw, April 30, 2020, 09:32:45 PM

Previous topic - Next topic
Hi
I seem to be getting a dead unbound whenever a link changes state for some reason.


2020-04-30T15:17:41 configd.py: message c2a300b6-546a-43de-a439-881c67fc3ff3 [filter.refresh_aliases] returned {"status": "ok"}
2020-04-30T15:17:41 configd.py: [c2a300b6-546a-43de-a439-881c67fc3ff3] refresh url table aliases
2020-04-30T15:17:41 configd.py: OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2020-04-30T15:17:41 configd.py: OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2020-04-30T15:17:41 configd.py: generate template container OPNsense/Filter
2020-04-30T15:17:41 configd.py: [5d9a0fcc-1279-41d5-8c95-71604ef75565] generate template OPNsense/Filter
2020-04-30T15:17:40 configd.py: [9065aa66-743e-482a-b6a6-11a95e9bbc2f] Reloading filter
2020-04-30T15:17:39 configd.py: message 9f2df814-976c-48cc-be41-1d261e909ba5 [filter.refresh_aliases] returned {"status": "ok"}
2020-04-30T15:17:39 configd.py: message ac0718e4-092c-4bcc-97f7-28c30788ef86 [unbound.start] returned Error (1)
2020-04-30T15:17:39 configd.py: [ac0718e4-092c-4bcc-97f7-28c30788ef86] returned exit status 1
2020-04-30T15:17:39 configd.py: [ac0718e4-092c-4bcc-97f7-28c30788ef86] Start Unbound
2020-04-30T15:17:39 configd.py: OPNsense/Unbound/* generated //var/unbound/root.hints
2020-04-30T15:17:39 configd.py: generate template container OPNsense/Unbound/core
2020-04-30T15:17:39 configd.py: [da00957b-616a-45ff-87f2-35753f0d05a7] generate template OPNsense/Unbound/*
2020-04-30T15:17:39 configd.py: [9f2df814-976c-48cc-be41-1d261e909ba5] refresh url table aliases
2020-04-30T15:17:39 configd.py: OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2020-04-30T15:17:39 configd.py: OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2020-04-30T15:17:39 configd.py: generate template container OPNsense/Filter
2020-04-30T15:17:39 configd.py: [38bb817d-71e3-497d-bb5b-bf3c565f2d2a] generate template OPNsense/Filter
2020-04-30T15:17:37 configd.py: [07b6bf92-a183-41c7-bc36-6a917f9caa57] New IPv6 on pppoe0
2020-04-30T15:17:37 configd.py: [813cf8ea-6491-45d0-bcd7-ee5dbf9557ab] New IPv4 on pppoe0


It has to be manually restarted, which is less than desireable. This has only started happening since updating to 20.1.5.

Hi,
our Unbound DNS crashes only on the Backup-Node in our Cluster as soon as the Master has performed the HA-sync with the Backup-Node. We have the problem since updating to 20.1.6 (We went from 20.1.4 straight to 20.1.6).

Unbound DNS-Log:
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:47 unbound: [67336:0] info: server stats for thread 0: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:47 unbound: [67336:0] info: service stopped (unbound 1.10.0).
2020-05-05T13:34:44 unbound: [67336:0] info: start of service (unbound 1.10.0).
2020-05-05T13:34:44 unbound: [67336:0] notice: init module 1: iterator
2020-05-05T13:34:44 unbound: [67336:0] notice: init module 0: validator
2020-05-05T13:34:44 unbound: [67336:0] notice: Restart of unbound 1.10.0.
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
2020-05-05T13:34:44 unbound: [67336:0] info: server stats for thread 0: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2020-05-05T13:34:44 unbound: [67336:0] info: service stopped (unbound 1.10.0).


Quote from: mimugmail on May 06, 2020, 07:59:51 AM
Do you use DoT or Blacklist from Unboudplus plugin?

We do not use any additional plugins for unbound, only the basic dns service.


Quote from: mimugmail on May 06, 2020, 09:55:05 AM
opnsense-revert -r 20.1.4 unbound

Does this fix it?

No, unfortunately not.

May 06, 2020, 02:49:41 PM #6 Last Edit: May 06, 2020, 03:09:30 PM by agh1701
I too have been having problems.  I have 3 outgoing VPN's and decided to switch from dnscrypt to unbound plus Dot.  I found if I booted unbound had trouble starting and would not stay up.  I did not check the logs.  I went back to my dnscrypt configuration all is well.

After reading this, I thought I would try Dot again with unbound outgoing interfaces set to WAN.  Opnsense boots fine. I do not know what will happen if the wan interface goes down.

Mimugmail, thanks for the great plugins dnscrypt , unbound plus and all you do!

Is there any news already? In version 20.1.7 it's the same problem.

May 26, 2020, 04:24:16 PM #8 Last Edit: May 26, 2020, 04:35:47 PM by cpw
Nothing from me, it's still crashing repeatedly, in concert with a network connection failing.

I have setup monit jobs to kick it awake again when it dies, which is _mostly_ working, but it might mean you lose dns for a minute or two.

Update: it just happened again a few minutes ago and this was in my system "general" log file:


2020-05-26T10:01:53 monit[27368]: 'Unbound' process is running with pid 22616
2020-05-26T10:01:37 kernel: cannot forward from fe80::217:10ff:fe93:fd18 to ff02::1 nxt 0 received on igb2
2020-05-26T10:01:22 monit[27368]: 'Unbound' start: '/usr/local/sbin/pluginctl -s unbound start'
2020-05-26T10:01:22 monit[27368]: 'Unbound' trying to restart
2020-05-26T10:01:22 monit[27368]: 'Unbound' process is not running
2020-05-26T10:00:22 monit[27368]: 'Unbound' process is running with pid 41994
2020-05-26T09:59:50 monit[27368]: 'Unbound' start: '/usr/local/sbin/pluginctl -s unbound start'
2020-05-26T09:59:50 monit[27368]: 'Unbound' trying to restart
2020-05-26T09:59:50 monit[27368]: 'Unbound' process is not running
2020-05-26T09:59:47 kernel: cannot forward from 2607:f2c0:eb8a:500:f874:d3ff:fe9b:a952 to ff02::fb nxt 17 received on lagg0_vlan20
2020-05-26T09:59:32 kernel: cannot forward from fe80::217:10ff:fe93:fd18 to ff02::1 nxt 0 received on igb2
2020-05-26T09:59:04 /update_tables.py: unable to resolve cheesypuffs for alias cpwdesktop
2020-05-26T09:58:50 monit[27368]: 'Unbound' process is running with pid 33254
2020-05-26T09:58:19 monit[27368]: 'Unbound' start: '/usr/local/sbin/pluginctl -s unbound start'
2020-05-26T09:58:19 monit[27368]: 'Unbound' trying to restart
2020-05-26T09:58:19 monit[27368]: 'Unbound' process is not running
2020-05-26T09:57:37 sshd[60930]: Disconnected from 222.186.42.137 port 32185 [preauth]
2020-05-26T09:57:37 sshd[60930]: Received disconnect from 222.186.42.137 port 32185:11: [preauth]
2020-05-26T09:57:27 kernel: cannot forward from fe80::217:10ff:fe93:fd18 to ff02::1 nxt 0 received on igb2
2020-05-26T09:56:44 sshd[98123]: Disconnected from 222.186.15.115 port 21348 [preauth]
2020-05-26T09:56:44 sshd[98123]: Received disconnect from 222.186.15.115 port 21348:11: [preauth]
2020-05-26T09:56:41 kernel: cannot forward from 2607:f2c0:eb8a:500:2d8:61ff:fe77:13d3 to ff02::fb nxt 17 received on lagg0_vlan20
2020-05-26T09:56:39 kernel: pflog0: promiscuous mode enabled
2020-05-26T09:56:39 kernel: pflog0: promiscuous mode disabled
2020-05-26T09:56:38 opnsense: plugins_configure newwanip (execute task : webgui_configure_do(,opt2))
2020-05-26T09:56:38 opnsense: plugins_configure newwanip (execute task : vxlan_configure_interface())
2020-05-26T09:56:36 opnsense: plugins_configure newwanip (execute task : unbound_configure_do(,opt2))
2020-05-26T09:56:36 opnsense: plugins_configure newwanip (execute task : rfc2136_configure_do(,opt2))
2020-05-26T09:56:36 opnsense: plugins_configure newwanip (execute task : openssh_configure_do(,opt2))
2020-05-26T09:56:36 opnsense: plugins_configure newwanip (execute task : opendns_configure_do())
2020-05-26T09:56:36 opnsense: plugins_configure newwanip (execute task : ntpd_configure_defer())
2020-05-26T09:56:36 opnsense: plugins_configure newwanip (,opt2)
2020-05-26T09:56:36 opnsense: /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface DSL.
2020-05-26T09:56:36 opnsense: plugins_configure vpn (execute task : openvpn_configure_do(,opt2))
2020-05-26T09:56:36 opnsense: plugins_configure vpn (execute task : ipsec_configure_do(,opt2))
2020-05-26T09:56:36 kernel: pflog0: promiscuous mode enabled
2020-05-26T09:56:36 kernel: pflog0: promiscuous mode disabled
2020-05-26T09:56:36 opnsense: plugins_configure vpn (,opt2)
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: The VPN_GW monitor address is empty, skipping.
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: The DSL_GW monitor address is empty, skipping.
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: Adding static route for monitor 2001:4860:4860::8844 via fe80::217:10ff:fe93:fd18%igb2
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: Removing static route for monitor 2001:4860:4860::8844 via fe80::217:10ff:fe93:fd18%igb2
2020-05-26T09:56:35 opnsense: plugins_configure monitor (execute task : dpinger_configure_do())
2020-05-26T09:56:35 opnsense: plugins_configure monitor ()
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: ROUTING: IPv6 default gateway set to opt1
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to opt1
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt2'
2020-05-26T09:56:35 opnsense: plugins_configure hosts (execute task : unbound_hosts_generate())
2020-05-26T09:56:35 opnsense: plugins_configure hosts (execute task : dnsmasq_hosts_generate())
2020-05-26T09:56:35 opnsense: plugins_configure hosts ()
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: On (IP address: 69.165.220.173) (interface: DSL[opt2]) (real interface: pppoe0).
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'pppoe0'
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanipv6: Failed to detect IP for DSL[opt2]
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanipv6: On (IP address: ) (interface: DSL[opt2]) (real interface: pppoe0).
2020-05-26T09:56:35 opnsense: /usr/local/etc/rc.newwanipv6: IPv6 renewal is starting on 'pppoe0'


My interpretation is that somehow the ipv6 address tried to renew on pppoe0, which reset all of the dpinger stuff as well as the unbound configuration, but unbound saw an invalid state for some time (I'm guessing pppoe hadn't fully restarted yet, after whatever happened to it, so unbound needed a minute to become "valid" again).

Confirming from the PPPOE log files, that at 09:56:XX the PPPOE connection recycled.

I opened an issue on github https://github.com/opnsense/core/issues/4106 where mimugmail suggested to try and let the unbound service listen on all interfaces.

Now the service doesn't crash anymore but it may not be the best solution if you don't want to have the unbound listening on for example the WAN interface.

Quote from: bitmusician on June 15, 2020, 12:20:27 PM
Now the service doesn't crash anymore but it may not be the best solution if you don't want to have the unbound listening on for example the WAN interface.

Set a Block rule for UDP 53 in your firewall for any interfaces you don't unbound listening to.