Unbound DoT unable to recover from connection loss

Started by CJ, April 18, 2023, 05:06:30 PM

I have Unbound configured to use Quad9 DoT.  My ISP connection died, and when it came back up I kept getting SERVFAIL errors until I restarted Unbound.  SERVFAIL responses don't get cached, and when I queried the DNS servers using the DNS Lookup page, I was able to get valid results.

It seemed like DoT wasn't able to connect to the specified servers, but I can't find anything in the logs about this.  All I have are a bunch of SERVFAIL errors that say "all the configured stub or forward servers failed, at zone . no server to query nameserver addresses not usable have no nameserver names"

I'm not familiar enough with the way DoT works to be able to speculate why Unbound wouldn't have been able to reconnect until I restarted it.  Any suggestions for what to look for?

I experienced the same issue w/ Unbound, but on dedicated instances not running on my OPNsense firewall -- same root cause and solution.


I originally found the fix detailed in this post on the official Unbound project mailing list archive:

https://lists.nlnetlabs.nl/pipermail/unbound-users/2011-January/001608.html


THE FIX:

Reduce the infra-host-ttl value (in seconds) in Unbound's config file (unbound.conf) to a lower value that meets your needs. The default value (15 min / 900 sec) was way too high for my needs.

For example, to have Unbound recheck for restored connectivity every minute during an upstream network outage:


infra-host-ttl: 60


(I think the manpage could be clearer on this):


       infra-host-ttl: <seconds>
              Time to live for entries in the host cache. The host cache contains roundtrip timing, lameness and EDNS support
              information. Default is 900.
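
For context, here is a minimal sketch of where this setting would sit in unbound.conf next to a Quad9 DoT forwarder. The forward-zone block and the certificate bundle path are assumptions for illustration, not anyone's actual config from this thread; adjust them to your system.

```
server:
    # Expire host cache entries (including "down" markers) after 60 s (default: 900)
    infra-host-ttl: 60
    # Assumed CA bundle path; it varies by OS
    tls-cert-bundle: "/etc/ssl/certs/ca-certificates.crt"

forward-zone:
    name: "."
    forward-tls-upstream: yes
    # Quad9 over DoT, written as address@port#TLS-auth-name
    forward-addr: 9.9.9.9@853#dns.quad9.net
    forward-addr: 149.112.112.112@853#dns.quad9.net
```

A lower value means upstream servers are re-tested sooner after an outage, at the cost of discarding their cached timing and EDNS information more often.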

Interesting.  I had not realized that the host cache was for DNS servers and not DNS results.

The link in your link isn't valid anymore, but searching for "timeout" in the documentation returns this:  https://unbound.docs.nlnetlabs.nl/en/latest/reference/history/info-timeout-server-selection.html

I'm not sure how OPNsense handles this.  That link mentions a blocking regime that would cause what I observed, but I do not have the "keep probing down hosts" option checked.  Based on the help text, I would have assumed that leaving it unchecked would mean the blocking regime wouldn't be used.
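
Assuming that checkbox corresponds to Unbound's infra-keep-probing option (not verified against the config OPNsense generates), the raw unbound.conf equivalent would look like the sketch below. Per the unbound.conf manpage, the default is no, in which case a host marked as down may not be probed again until infra-host-ttl has passed.

```
server:
    # Keep probing hosts that are marked as down (default: no)
    infra-keep-probing: yes
```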

The 15m cache TTL would also explain why I didn't run into any issues when I had to reboot my modem.  Since it was only down for a minute or so, it never got marked as down.

Looks like I need to move my Unbound optimization research higher up my todo list.  That said, I do wish OPNsense had an easy way to see what its defaults are so that I can easily revert any changes.