Unbound DoT unable to recover from connection loss

CJ · April 18, 2023, 05:06:30 PM

I have Unbound configured to use Quad9 over DoT. My ISP connection died, and when it came back up I kept getting SERVFAIL errors until I restarted Unbound. SERVFAIL responses don't get cached, and when I queried the DNS servers using the DNS Lookup page, I got valid results.

It seemed like DoT wasn't able to connect to the specified servers, but I can't find anything in the logs about this. All I have are a bunch of SERVFAIL errors that say "all the configured stub or forward servers failed, at zone . no server to query nameserver addresses not usable have no nameserver names"

I'm not familiar enough with the way DoT works to be able to speculate why Unbound wouldn't have been able to reconnect until I restarted it. Any suggestions for what to look for?

skavoovie · May 05, 2023, 07:12:57 PM

I experienced the same issue w/ Unbound, but on dedicated instances not running on my OPNsense firewall -- same root cause and solution.

I originally found the fix detailed in this post on the official Unbound project mailing list archive:

https://lists.nlnetlabs.nl/pipermail/unbound-users/2011-January/001608.html

THE FIX:

Reduce the infra-host-ttl value (in seconds) in Unbound's config file (unbound.conf) to a lower value that meets your needs. The default value (15 min / 900 sec) was way too high for my needs.

For example, to have Unbound recheck for restored connectivity every minute during an upstream network outage:

infra-host-ttl: 60

(I think the manpage could be clearer on this):

       infra-host-ttl: <seconds>
              Time to live for entries in the host cache. The host cache contains roundtrip timing, lameness and EDNS support information. Default is 900.
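
For context, here is a rough sketch of where that setting sits in an unbound.conf that forwards to Quad9 over DoT. This is not either poster's actual configuration: the forwarder addresses, the certificate bundle path, and the choice of 60 seconds are assumptions for illustration, so adjust them for your system.

```
server:
    # Assumed CA bundle location; the path differs between operating systems.
    tls-cert-bundle: "/etc/ssl/cert.pem"

    # Expire host cache entries (including "server is down" state) after
    # 60 seconds instead of the default 900.
    infra-host-ttl: 60

forward-zone:
    name: "."
    # Send all forwarded queries over TLS (DoT).
    forward-tls-upstream: yes
    # Quad9 resolvers on port 853, with the hostname used for certificate checks.
    forward-addr: 9.9.9.9@853#dns.quad9.net
    forward-addr: 149.112.112.112@853#dns.quad9.net
```

A lower value means Unbound retries a failed upstream sooner after an outage, at the cost of re-learning timing and EDNS information more often.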

CJ · May 08, 2023, 06:15:56 PM

Interesting. I had not realized that the host cache was for DNS servers and not DNS results.

The link inside the post you linked isn't valid anymore, but searching for "timeout" in the documentation returns this: https://unbound.docs.nlnetlabs.nl/en/latest/reference/history/info-timeout-server-selection.html

I'm not sure how OPNsense handles this. That page describes a blocking regime that would cause what I observed, but I do not have the "keep probing down hosts" option checked. Based on the help text, I would have assumed that leaving it unchecked meant the blocking regime wouldn't be used.
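
For reference, unbound.conf has an option named infra-keep-probing, which defaults to no. With the default, a host that has been marked down may not be probed again until its infra-host-ttl expires. That the OPNsense checkbox maps to this option is an assumption here, not something confirmed in this thread. A sketch of enabling it by hand:

```
server:
    # Keep probing hosts that are marked down, one probe at a time,
    # rather than waiting for infra-host-ttl to expire. Default is no.
    infra-keep-probing: yes
```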

The 15m cache TTL would also explain why I didn't run into any issues when I had to reboot my modem. Since it was only down for a minute or so, it never got marked as down.

Looks like I need to move my Unbound optimization research higher up my todo list. That said, I do wish OPNsense had an easy way to see what its defaults are so that I can easily revert any changes.
