Unbound does not resolve queries after reboot (broken default GW selection)

Started by klosz007, January 28, 2023, 10:10:45 PM

Hi,

I have been dealing with this issue for some time with no luck. It first appeared for me in 22.7.10 (I believe). After each reboot it took significant time (say 2 minutes) for Unbound to accept UDP connections. During that time OPNsense was not accepting Web interface or SSH connections. If I just waited out those 2 minutes, Unbound eventually started and everything worked fine afterwards.

It got worse in 22.7.11 though.
I was hoping that 23.1 would resolve it, but I upgraded to 23.1 today and the same issue is still there.

Now it not only takes 2 minutes for Unbound to accept UDP connections (and for the Web GUI and SSH to start working), but after that time all DNS queries fail with a SERVFAIL message. There's nothing really informative in the Unbound log (/var/log/resolver/latest.log).

When I increased logging to level 2, each query in the log looks like this:

<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="701"] [25609:3] info: resolving mtalk.google.com. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="702"] [25609:3] info: priming . IN NS
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="703"] [25609:0] info: resolving mtalk.google.com. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="704"] [25609:0] info: priming . IN NS
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="705"] [25609:3] info: resolving mtalk.google.com.localdomain. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="706"] [25609:3] info: priming . IN NS


So basically every attempt to resolve a name is followed by a "priming . IN NS" message.
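If anyone wants to check this state by hand, something like the following should work from the shell (the /var/unbound/unbound.conf path is where OPNsense keeps its chrooted Unbound config as far as I know; adjust if yours differs):

# Ask the running daemon for its status and uptime
unbound-control -c /var/unbound/unbound.conf status
# Show which servers Unbound would query for a name; with priming stuck,
# it has no root server addresses to delegate from
unbound-control -c /var/unbound/unbound.conf lookup mtalk.google.com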


The fix is very easy: manually restart Unbound from the GUI. Then, instantly, everything returns to normal.
The first few messages in the log then look like this:

<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4577"] [94863:3] info: resolving prod.amcs-tachyon.com. A IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4578"] [94863:3] info: priming . IN NS
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4579"] [94863:3] info: response for . NS IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4580"] [94863:3] info: reply from <.> 199.7.91.13#53
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4581"] [94863:3] info: query response was ANSWER
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4582"] [94863:3] info: priming successful for . NS IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4583"] [94863:0] info: control cmd:  list_local_data


(The "priming successful for . NS IN" message appears and subsequent queries are resolved.)

If I reboot OPNsense, the same story repeats: I have to wait ~2 minutes, then restart Unbound to make it work.
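(For what it's worth, the restart does not have to go through the GUI. Over SSH the same thing can be done like this, assuming the usual OPNsense configctl service actions, which is handy when testing reboots repeatedly:

# Restart Unbound via configd, same as the GUI restart button
configctl unbound restart
# Then verify it answers immediately
drill mtalk.google.com @127.0.0.1
)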

Have you encountered such an issue? What may be causing this?

Thanks,
Zbigniew





Maybe it's simply how long it takes with the given hardware to get up and running.

Nope. If it took that long to start due to weak HW, why does it start instantly following a manual process restart?

This is a VM, VM specs have not changed and I can switch between versions (rollback) instantly to test different versions.


In 22.7.9 and before, nothing like that happened: Unbound instantly accepted connections and replied to queries following a reboot.

In 22.7.10 this 2-minute delay appeared, but after that everything eventually worked fine. With standard logging, here's what this delay looks like:

<165>1 2023-01-28T20:40:00+01:00 opnsense.localdomain unbound 46405 - [meta sequenceId="70"] daemonize unbound dhcpd watcher.
<30>1 2023-01-28T20:41:43+01:00 opnsense.localdomain unbound 36408 - [meta sequenceId="1"] [36408:3] info: generate keytag query _ta-4f66. NULL IN
<30>1 2023-01-28T20:41:43+01:00 opnsense.localdomain unbound 36408 - [meta sequenceId="2"] [36408:1] info: generate keytag query _ta-4f66. NULL IN


In 22.7.11 and 23.1, it takes these two minutes to accept connections, plus it never replies to queries ("priming . IN NS" repeats forever) until I restart Unbound.

What does this "priming . IN NS" message mean?


A quick start when restarted manually is what I would expect. I was referring to the time from boot for all services, including Unbound, to be up and serving requests. 2 mins sounds about right on weak hardware and/or with very many services. There is only so much parallelism can achieve.
It would probably be useful to give the settings you have in use to see what might be happening.

My HW is nowhere near weak (4-core i3-10100K), and everything works fine when I go back to 22.7.9.

I'm still investigating and have no proof for that but I think it is a kind of timing/race condition between Unbound and WAN interface(s).

After enabling a higher level of logging in Unbound, it seems that Unbound behaves as if it had no network connection to the Internet when it was launched during boot. The WAN interface comes up in the meantime and traffic is already passing from LAN to WAN, but Unbound cannot see that and still waits for the WAN connection.

Lack of Internet/WAN access creates this 100-second delay (it's not 2 minutes as I said originally). Eventually, after 100 seconds, something in Unbound times out; Unbound starts and accepts connections on port 53, but then it hangs forever on priming the root servers and does not resolve any queries until it is restarted manually. For some reason it is not able to detect that the WAN became accessible in the meantime.
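For reference, here is how I can tell which of the two failure stages I'm in from the console (sockstat and drill are both FreeBSD base tools):

# Stage 1 (first ~100 s): Unbound is not listening yet - no :53 sockets shown
sockstat -l | grep :53
# Stage 2 (after ~100 s): it listens, but every query ends in SERVFAIL
drill www.example.com @127.0.0.1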

If I restart it manually after boot, it sees the WAN connection from the beginning and starts almost immediately.

I have to investigate deeper, but what makes it difficult is that this happens only during boot, and all manual changes to config files made for testing get erased during reboot.


I have nothing special set in the Unbound configuration.
It is allowed to use all interfaces for outgoing connections (the recommended setting).
It is configured to listen on two LAN interfaces (I have two LAN subnets).

Other enabled "general" options:

  • Enable DNSSEC Support (disabling it does not make the issue go away)
  • Register DHCP leases
  • Register DHCP static mappings
  • There's one host override

Everything else is on defaults; DNSBL is disabled.
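For reference, my understanding is that this setup boils down to a generated server section roughly like the sketch below (the addresses are made-up examples; the real file is assembled by OPNsense, in /var/unbound/unbound.conf if I'm not mistaken):

server:
  interface: 192.168.1.1        # LAN 1 listen address (example)
  interface: 192.168.2.1        # LAN 2 listen address (example)
  # no outgoing-interface set, i.e. all interfaces may be used for
  # outgoing queries (the recommended setting mentioned above)
  module-config: "validator iterator"   # DNSSEC validation enabled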

It might be worth raising it on GitHub as an issue. It looks like a very simple setup; strange indeed.
For keeping logs, something I've sometimes done is temporarily send logs to a separate log server. It hasn't always helped, unfortunately.
One more thought: are you using IPv6? If not, have you disabled it in your setup? The thinking is (along the same lines as yours) that Unbound can't reach the root servers until the WAN is passing traffic, and it is maybe attempting to set up routes over v6. What settings do you have in System > Settings > General?

Hello,

I spent the whole evening researching this issue.

What I'm certain of is that the last known good OPNsense version for me is 22.7.9_3. The issue started with 22.7.10_2; possibly I ignored it then (restarted Unbound manually and forgot about it). Then I noticed it again in 22.7.11 and started researching; meanwhile 23.1 got released. It has not improved in 23.1, so I asked for help here.

I can't blame Unbound directly because 22.7.10_2 uses Unbound 1.17.0, i.e. the same Unbound version as 22.7.9_3, which works fine there but not in 22.7.10. Unbound 1.17.1 first appeared in 22.7.11, I believe.

That said, I noticed there were some changes around Unbound in 22.7.10, because OPNsense's dnsbl_module.py is significantly different in 22.7.10 than in 22.7.9. Yet I tried using the script taken from 22.7.9 in later versions of OPNsense and it does not help.


In the good case (22.7.9 and below), as soon as I get the login prompt on the VM "physical" console (probably less than a minute after the FreeBSD bootloader starts) and log on as root, nslookup <domain> works fine and resolves queries instantly. SSH and the Web GUI also work as soon as I get the login prompt on the console.


When the issue happens (OPNsense 22.7.10 and above), it manifests itself this way:

  • The FreeBSD bootloader starts and I'm presented with the login prompt on the VM "physical" console in probably less than a minute (so no change versus 22.7.9).
  • I log in to the console and try to execute nslookup <domain>, and I get timeouts connecting to 127.0.0.1#53. If I retry the command a few times, it times out again and again. (At this time it is not possible to log in via SSH or open the Web interface.)
  • After a minute or more, I no longer get timeouts in nslookup; instead I get SERVFAIL messages when trying to resolve any domain name. (At this stage I can already connect via SSH/Web.)
  • This state lasts "forever" (the longest I waited was probably ~10 minutes) until I restart Unbound from the GUI. Then Unbound accepts connections and resolves queries instantly.


It's easy to capture in the resolver log: enable a higher level of logging in Unbound, reboot, then review what's in the log. The problem is I cannot interpret it sufficiently well myself. With the default level of logging (1) there are hardly any messages in it; with higher levels it gets flooded with messages.
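(For a quick look at the running daemon, the verbosity can also be bumped at runtime with unbound-control's verbosity command; note this does not survive a reboot, so for boot-time capture the GUI log level setting is the way. The -c path is again my assumption about the OPNsense layout:

# Raise logging to level 2 on the running daemon, then watch the log
unbound-control -c /var/unbound/unbound.conf verbosity 2
tail -f /var/log/resolver/latest.log
)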

I'm happy to upload the logs, but I'm not sure where I should do it. Unbound 1.17.0 does not seem to be at fault - it works well in 22.7.9 but not in 22.7.10, and 22.7.11 and 23.1 use 1.17.1. So I rather suspect OPNsense.

Quote from: klosz007 on January 31, 2023, 01:06:10 AM
In the good case (22.7.9 and below), as soon as I get the login prompt on the VM "physical" console
Is this a Virtualised OPN?

Yes, it's a VM on ESXi 7.0U3. I have been running it as a VM since the very beginning (it was a migration from pfSense, which I also ran as a VM on ESXi).

I'm bailing out then, sorry. There are too many factors at play and my ESXi hosts are shut down. Even those are on version 6, I think.

January 31, 2023, 12:27:01 PM #10 Last Edit: January 31, 2023, 12:52:50 PM by Fright
Quote: "Lack of Internet/WAN access creates this 100-second delay (it's not 2 minutes as I said originally)"
What is the nature of this lack? DHCP on WAN or something else?
Perhaps the reason is some use of old states when WAN conditions change, because of which successful priming does not occur.
Quote: "What does this 'priming . IN NS' message mean?"
Initial requests to the root servers for the "." records.
Obviously, if these requests do not end successfully, then the recursive server cannot serve client requests.
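You can also reproduce what priming does by hand: ask a root server directly for the root NS RRset (199.7.91.13 is the root server address seen in your working log):

# The same question Unbound asks during priming
drill . NS @199.7.91.13
# If this times out from the firewall itself while clients behind it can
# ping the same address, the firewall's own default route is suspect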



Yes, my primary WAN uses DHCP from the provider (it's a dynamic public IP).
However, by the time I'm able to log on at the console, the WAN IP is already displayed in the summary table that precedes the login prompt, so the IP address is acquired quickly, during OPNsense bootup. Nothing has changed in this respect for as long as I've been using OPNsense.

100 seconds (roughly) is how much time it takes for Unbound to start accepting UDP connections after its own startup (and even after that it is not able to resolve any queries until manually restarted). It started to behave like this in 22.7.10; in 22.7.9 the Unbound process startup is near instant.

I'm wondering what has changed in regard to Unbound between OPNsense 22.7.9 and 22.7.10 (the Unbound version is still the same: 1.17.0).

Thanks. So what messages can you see in the general log if you filter it with "execute task : unbound_configure_do" after boot (is newwanip there?)?
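(For reference, a shell equivalent of that filter, assuming the general log lives under /var/log/system the same way the resolver log lives under /var/log/resolver:

grep "unbound_configure_do" /var/log/system/latest.log
grep "newwanip" /var/log/system/latest.log
)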
Quote: "unbound process startup is near instant"
Funny, but maybe that's the reason (if the cause turns out to be in the states)?  :o

February 02, 2023, 12:00:18 AM #13 Last Edit: February 02, 2023, 12:02:55 AM by klosz007
Hello again,

Tadaaaa.....

Long story short: I found out what is causing it, but not the root cause. It must be a bug in OPNsense itself, from 22.7.10 onwards. Keep reading (this will be quite long).

I visited my physical work office today (usually I work from home) and met and spoke with a good colleague whom I had previously convinced to use OPNsense. We did some brainstorming; I returned home, made a few more tests, and voila!

Unbound is not at fault, nor is my WAN, nor the fact that my installation runs on a VM.

I was fooled by the fact that I could ping Internet addresses from my PC (by IP, not DNS name) instantly once the OPNsense VM booted up, which made me assume that routing on the firewall was fine. I was so wrong.

I have multiple gateways - two WAN gateways (primary via DSL cable, backup via LTE modem) and multiple gateways associated with VPN servers and clients (OVPN, Zerotier, Wireguard).

Traffic from LAN is routed (via a firewall rule) through a gateway group (i.e. both WAN gateways with tiers defined). That part works flawlessly, and that's why I was able to ping the Internet from LAN.

But the firewall itself (including Unbound) uses the system routing table (not the gateway group) and default gateway switching.

The WAN gateways have a lower priority set (such as 200 for primary and 210 for backup), while all remaining gateways use the default 255. Additionally, the WAN gateways are marked as 'upstream' gateways.

Up to and including 22.7.9 everything worked fine in this regard: as soon as OPNsense boots and I can log in on the VM console, my primary WAN gateway is selected as the default gateway (netstat -rn or GUI).

Starting from 22.7.10 this is badly broken. As soon as OPNsense boots up, both WAN gateways quickly receive their IPs, but the default gateway (netstat -rn) is set to... ovpns1's IP 10.8.0.2... WT* ???

It takes a couple of minutes until the correct gateway (the primary WAN GW) is selected as the default GW. Before that happens, Unbound obviously does not have access to the Internet, hence it struggles to start up and prime the root servers. It gives up before the correct GW is elected.

If you go to gateway list in GUI during that time, both WAN GWs are marked red Down, even though both are responding to pings, have short RTT and ovpns1 is marked as "active" gateway.


Later on, when I restart Unbound manually from the GUI, the correct GW is already elected, and Unbound then starts up instantly, just as it did during boot up to and including 22.7.9.

So I was right that it is a 'timing' issue, but in a different place than I initially thought. Unbound starts before the correct GW is elected (hence it has no access to the Internet), but it's the default GW selection that takes way too much time in 22.7.10 and up.

I have tested all the versions I had in VM snapshots, and the last version that selects the primary WAN as the default GW instantly during boot is 22.7.9_3. It is broken from 22.7.10 up to the current 23.1_6.

In a last leap of faith that I could repair it, I set up a new VM with 23.1_6 (I improved things versus the previous incarnation: now UEFI and ZFS) and restored the config from the old/existing 23.1 VM; the same issue instantly returned.

What should I do to get this fixed? What logs are needed?
Maybe I should start another thread?




One more update: it's probably because of how dpinger works in 22.7.10 and up. Have there been any changes to dpinger in 22.7.10?

Even though both WAN GWs instantly respond to pings once they get their DHCP IPs (I'm using the Google DNS addresses as 'test' IPs), they are both initially marked down. That's because they both seem to start with 100% reported packet loss, and this number slowly goes down over time until it drops below the 'bad' threshold and both are considered alive. Only then is one of them (the primary) elected to become the default GW, rather than my OpenVPN server's GW.

This all takes too much time for Unbound to start successfully. It gives up and is not able to self-recover from this situation of no Internet access (why it cannot is another question). A manual restart of Unbound (after the correct default GW is selected) repairs Unbound instantly.
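To see what monitoring parameters dpinger is actually started with (and whether they changed between versions), the process list is enough. I'm not sure of the exact flags OPNsense passes, so checking the live command line seems the safest way to compare:

# One dpinger instance runs per monitored gateway; its command line shows
# the probe interval, loss window and alarm thresholds in use
ps auxww | grep '[d]pinger'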

This must have worked differently up to 22.7.9: both GWs are considered good right from the start, the correct GW (primary WAN) is selected as soon as OPNsense boots up, hence Unbound has no issues accessing the Internet, primes the root servers, and comes up fast.