OPNsense Forum

Archive => 23.1 Legacy Series => Topic started by: klosz007 on January 28, 2023, 10:10:45 PM

Title: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: klosz007 on January 28, 2023, 10:10:45 PM
Hi,

I have been dealing with this issue for some time with no luck. It first appeared for me in 22.7.10 (I believe). After each reboot it took significant time (say 2 minutes) for Unbound to accept UDP connections. During that time opnsense was not accepting Web interface or SSH connections. If I just waited for these 2 minutes, then Unbound eventually started and it all worked fine then.

It got worse in 22.7.11 though.
I was hoping that 23.1 will resolve it but I upgraded to 23.1 today and the same issue is still there.

Now it not only takes 2 minutes for Unbound Dto accept UDP connections (and for Web GUI and SSH to start working), but after that time all DNS queries fail with 'serverfail' message. Theres nothing really informative in Unbound log (/var/log/resolver/latest.log).

When I increased logging to level 2 then each query in the log looks like that:

<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="701"] [25609:3] info: resolving mtalk.google.com. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="702"] [25609:3] info: priming . IN NS
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="703"] [25609:0] info: resolving mtalk.google.com. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="704"] [25609:0] info: priming . IN NS
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="705"] [25609:3] info: resolving mtalk.google.com.localdomain. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="706"] [25609:3] info: priming . IN NS


So basically every attempt to resolve a name is followed by "priming . IN NS" message.


Fix is very easy - manually restart Unbound from GUI. Then, instantly, everything returns to normal.
Then few first messages in the log looks like that:

<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4577"] [94863:3] info: resolving prod.amcs-tachyon.com. A IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4578"] [94863:3] info: priming . IN NS
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4579"] [94863:3] info: response for . NS IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4580"] [94863:3] info: reply from <.> 199.7.91.13#53
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4581"] [94863:3] info: query response was ANSWER
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4582"] [94863:3] info: priming successful for . NS IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4583"] [94863:0] info: control cmd:  list_local_data


(priming successful for . NS IN message appears and following queries are resolved).

If I reboot opnsense, same story repeats, I have to wait ~2 minutes, then restart Unbound to make it work.

Have you encountered such issue ? What may be causing this ?

Thanks,
Zbigniew




Title: Re: Unbound does not resolve queries after reboot
Post by: cookiemonster on January 28, 2023, 10:46:27 PM
Maybe it simply how long it takes with the given hardware to get up and running.
Title: Re: Unbound does not resolve queries after reboot
Post by: klosz007 on January 28, 2023, 11:10:55 PM
Nope. If it would take that long to start due to weak HW, why does it start instantly following manual process restart ?

This is a VM, VM specs have not changed and I can switch between versions (rollback) instantly to test different versions.


In 22.7.9 and before nothing like that was happening. Unbound was instantly accepting connections and replying to queries following a reboot.

In 22.7.10 this 2-minute delay appeared but after that it eventually worked fine. With standard logging here's how this delay looks like:

<165>1 2023-01-28T20:40:00+01:00 opnsense.localdomain unbound 46405 - [meta sequenceId="70"] daemonize unbound dhcpd watcher.
<30>1 2023-01-28T20:41:43+01:00 opnsense.localdomain unbound 36408 - [meta sequenceId="1"] [36408:3] info: generate keytag query _ta-4f66. NULL IN
<30>1 2023-01-28T20:41:43+01:00 opnsense.localdomain unbound 36408 - [meta sequenceId="2"] [36408:1] info: generate keytag query _ta-4f66. NULL IN


In 22.7.11 and 23.1, it takes these two minutes to accept connections plus it never replies to queries ("priming . IN NS" repeat forever) until I restart Unbound.

What does this "priming . IN NS" message mean ?

Title: Re: Unbound does not resolve queries after reboot
Post by: cookiemonster on January 28, 2023, 11:17:39 PM
A quick start when manual is what I would expect. I was referring to the time from boot to have all services including Unbound up and serving requests. 2 mins sounds about right on weak hardware and/or very many services. There is only so much parellelism can achieve.
It would probably be useful to give the settings you have in use to see what might be happening.
Title: Re: Unbound does not resolve queries after reboot
Post by: klosz007 on January 29, 2023, 03:14:50 PM
My HW is nowhere to be weak (4-core i3-10100K), everything works fine when I go back to 22.7.9.

I'm still investigating and have no proof for that but I think it is a kind of timing/race condition between Unbound and WAN interface(s).

After enabling higher level of logging in Unbound, it seems like Unbound behaves if it had no network connection to Internet when it was launched during boot. WAN interface goes up in the meantime and traffic is already passing from LAN to WAN but Unbound cannot see it and still waits for WAN connection.

Lack of Internet/WAN access creates this 100 second delay (it's not 2 minutes as I said originally), eventually, after 100 seconds, something in Unbound times out and Unbound starts and accepts connections on port 53 but then it is hung forever on "priming root servers" and does not resolve any queries until it is restarted manually. For some reason it is not able to see/detect that WAN became accessible in the meantime.

If I restart it manually after boot, then it sees WAN connection from the beginning and starts almost immediately.

I have to investigate deeper but what makes it difficult is the fact that it happens only during boot and all manual changes to config files to test things get erased during reboot.


I have nothing special set in Unbound configuration.
It is allowed to use all interfaces for outgoing connections (recommended setting).
It is configured to listen on two LAN interfaces (I have tow LAN subnets).

Other enabled "general" options:

Everything else on defaults, DDNSBL disabled.
Title: Re: Unbound does not resolve queries after reboot
Post by: cookiemonster on January 29, 2023, 08:45:38 PM
It might be worth raising on github as an issue. It looks a very simple setup, strange indeed.
For keeping logs, something I've done sometimes is temporarily send logs to a separate log server. It hasn't always helped unfortunately.
One more thought. Are you using IPv6? If not, have you disabled it on your setup? The thinking is (along the same lines as yours) that Unbound can't reach the root servers until the WAN is passing traffic, and maybe attempting to setup routes in v6. What settings do you have in System > Settings > General.?
Title: Re: Unbound does not resolve queries after reboot
Post by: klosz007 on January 31, 2023, 01:06:10 AM
Hello,

I spent whole evening researching this issue.

What I'm certain of is that last known good OPNsense version for me is 22.7.9_3. The issue started with 22.7.10_2, possibly I ignored it then (restarted Unbound manually and forgot about it). Then I noticed it again in 22.7.11, started researching and meanwhile 23.1 got released, it has not improved in 23.1 so asked for help here.

I can't blame Unbound directly because 22.7.10_2 uses Unbound 1.17.0 i.e. the same Unbound version as 22.7.9_3 uses, which works fine there but not in 22.7.10. Unbound 1.17.1 first appeared in 22.7.11 I believe.

Albeit I noticed there were some changes in 22.7.10 around Unbound because OPNsense's dnsbl_module.py is significantly different in 22.7.10 than in 22.7.9. Yet I tried to use this script taken from 22.7.9 in later versions of OPNsense and it does not help.


In good case (22.7.9 and below), as soon I get logon prompt on VM "physical" console (probalby less than a minute after FreeBSD bootloader starts) and I logon as root, nslookup <domain> works fine and is able to resolve queries instantly. Also SSH and Web GUI work instantly as soon as I get logon prompt on console.


When the issue happens (OPNsense 22.7.10 and above), it exhibists itself this way:
FreeBSD bootloader starts, I'm presented with logon prompt on VM "physical" console in probably less than a minute (so no change versus 22.7.9).
I login to console and try to execute nslookup <domain> and I'm getting timeouts connecting to 127.0.0.1#53. If I retry this command a few times command - timeouts again and again.
(At this time it is not possible to log in via SSH or open Web interface.)
After a minute or more, I no longer get timeouts in nslookup, instead I get SERVFAIL messages when trying to resolve any domain name.
(At this stage I can already connect via SSH/Web).
This state will last "forever" (the longest time I waited was probably ~10 mins) until I restart Unbound from GUI. Then Unbound is able to accept connections and resolve queries instantly.


It's easy to capture it in the resolver log - enable higher level of logging in Unbound and reboot. Then review what's in the log. The problem is I cannot interpret it sufficently well myself. With default level of logging (1) there are hardly any messages in it. With higher levels of logging it gets flooded with messages.

I'm happy to upload the logs but I'm sure where I should do it. Unbound 1.17.0 does not seem to be at fault - it does work well in 22.7.9 but not in 22.7.10. 22.7.11 and 23.1 use 1.17.11. So I rather suspect OPNsense.
Title: Re: Unbound does not resolve queries after reboot
Post by: cookiemonster on January 31, 2023, 10:30:20 AM
Quote from: klosz007 on January 31, 2023, 01:06:10 AM
In good case (22.7.9 and below), as soon I get logon prompt on VM "physical" console
Is this a Virtualised OPN?
Title: Re: Unbound does not resolve queries after reboot
Post by: klosz007 on January 31, 2023, 10:54:07 AM
Yes, it's a VM on ESXi 7.0U3. I have been running it as a VM since very beginning (=migration from pfSense which I also ran as a VM on ESXi).
Title: Re: Unbound does not resolve queries after reboot
Post by: cookiemonster on January 31, 2023, 11:53:23 AM
I'm bailing out then, sorry. There are too many factors at play and my ESXi hosts are shutdown. Even those are on version 6 I think.
Title: Re: Unbound does not resolve queries after reboot
Post by: Fright on January 31, 2023, 12:27:01 PM
QuoteLack of Internet/WAN access creates this 100 second delay (it's not 2 minutes as I said originally)
what is the nature of this lack? dhcp on wan or something else?
perhaps the reason is some use of old states when wan conditions change. because of what a successful priming does not occur
QuoteWhat does this "priming . IN NS" message mean ?
initial requests to root servers for "." records
obviously, if these requests do not end successfully, then the recursive server cannot serve client requests


Title: Re: Unbound does not resolve queries after reboot
Post by: klosz007 on January 31, 2023, 06:28:22 PM
Yes, my primary WAN uses DHCP from provider (it's dynamic public IP).
However, at the moment when I'm able to log on at console, WAN IP is already displayed in the summary table that precedes logon prompt so IP address is acquired quickly/during opnsense bootup. Nothing changed in this respect ever since I'm using opnsense.

100 seconds (rougly) is how much time it takes for unbound to start accepting UDP connections since its own startup (after that it is not able to resolve any queries until manually restarted). It started to behave like this in 22.7.10, in 22.7.9 unbound process startup is near instant.

I'm wondering hat has changed in in regards to unbound between opnsense 22.7.9 and 22.7.10 (unbound version is still the same - 1.17.0).
Title: Re: Unbound does not resolve queries after reboot
Post by: Fright on January 31, 2023, 07:11:37 PM
thanks. so what messages in general log you can see if you filter it with "execute task : unbound_configure_do" after boot (is newwanip there?)?
Quoteunbound process startup is near instant
funny, but maybe that's the reason (if the case turns out to be in the states)?  :o
Title: Re: Unbound does not resolve queries after reboot
Post by: klosz007 on February 02, 2023, 12:00:18 AM
Hello again,

Tadaaaa.....

Long story short - I found out what is causing that but not a root cause. Must be a bug in OPNsense itself, starting from 22.7.10 and up. Keep on reading (will be quite long).

I visited my physical work office today (usually I work from home), met and spoke with a good colleague whom I convinced to use OPNsense previously. We did some brainstorming, I returned home, then made a few more tests and voila !

Neither Unbound is at fault, neither my WAN, nor the fact that my installation runs on a VM.

I got fooled by a fact that I was able to ping Internet addresses from my PC (using IP, not DNS name) instantly once OPNsense VM booted up, which made me assume that routing on firewall is fine. I was so wrong.

I have multiple gateways - two WAN gateways (primary via DSL cable, backup via LTE modem) and multiple gateways assosciated with VPN servers and clients (OVPN, Zerotier, Wireguard).

Traffic from LAN is routed (via firewall rule) through gateway group (i.e. both WAN gateways with tiers defined) - that part works flawlessly and that's why I was able to ping INternet from LAN.

But the firewall itself (including Unbound) uses system routing table (not the gateway group) and default gateway switching.

WAN gateways have lower priority set (such as 200 for primary and 210 for backup) while all remaining gateways use default 255. Additionaly WAN gateways are marked as 'upstream' gateways.

Until 22.7.9 everything worked fine in this part - as soon as OPNsense boots and I can login on VM console, my primary WAN gateway is selected as default gateway (netstat -rn or GUI).

Starting form 22.7.10 this is badly broken. As soon as OPNsense boots up, both WAN gateways quickly receive their IPs but default gateway (netstat -rn) is set to... ovpns1's IP 10.8.0.2... WT* ???

It takes a couple of minutes until correct (=primary WAN GW) is selected to be the default GW. Before that happens, Unbound obviously does not have access to Internet hence struggling ot start up and prime root servers. It surrenders before correct GW is elected.

If you go to gateway list in GUI during that time, both WAN GWs are marked red Down, even though both are responding to pings, have short RTT and ovpns1 is marked as "active" gateway.


Later on, when I restart Unbound manually from GUI, correct GW is already elected and then Unbound starts up instantly, as it would during boot up before and at 22.7.9.

So I was right that it is a 'timing' issue but in a different place then I initially thought. Unbound starts before correct GW is elected (hence it has no access to Internet) but it's the default GW selection taking way too much time in 22.7.10 and up.

I have tested all versions that I had in VM snapshots and the last version that selects primary WAN as default GW instantly, during boot up, is 22.7.9_3. 22.7.10 has it broken up to current 23.1_6.

In last leap of faith that I can repair it, I set up new VM with 23.1_6 (I improved things versus previous incarnation - now UEFI and ZFS), restored config from the old/existing 23.1 VM, instantly same issue returns.

What should I do to get it fixed ? What logs are needed ?
Maybe I should start another thread ?



Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: klosz007 on February 02, 2023, 12:30:39 AM
One more update - it's probably becuase of how dpinger works in 22.10 and up.  Have there been any changes to dpinger in 22.10 ?

Even though both WAN GW's instantly respond to pings once they get their DHCP IPs (I'm using Google DNS as 'test' IPs), they are both initially marked down. That's becuase they both seem to start with 100% reported packed loss and this number slowly goes down with time until they go below 'bad' threshold are both considered alive. Then one of them (primary) is elected to become default GW, rather than my OpenVPN server's GW. This all takes too much time for Unbound to start succesfully. It surrenders and is not able to self-recover from this situation of no Internet access (that another question why). Manual restart of Unbound (after correct default GW is selected) repairs Unbound instantly.

This must have worked differently in 22.9. Both GWs are conisdered good right from the start, correct GW (primary WAN) is selected as soon OPNsense boots up, hence Unbound has no issues accessing Internet, primes root servers and goes up fast.
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: klosz007 on February 09, 2023, 11:46:40 PM
Hi again,

After tedious investigation I found root cause of this but I still believe it is bug (or change in behavior at least...) in 22.7.10+.

I realized that upon each reboot, I am not able to ping anything from OPNsense for some time (time is random/varies with each reboot, can be as short as 30 seconds, can be las long as 2 minutes). I cannot ping anything in my two LANs, I cannot ping my WAN LTE router, not mentioning Internet.

Any attempt to run ping from firewall results in this weird message:

ping: sendto: permission denied

This has consequences - monitoring of both WAN gateways fails so they are considered "down", hence OpenVPN gateway is selected as default gateway, Without proper default gateway, Unbound cannot contact root servers so it cannot start up.

After that "some time", pings eventually start to work, then both WAN GW's recover and come online, finally my primary WAN/DSL GW becomes active one. Unfortunately Unbound cannot self-recover from this situation (prolonged lack of access to root servers) and has to be manually restarted, then it comes up instantly.


So why is this happening ? I realized that if I reset my OPNsense config to default and go through at least minimal setup then this issue is not there. So it must be something in my (rather complex) configuration.

I took me a lot of time to find this and it all seems to be related to shaper. I have created shaper config (quite simple setup, like the one described here: https://maltechx.de/en/2021/03/opnsense-setup-traffic-shaping-and-reduce-bufferbloat) a long time ago, to resolve bufferbloat issues and have not been changing/touching this ever since.

It is sufficent to just create and enable pipes, without queues or rules even created for this weird issue to appear in 22.7.10+.
If I delete (or disable) pipes, everything goes back to normal. As soon they get created and enabled (without queues/rules even created or enabled), it instantly returns - it is not possible to ping anything after each reboot.
And as said previously, it does not happen in 22.7.9, all version starting from 22.7.10 are affected for me.

So for the moment I just disabled shaper and I'm on 23.1.6 and all is fine.

Any ideas why this is happening ?
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: SLNetworks on February 12, 2023, 10:30:36 PM
Maybe I was lucky with 22.7.* as I never had any issues.

23.1 however, is a different story. My computers all pass traffic through OPNsense, but pinging domain names from the OPNsense interface, has no such luck. Why OPNsense can ping the IP in the audit below and get a reply is puzzling.

Do I need to revert back to 22.*?

***GOT REQUEST TO AUDIT CONNECTIVITY***
Currently running OPNsense 23.1_6 at Sun Feb 12 15:36:29 CST 2023
Checking connectivity for host: pkg.opnsense.org -> 89.149.211.205
PING 89.149.211.205 (89.149.211.205): 1500 data bytes
1508 bytes from 89.149.211.205: icmp_seq=0 ttl=46 time=130.093 ms
1508 bytes from 89.149.211.205: icmp_seq=1 ttl=46 time=119.261 ms
1508 bytes from 89.149.211.205: icmp_seq=2 ttl=46 time=129.910 ms
1508 bytes from 89.149.211.205: icmp_seq=3 ttl=46 time=252.718 ms

--- 89.149.211.205 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 119.261/157.996/252.718/54.863 ms
Checking connectivity for repository (IPv4): https://pkg.opnsense.org/FreeBSD:13:amd64/23.1
Updating OPNsense repository catalogue...
pkg: https://pkg.opnsense.org/FreeBSD:13:amd64/23.1/latest/meta.txz: No address record
repository OPNsense has no meta file, using default settings
pkg: https://pkg.opnsense.org/FreeBSD:13:amd64/23.1/latest/packagesite.pkg: No address record
pkg: https://pkg.opnsense.org/FreeBSD:13:amd64/23.1/latest/packagesite.txz: No address record
Unable to update repository OPNsense
Error updating repositories!
Checking connectivity for host: pkg.opnsense.org -> 2001:1af8:4f00:a005:5::
ping: UDP connect: No route to host
Checking connectivity for repository (IPv6): https://pkg.opnsense.org/FreeBSD:13:amd64/23.1
Updating OPNsense repository catalogue...
pkg: https://pkg.opnsense.org/FreeBSD:13:amd64/23.1/latest/meta.txz: Non-recoverable resolver failure
repository OPNsense has no meta file, using default settings
pkg: https://pkg.opnsense.org/FreeBSD:13:amd64/23.1/latest/packagesite.pkg: Non-recoverable resolver failure
pkg: https://pkg.opnsense.org/FreeBSD:13:amd64/23.1/latest/packagesite.txz: Non-recoverable resolver failure
Unable to update repository OPNsense
Error updating repositories!
***DONE***
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: franco on February 13, 2023, 07:32:57 AM
Some of this reboot-related breakage might be https://github.com/opnsense/core/commit/6d22e7b68a2a fixed last week and going into 23.1.1.

# opnsense-patch 6d22e7b68a2a

It looks like this is influenced by FreeBSD 13 since 22.1 on top of one security fix and one reliability fix both in 22.7 which surfaced the problem when default gateway switching was enabled and the gateway chosen by the default gateway switching was a static address...


Cheers,
Franco
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: SLNetworks on February 16, 2023, 09:49:36 AM
When should we expect a fix for this?
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: franco on February 16, 2023, 10:25:03 AM
Simple answer: yesterday, perhaps?

But maybe the question wasn't precise enough regarding which "patch". The commit as published above? A release? A patch for a different issue? A follow up patch for the patch for "reasons"?


Cheers,
Franco
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: SLNetworks on February 16, 2023, 10:38:58 AM
So to skip your smartassery, has that patch been applied to the lease, or not yet? The version hasn't changed to indicate something has been fixed, unless it's too small to warrant such..?
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: franco on February 16, 2023, 10:50:02 AM
I'm not sure there is anything to add here that can't be read from my last two replies. ;)

And if you want to report a new bug feel free, but don't use and derail an ongoing discussion.

Your behaviour has been noted and so consider this as a warning to remain civil.


Cheers,
Franco
Title: Re: Unbound does not resolve queries after reboot (broken default GW selection)
Post by: franco on February 16, 2023, 12:57:35 PM
Pretty confident we cleared the air via PM. The question here still stands: 23.1.1 behaving better in this regard or not? If not it might be better to add a fresh Github issue to add technical depth and hopefully steps to reproduce.


Cheers,
Franco