Dnsmasq+Unbound observations in 25.1.7

Started by OPNenthu, May 19, 2025, 07:13:28 PM

May 21, 2025, 02:26:08 AM #45 Last Edit: May 21, 2025, 02:34:33 AM by OPNenthu
Quote from: meyergru on May 21, 2025, 01:35:07 AM
Yes, with or without DoT configured, I can see outbound traffic on port 53 with strict DoT on. Switching "Use System Nameservers" changed nothing on that. The targets always were my system resolvers. IDK why this is, because even if DoT is disabled, I would expect Unbound to act as DNS resolver, not using any upstream servers.

And then - oh, well: https://github.com/opnsense/core/issues/7639 and: https://github.com/NLnetLabs/unbound/issues/451


Fascinating.

So the reason I never saw this until now is because my system was configured in a specific way:
  - No global resolvers defined in System->Settings->General
  - Only one internal domain (h1.home.arpa) for all my VLANs

I only moved to multiple domains now because of the Dnsmasq migration guide.

While the issue of external requests getting forwarded to the internal zone may have been present, I at least was not seeing any unexpected traffic to public resolvers on port 53 because... I simply didn't have any defined.  I also was not seeing other internal zones not getting forwarded to because... I simply didn't have any defined.

Seems Unbound forwarding is fundamentally broken, and has been. 
"The power of the People is greater than the people in power." - Wael Ghonim

Site 1 | N5105 | 8GB | 256GB | 4x2.5GbE i226-v
Site 2 |  J4125 | 8GB | 256GB | 4x1GbE i225-v

May 21, 2025, 09:44:34 AM #46 Last Edit: May 21, 2025, 09:46:08 AM by Monviech (Cedrik)
So I tested it this morning.

It might be a case of testing the domain resolution wrong, and might be specific to windows clients as suggested before.

When querying with nslookup:

# nslookup opnsense.org

This will append the Domain suffix to the query on the windows client itself, turning it into:

"opnsense.org.lan.internal"

This is seen in the Unbound logs and that is what causes the slow initial resolution (because Unbound tries to indeed forward "lan.internal" to dnsmasq since it has a query forwarding for it, but dnsmasq does not know that hostname so it fails).

I have verified that with Wireshark, the client will first query for domain+suffix, and then afterwards just for the domain.

In contrast, when using:

# nslookup opnsense.org.

(notice the dot after org)

This will resolve right away.

I have also tried stripping option 15 so that windows does not receive a domain suffix to use, and it had the same result.
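
For anyone who wants to reason about the lookup order described above, here is a minimal sketch in Python. It only models the behaviour observed in Wireshark (suffix first, then the bare name; a trailing dot disables the suffix). The function name `candidate_queries` is made up for illustration, and this is not how Windows implements it internally.

```python
def candidate_queries(name, search_suffixes):
    """Return the query names a stub resolver would try, in order.

    Models only the behaviour observed in this thread:
    - a name ending in "." is absolute and is sent unchanged
    - otherwise each search suffix is appended first, then the bare name
    """
    if name.endswith("."):
        # Absolute (fully qualified) name: no suffix is appended.
        return [name]
    queries = [f"{name}.{suffix.strip('.')}." for suffix in search_suffixes]
    queries.append(name + ".")
    return queries


# "lan.internal" is the suffix from the example above.
print(candidate_queries("opnsense.org", ["lan.internal"]))
print(candidate_queries("opnsense.org.", ["lan.internal"]))
```

The first call yields the suffixed name followed by the plain one, which is the extra query that ends up being forwarded to dnsmasq; the second call yields only the absolute name.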

--------

So, maybe the test is wrong and there are no issues after all?

"Wer viel misst, misst Mist" (roughly: "measure a lot and you measure rubbish")?
Hardware:
DEC740

May 21, 2025, 10:43:47 AM #47 Last Edit: May 21, 2025, 10:49:20 AM by meyergru
@Monviech: Windows is a special case, as you found out. E.g., by default it will not heed a DNS search list and domain name sent via DHCP (the domain name that is used is the one from the Windows hostname).
I go to great lengths to make sure that both requests for a plain hostname and hostname.domain work on the DNS server side because of that, including aliases for each and every name.



However, aside from these Windows-specific problems, we can see now that Unbound forwarding seems fundamentally broken, which is why OPNenthu saw the problems without any upstream servers.

Behind this, there lurks another problem: If you want to use private DNS via DoT, DoH, or DNSCrypt you are stuck right now, because:

1. With Unbound, DoT is implemented using forwarding. I have verified that this (sometimes) does not work at all and essentially leaks all DNS requests via unencrypted channels to upstream servers. This is an unfixed upstream issue. Not configuring upstream servers leads to timeout symptoms, so that is not a viable alternative.

2. Unbound would actually allow for DoH (which does not use the broken forwarding), but the OPNsense GUI does not offer it, so it is an alternative with manual configuration at best. Also, it is doubtful whether Unbound can be used at all in combination with DNSmasq, because the forwarded local domains still do not work correctly.

3. The most commonly used way to implement encrypted upstream DNS with DNSmasq is Stubby, but there is no package or plugin for OPNsense. Mimugmail has a package for Blocky, but that is manually configured and has basically no GUI.

4. There is still the DNSCrypt-Proxy plugin, which can handle DoT and DNSCrypt upstream and also working DNS forwards for local domains, but at this time, it will not work because of this unfixed GUI validation bug: https://github.com/opnsense/plugins/issues/4697, which effectively prevents using forwards for local domains to DNSmasq.

Currently, I know of no easy, working solution for a DNS server that handles encrypted upstream DNS with OPNsense.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

BTW if it helps, being a long-time Stubby user, I could contribute a writeup of my setup of it. I thought I had done it in the Tutorials sections but I could be wrong. Not all my intentions make it there.

May 21, 2025, 11:47:22 AM #49 Last Edit: May 21, 2025, 11:49:55 AM by meyergru
I can create a band-aid or manually configured variant, that works for me, as well, but I think normal users should have an option that is supported via the GUI.

My preferred way of solving this would be to make DNSCrypt-Proxy work by removing the validation bug. As a matter of fact, all that is needed here is a DNS server capable of DNS forwarding and encryption. Unbound with all of its complexity should not even be needed on top of DNSmasq, as others have pointed out. If DNSmasq were capable of encryption, none of this would be needed at all.

I proposed a pull request for DNSCrypt-Proxy, but I understand that the way I think the validation should work violates Deciso's coding guidelines. On the other hand, using the pre-fabricated validation type breaks existing setups. It is up to @Franco and @Monviech how to eventually solve it. Also, plugins normally have lower priority, which is why I wanted to highlight the implications of this Unbound defect unfolding.

P.S.: Even Kea + Unbound is not a viable solution in the light of this problem.

Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

https://github.com/opnsense/core/issues/8708

Please, no noise in that ticket if possible; it's only about the specific forwarding issues that happen due to domain suffixes being appended by some clients.

Quote from: meyergru on May 21, 2025, 11:47:22 AM
I can create a band-aid or manually configured variant, that works for me, as well, but I think normal users should have an option that is supported via the GUI.

For what it's worth, I'm using AGH that is listening on port 53 and forwarding queries for local and reverse domains to dnsmasq running on a different port. So similar to having unbound running on port 53 and handling everything non-local. AGH can forward to upstream providers using DoT or DoH. It is also available to configure via the GUI.

Granted, it's not part of the default opnsense offering and one has to add Mimugmail's repo to enable it.
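
To illustrate the split described above (local and reverse domains go to dnsmasq, everything else goes to an encrypted upstream), here is a small Python sketch of the routing decision only. It is not AGH's implementation or configuration syntax. The dnsmasq port `53053` and the upstream `tls://dns.example.net` are placeholders I made up; the zone names and the 192.168.60.0 network come from examples in this thread, with a /24 prefix assumed.

```python
import ipaddress


def reverse_zone(network):
    """Build the in-addr.arpa zone name for a /8, /16 or /24 IPv4 network."""
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError("this sketch only handles /8, /16 and /24 IPv4 networks")
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


def in_zone(qname, zone):
    """True if qname equals the zone or is below it (label-boundary match)."""
    q = qname.rstrip(".").lower()
    z = zone.rstrip(".").lower()
    return q == z or q.endswith("." + z)


def pick_upstream(qname, local_zones, local_upstream, default_upstream):
    """Send names in a local zone to the local server, the rest upstream."""
    for zone in local_zones:
        if in_zone(qname, zone):
            return local_upstream
    return default_upstream


# Placeholder values, not taken from any real setup.
DNSMASQ = "127.0.0.1:53053"
ENCRYPTED = "tls://dns.example.net"
LOCAL_ZONES = ["home.internal", "lab.internal", reverse_zone("192.168.60.0/24")]

for name in ("pve.lab.internal.", "2.60.168.192.in-addr.arpa", "opnsense.org"):
    print(name, "->", pick_upstream(name, LOCAL_ZONES, DNSMASQ, ENCRYPTED))
```

Matching on label boundaries matters here: a name like `notlab.internal` must not be treated as part of `lab.internal`.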

May 21, 2025, 12:48:08 PM #52 Last Edit: May 21, 2025, 01:07:56 PM by OPNenthu
Duly noted regarding the Windows behavior; I'll keep that in mind.  On that point I'd like to offer a rebuttal, however, on two grounds:

1. My web browser surely doesn't use nslookup, but since this issue started my web pages and mobile apps have been laggy.  There is a systemic DNS performance hit, but I don't have measurements to offer.

2. Adding the trailing dot (.) to nslookup only helps with external resolutions.  There is still a problem getting internal names to resolve in general.  Plain names don't resolve, with or without a dot.

Plain name, no ending dot:

C:\>nslookup pve
Server:  UnKnown
Address:  192.168.30.1

DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out

Plain name, with dot (not sure if this is valid):

C:\>nslookup pve.
Server:  UnKnown
Address:  192.168.30.1

*** UnKnown can't find pve.: Non-existent domain

Qualified name, with dot:

C:\>nslookup pve.lab.internal.
Server:  UnKnown
Address:  192.168.30.1

Non-authoritative answer:
DNS request timed out.
    timeout was 2 seconds.   <-- timeout
Name:    pve.lab.internal
Address:  192.168.60.2

Qualified name, no dot:

C:\>nslookup pve.lab.internal
Server:  UnKnown
Address:  192.168.30.1

DNS request timed out.
    timeout was 2 seconds.   <-- timeout
Non-authoritative answer:
DNS request timed out.
    timeout was 2 seconds.   <-- timeout
Name:    pve.lab.internal
Address:  192.168.60.2

I think I mentioned this in an earlier post- Dnsmasq is offering the wrong domain to static clients (which my Windows machine is).

Because my desktop IP falls outside of the DHCP range, it is not getting the expected "home.internal" domain.  Rather, it is getting the default "h1.home.arpa" domain.

I might expect this behavior if the static client were not entered into Dnsmasq at all, however it is.  I have a Host entry with "home.internal".  Screens attached.  (Note also that 'pve.lab.internal' is in there.)

Despite this host entry, the DHCP offer contains "h1.home.arpa":

Dynamic Host Configuration Protocol (Offer)
    Message type: Boot Reply (2)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0xfa34428a
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 192.168.30.2
    Next server IP address: 192.168.30.1
    Relay agent IP address: 0.0.0.0
    Client MAC address: ASUSTekCOMPU_xx:xx:xx (xx:xx:xx:xx:xx:xx)  (*redacted)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (Offer)
    Option: (54) DHCP Server Identifier (192.168.30.1)
    Option: (51) IP Address Lease Time
    Option: (58) Renewal Time Value
    Option: (59) Rebinding Time Value
    Option: (1) Subnet Mask (255.255.255.0)
    Option: (28) Broadcast Address (192.168.30.255)
    Option: (3) Router
    Option: (15) Domain Name
        Length: 12
        Domain Name: h1.home.arpa
    Option: (6) Domain Name Server
        Length: 4
        Domain Name Server: 192.168.30.1
    Option: (255) End

And this is backed up by ipconfig (also attached).
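
As a side note for anyone who wants to check such an offer outside of Wireshark: DHCP options are simple code/length/value records, so option 15 can be pulled out with a few lines of Python. This is a minimal sketch that only parses an options byte string; the sample bytes are rebuilt from the values in the capture above (domain name and DNS server), not taken from a real packet, and the helper names are my own.

```python
def encode_option(code, value):
    """Encode one DHCP option as code, length, value."""
    return bytes([code, len(value)]) + value


def parse_dhcp_options(data):
    """Parse a DHCP options field into a {code: value_bytes} dict."""
    options = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == 255:  # End option
            break
        if code == 0:  # Pad option, single byte without a length
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[code] = value
        i += 2 + length
    return options


# Options 15 and 6 as they appear in the offer above, followed by End (255).
raw = (
    encode_option(15, b"h1.home.arpa")
    + encode_option(6, bytes([192, 168, 30, 1]))
    + b"\xff"
)

opts = parse_dhcp_options(raw)
print("Option 15 length:", len(opts[15]))
print("Option 15 domain:", opts[15].decode("ascii"))
print("Option 6 server: ", ".".join(str(b) for b in opts[6]))
```

The length of 12 matches the "Length: 12" shown for option 15 in the capture.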


Anyhow, this is only one issue.  I just wanted to get this one acknowledged before moving on to the others.

---
EDIT:  To be clear, when I say my desktop PC is static, I do not mean it has a static IP in Windows.  It's set to DHCP.  What I mean is that there is a static reservation in Dnsmasq, with an IP, MAC, and domain.

The 'pve' host is in fact static.  I just added that Host entry for resolution purposes.
"The power of the People is greater than the people in power." - Wael Ghonim

Site 1 | N5105 | 8GB | 256GB | 4x2.5GbE i226-v
Site 2 |  J4125 | 8GB | 256GB | 4x1GbE i225-v

I can now confirm that all nslookup timeouts have been resolved by adding a single DNS server (1.1.1.1) in System > Settings > General, with "Use System Nameservers" unchecked in Unbound DNS Query Forwarding.

I can see in the firewall logs outbound 1.1.1.1:53 connections from the WAN address in addition to DoH connections that are configured in Unbound. If I create a firewall rule to block :53, the nslookup timeouts return.

I removed all Query Forwarding rules and included them in a /usr/local/etc/unbound.opnsense.d/local.conf file as per https://github.com/opnsense/core/issues/7639. This did not work; I received SERVFAIL errors for every query sent for internal domains, so this doesn't look like a solution - for my installation anyway.

As @OPNenthu mentioned, I've only started to experience this because I moved to multiple internal domains by following the Dnsmasq migration guide.

Quote from: Monviech (Cedrik) on May 21, 2025, 12:40:55 PM
https://github.com/opnsense/core/issues/8708

Please, no noise in that ticket if possible; it's only about the specific forwarding issues that happen due to domain suffixes being appended by some clients.

Thank you for logging this.  I apologize for asking (I'm mentally dragging a bit today), but does this address the internal resolution failures I posted about above as well?

Quote from: RutgerDiehard on May 21, 2025, 12:51:05 PM
I can see in the firewall logs outbound 1.1.1.1:53 connections from the WAN address in addition to DoH connections that are configured in Unbound. If I create a firewall rule to block :53, the nslookup timeouts return.

Yes, and I want to add an observation here as well (the next issue).

Encrypted DNS is not a factor in the leakage to the system resolvers.  I noticed that even when Unbound is in recursive mode with no DoT forwards enabled, the default resolver (1.1.1.1:53) will appear in the pf logs sprinkled in between the recursive hits to the authoritative name servers.

The leak does not depend on DoT being active.
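
To make the distinction concrete, here is a rough Python sketch of how one could bucket outbound flows once they have been extracted from the firewall log. It assumes the log has already been reduced to (destination IP, destination port) pairs; it does not parse the pf log format, and the sample flows are invented for illustration (192.0.2.53 is a documentation address standing in for an authoritative server).

```python
from collections import Counter


def classify(dst_ip, dst_port, system_resolvers):
    """Label one outbound flow by what kind of DNS traffic it looks like."""
    if dst_port == 53 and dst_ip in system_resolvers:
        return "plain DNS to system resolver"
    if dst_port == 53:
        return "plain DNS to other server"
    if dst_port == 853:
        return "DoT"
    return "other"


# Invented sample flows; real input would come from the firewall log.
flows = [
    ("1.1.1.1", 53),
    ("192.0.2.53", 53),
    ("1.1.1.1", 853),
    ("1.1.1.1", 53),
]

counts = Counter(classify(ip, port, {"1.1.1.1"}) for ip, port in flows)
for label, count in counts.items():
    print(f"{label}: {count}")
```

In recursive mode, hits in the "plain DNS to other server" bucket are expected; any hit in the "plain DNS to system resolver" bucket is the leak in question.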

I want to make sure this leak issue is also acknowledged.

Quote from: Drinyth on May 21, 2025, 12:47:47 PM
Quote from: meyergru on May 21, 2025, 11:47:22 AM
I can create a band-aid or manually configured variant, that works for me, as well, but I think normal users should have an option that is supported via the GUI.

For what it's worth, I'm using AGH that is listening on port 53 and forwarding queries for local and reverse domains to dnsmasq running on a different port. So similar to having unbound running on port 53 and handling everything non-local. AGH can forward to upstream providers using DoT or DoH. It is also available to configure via the GUI.

Granted, it's not part of the default opnsense offering and one has to add Mimugmail's repo to enable it.

I am also using AGH running on port 53. So I thought it would be a simple enough fix to bypass Unbound and use AGH to forward local domains and rDNS queries to Dnsmasq, while everything else goes out over DoT (or even DoH).

Unfortunately, even when configured correctly (I think), there are still lookup timeouts with nslookup. The only thing that stops them is adding a DNS entry (1.1.1.1) in System > Settings > General. As soon as that's added, 1.1.1.1:53 traffic starts appearing in the firewall logs.

Whilst I originally thought this was only Windows boxes that had the issue, I've seen the same on Linux (Ubuntu Server 24.04 LTS) also.

Taking another look at the configuration in System > Settings > General, there is an option to NOT use the local DNS service as the nameserver for this system. It was initially unchecked, so I removed 1.1.1.1 from the DNS servers list and checked this box to not use local DNS.

With AGH configured as above, no delays were noticed on any client running nslookup and no traffic to 1.1.1.1:53 was observed in the firewall logs - as to be expected.

Is anybody able to test this when Unbound is providing DNS duties?

@OPNenthu

Can you try:

opnsense-patch https://github.com/opnsense/core/commit/220dbc7931
Afterwards in dnsmasq General Settings in DNS Forwarding select:

"Do not forward to system defined DNS servers" and apply.

With this, it should not matter which nameservers are defined by the system: dnsmasq will not forward anything it does not know (if no other forwarders are defined in its own Server tab), but will reply with a failure right away, which should speed up DNS lookups for Windows clients even if they first ask with a suffix appended.

May 21, 2025, 04:39:39 PM #57 Last Edit: May 21, 2025, 04:45:59 PM by OPNenthu
@Monviech- patch applied.  Will run through some scenarios and post back in some time. 

A few clarifying questions from early in this thread, to make sure the results are correct:

1) In order to do proper resolution of a host name on a different Dnsmasq domain (say client A on home.internal wants to resolve client B on iot.internal), do I need to add an additional 'domain-search[119]' DHCP option to offer 'iot.internal' as a search domain to A? 

2) Is it required in Dnsmasq to create a separate static range (start/end address) for hosts that do not take dynamic leases but are in the same subnet?

3) Is it expected that a static Host entry with MAC+IP+domain will register with DNS and be resolved?
    3a) Also for clients that have static IPs configured on the host itself and do not take DHCP offers?

4) Do static Host entries always get the system default domain in the DHCP offer, or should they get the domain which is listed in the Host entry?
"The power of the People is greater than the people in power." - Wael Ghonim

Site 1 | N5105 | 8GB | 256GB | 4x2.5GbE i226-v
Site 2 |  J4125 | 8GB | 256GB | 4x1GbE i225-v

1. No. Option 15 is sent automatically which includes the domain suffix to use. Option 119 is not needed.
2. Not always. If you have a dynamic range and specify static leases in it they will be sent out too. But the lease could be already taken and your static lease will not work then.
3. Yes, if it's in /var/etc/dnsmasq-hosts
4. They get the domain that is defined in /var/etc/dnsmasq-hosts, and if you leave it empty they register with the domain of the dhcp-range instead.

Quote from: Monviech (Cedrik) on May 21, 2025, 05:11:26 PM
2. Not always. If you have a dynamic range and specify static leases in it they will be sent out too. But the lease could be already taken and your static lease will not work then.

All clear, except #2 because I asked incorrectly.  My fault.

What I meant to ask:  If the dhcp-range is, for example, 192.168.1.100 - 192.168.1.199, but I have a Host entry for a client at 192.168.1.20 (outside of the dhcp-range), will this work automatically?  Or does it require a dedicated static range (and if so, how do I create it)?
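
To spell out the scenario in the question, here is a tiny Python check using the standard `ipaddress` module. It only shows that the Host entry's address sits in the same subnet but outside the dynamic range; it says nothing about how Dnsmasq treats that case. The /24 prefix is my assumption, since only the range was given.

```python
import ipaddress


def in_dhcp_range(ip, start, end):
    """True if ip lies within the inclusive start..end address range."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end)


RANGE_START = "192.168.1.100"
RANGE_END = "192.168.1.199"
SUBNET = ipaddress.ip_network("192.168.1.0/24")  # assumed prefix length
HOST_ENTRY = "192.168.1.20"

print("in subnet:    ", ipaddress.ip_address(HOST_ENTRY) in SUBNET)
print("in dhcp-range:", in_dhcp_range(HOST_ENTRY, RANGE_START, RANGE_END))
```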

Thanks